XML Tips and Techniques: September 2005

Thursday, September 15, 2005

When To Use Tag Attributes Instead of Child Elements in XML Documents

In my last post, I showed how we could represent an XML file in a more compact format. One of the techniques I employed was to organize "records" under their common key. In my example, I used the IP address of visitors to my website/blogsites as the common key. I took what was originally an XML attribute, clientip, in each <serverlogentry> element, and moved it up into an attribute of a new <visitor> element, so that each address is stored only once. I did the same with the date attribute of the <serverlogentry> elements, grouping the entries under a new <date> element whose attribute holds the shared date.

The fact is, XML is a very open standard. I could have rearranged and regrouped the original WSML (Web Server Markup Language) file in numerous different ways, depending on my application requirements. Whether I save a value as an XML tag attribute or as an element is really up to my needs. It's arguable, but I feel that attributes are used primarily for human convenience. It's easier to glance at raw XML and see what a tag's attributes are, especially if you use a tool like Microsoft's Internet Explorer browser to view the XML (see halfway down this linked post). Using attributes instead of elements wherever possible does save file space, but the XML 1.0 specification lists among its design goals that terseness in XML markup is of minimal importance.
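To make the trade-off concrete, here is a minimal Python sketch that builds the same three values once as attributes and once as child elements, then compares the serialized lengths. It uses the standard library's xml.etree.ElementTree; the variable names are mine, and the values are taken from the WSML sample in the September 3 post.

```python
import xml.etree.ElementTree as ET

FIELDS = [
    ("clientip", "151.203.201.149"),
    ("date", "29/Aug/2005"),
    ("time", "23:44:40"),
]

# Attribute style: the values live on the element's start tag.
as_attributes = ET.Element("serverlogentry", dict(FIELDS))

# Element style: each value becomes a child element instead.
as_elements = ET.Element("serverlogentry")
for name, value in FIELDS:
    ET.SubElement(as_elements, name).text = value

attr_xml = ET.tostring(as_attributes, encoding="unicode")
elem_xml = ET.tostring(as_elements, encoding="unicode")

print(attr_xml)
print(elem_xml)
print(len(attr_xml), "characters as attributes;", len(elem_xml), "as elements")
```

The element form repeats every name in both an opening and a closing tag, which is where the extra characters come from. Either form carries the same information to a parser.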

(c) Copyright 2005-present, Raj Kumar Dash, http://xml-tips.blogspot.com

Technorati : XML, child element, tag attribute

6:31 PM

Wednesday, September 14, 2005

Revised Server Log XML Format (PDF Tutorial)

In the last post, I discussed what I'm calling WSML (Web Server Markup Language), which is in XML format. The WSML format is used for temporary XML files that transfer web server log data from the Extended Log Format into a PHP application (which I have yet to write, or to post about on my PHP blog).

The current WSML format is inefficient. As I mentioned in the previous XML Tips post, a WSML file representing the same information as the source web server access log takes up almost twice as much file space. There are data redundancies we could eliminate. In database lingo, we need to "normalize" the WSML-formatted log data. Note that because the WSML format is used for temporary files, normalizing the XML document structure may not be of any benefit. However, I'll do so anyway, as the techniques in this post can be used to normalize any XML file that carries consistent, predictable redundancies. The general principle is to "collapse" the XML document-tree nodes so that common child nodes accumulate under a particular category. This will become clear with an example.

The remainder of this post is in a PDF file [38 KB]. This is an experiment to see whether a combination of blog posts and PDF tutorials is an efficient way of discussing technical concepts. If you have any comments about this method, please drop me a line at rdash001-at-yahoo-dot-ca (email mangled to fool spambots).
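The PDF itself is not reproduced here, so the following is only a sketch of the "collapse" idea in Python, not the tutorial's own method. The <visitor> and <date> element names follow the description in the September 15 post; the value attribute name, the <visitors> wrapper element, and the group_by_visitor function are assumptions made for illustration. The third log entry in the sample data is invented so that one client IP repeats.

```python
import xml.etree.ElementTree as ET

# Entries 1 and 2 follow the sample WSML; entry 3 is invented here so that
# one client IP repeats and there is something to collapse.
SAMPLE = """<logentries>
  <serverlogentry id="1" clientip="151.203.201.149" date="29/Aug/2005" time="23:44:40" tzone="-0400">
    <requri>/blogs/blogspinner/myblog-gantt-03.jpg</requri>
  </serverlogentry>
  <serverlogentry id="2" clientip="68.96.55.133" date="30/Aug/2005" time="01:05:56" tzone="-0400">
    <requri>/blog/closeup-verysm.jpg</requri>
  </serverlogentry>
  <serverlogentry id="3" clientip="151.203.201.149" date="29/Aug/2005" time="23:44:52" tzone="-0400">
    <requri>/blogs/blogspinner/index.html</requri>
  </serverlogentry>
</logentries>"""

HOISTED = ("clientip", "date")  # attributes that move up to the grouping elements


def group_by_visitor(logentries):
    """Collapse entries into visitor -> date -> serverlogentry."""
    root = ET.Element("visitors")  # hypothetical wrapper element
    visitors = {}  # clientip -> <visitor> element
    dates = {}     # (clientip, date) -> <date> element
    for entry in logentries.iter("serverlogentry"):
        ip, day = entry.get("clientip"), entry.get("date")
        if ip not in visitors:
            visitors[ip] = ET.SubElement(root, "visitor", {"clientip": ip})
        if (ip, day) not in dates:
            dates[(ip, day)] = ET.SubElement(visitors[ip], "date", {"value": day})
        # Keep only the attributes that still vary from entry to entry.
        kept = {k: v for k, v in entry.attrib.items() if k not in HOISTED}
        slim = ET.SubElement(dates[(ip, day)], "serverlogentry", kept)
        slim.extend(list(entry))  # reuse the original child elements
    return root


grouped = group_by_visitor(ET.fromstring(SAMPLE))
print(ET.tostring(grouped, encoding="unicode"))
```

Each client IP and each date is now written once per group instead of once per entry, which is the same saving a database gets from moving a repeated column into its own table.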

Technorati : XML, XML tips, access log, web server

8:09 PM

Saturday, September 03, 2005

Web Server Log Data/ Web Server Markup Language - WSML

Note: This posting is directly related to my Perl-Tips blog. I'll shortly be putting up some Perl scripts there to parse web server log files. Please keep an eye on that blog if you are interested.

In the process of parsing a web server log file to analyze data about visitors to my blogs, I found myself using a temporary XML file to transfer information between a command-line Perl script and PHP web scripts. (The how and why of this is at my Perl-Tips blog.) So I came up with a very simple XML-based markup language to describe the records of a web server log. A sample is shown below. Note that the data below is based on an "Extended Format" log file. This is similar to the NCSA Standard format, but also includes the referring web page and the user agent (type of web browser or other software used to "visit" the page). Microsoft log files follow a slightly different format and are not discussed here.

<?xml version="1.0" ?>
<serverlog_partial domain="chameleonintegration.com">
  <logentries>
    <serverlogentry id="1" clientip="151.203.201.149" date="29/Aug/2005" time="23:44:40" tzone="-0400">
      <method>GET</method>
      <protocol>HTTP/1.1</protocol>
      <status>200</status>
      <bytes>73616</bytes>
      <requri>/blogs/blogspinner/myblog-gantt-03.jpg</requri>
      <referer>http://blogspinner.blogspot.com/</referer>
      <useragent>Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6</useragent>
    </serverlogentry>
    <serverlogentry id="2" clientip="68.96.55.133" date="30/Aug/2005" time="01:05:56" tzone="-0400">
      <method>GET</method>
      <protocol>HTTP/1.1</protocol>
      <status>200</status>
      <bytes>3282</bytes>
      <requri>/blog/closeup-verysm.jpg</requri>
      <referer>-</referer>
      <useragent>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)</useragent>
    </serverlogentry>
  </logentries>
</serverlog_partial>

The root XML element is defined by the <serverlog_partial> tag, which has the single attribute domain. This isn't strictly necessary, but I have multiple web domains that I'll be analyzing in the future. (I like to think ahead for future use of a software component.) Currently, the only child element of <serverlog_partial> is <logentries>, which has no attributes. The <logentries> element can have one or more <serverlogentry> elements. Each such element has five attributes:

id - This is a unique id that will be used in the server log database to identify each record. (My SQL-Tips blog is not live yet. Watch the "My Tech Blogs" links list in the right column.)
clientip - This is the IP address of the visitor. Some web servers record the hostname here instead, depending on how the server is configured, but that requires a DNS lookup for each record, which is a waste of resources.
date - This is the date of a visit.
time - This is the time of a visit.
tzone - The time zone of your web server machine, as an offset from GMT (Greenwich Mean Time). For example, I live in zone -0500, but my web server is in -0400, or east of me by one zone.
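Because date, time, and tzone are stored as three separate attributes, a consumer usually has to recombine them before sorting or comparing visits. A minimal Python sketch follows, assuming the attribute values always look like those in the sample above; the entry_timestamp function name is hypothetical.

```python
from datetime import datetime, timezone


def entry_timestamp(date, time, tzone):
    """Combine the date, time, and tzone attribute values into one aware datetime."""
    # %b matches English month abbreviations such as "Aug" in the default C locale.
    return datetime.strptime(f"{date} {time} {tzone}", "%d/%b/%Y %H:%M:%S %z")


stamp = entry_timestamp("29/Aug/2005", "23:44:40", "-0400")
print(stamp.isoformat())                           # 2005-08-29T23:44:40-04:00
print(stamp.astimezone(timezone.utc).isoformat())  # 2005-08-30T03:44:40+00:00
```

Converting to UTC this way makes entries from servers in different zones directly comparable.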

The <serverlogentry> element has several child elements, none of which have any attributes:

<method> - This is the method by which the page was requested. It is usually GET or POST, but there are other values I won't discuss here.
<protocol> - This is the version of HTTP used for the request. The only real reason I am saving this value is for posterity. [I'm a data junkie.]
<status> - This is the web server status code of the page request. A "200" is successful. A "404" is unsuccessful. There are other codes which I'll discuss in the Perl-Tips blog at a later time.
<bytes> - This is the exact number of bytes that resulted from a page request, whether it was successful or not.
<requri> - This is the URI of the requested page. Note that a URI, or Uniform Resource Identifier, may be different from a URL, or Uniform Resource Locator. In particular, at least with my web server, the "http://" portion is missing, and the values of requri are relative to the web server root directory.
<referer> - This is the web page from which the visitor clicked a link to request the current requri value. This value is extremely useful in data mining techniques. It tells you where the visitor came from (i.e., how they found your page). It can also tell you whether different advertising campaigns are successful or not.
<useragent> - This is the web browser or other software that the current visitor used to request this requri. In some cases, the operating system of the visitor's computer is also recorded. The useragent value is also extremely valuable for data mining. For example, it tells you which browser is most popular amongst your visitors. It also tells you which search engines are indexing your site.
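As a rough illustration of how a consumer might read these attributes and child elements, here is a Python sketch using the standard library's xml.etree.ElementTree. The summarize function is hypothetical, and the embedded document is an abbreviated copy of the sample above with <protocol> and <useragent> left out for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Abbreviated copy of the sample above (protocol and useragent omitted).
WSML = """<?xml version="1.0" ?>
<serverlog_partial domain="chameleonintegration.com">
  <logentries>
    <serverlogentry id="1" clientip="151.203.201.149" date="29/Aug/2005" time="23:44:40" tzone="-0400">
      <method>GET</method>
      <status>200</status>
      <bytes>73616</bytes>
      <requri>/blogs/blogspinner/myblog-gantt-03.jpg</requri>
      <referer>http://blogspinner.blogspot.com/</referer>
    </serverlogentry>
    <serverlogentry id="2" clientip="68.96.55.133" date="30/Aug/2005" time="01:05:56" tzone="-0400">
      <method>GET</method>
      <status>200</status>
      <bytes>3282</bytes>
      <requri>/blog/closeup-verysm.jpg</requri>
      <referer>-</referer>
    </serverlogentry>
  </logentries>
</serverlog_partial>"""


def summarize(root):
    """Tally status codes, bytes, and referring pages for one WSML document."""
    statuses, referers, total_bytes = Counter(), Counter(), 0
    for entry in root.iter("serverlogentry"):
        statuses[entry.findtext("status")] += 1
        size = entry.findtext("bytes", default="")
        if size.isdigit():  # skip a missing or non-numeric byte count
            total_bytes += int(size)
        referer = entry.findtext("referer")
        if referer and referer != "-":  # "-" marks a request with no referrer
            referers[referer] += 1
    return {
        "domain": root.get("domain"),
        "statuses": statuses,
        "total_bytes": total_bytes,
        "referers": referers,
    }


summary = summarize(ET.fromstring(WSML))
print(summary["domain"], summary["total_bytes"], dict(summary["statuses"]))
```

Attributes are read with get() and child elements with findtext(), which mirrors the split between per-entry identity (who and when) and request detail (what) in the format.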

One thing to note is that this XML file is nearly twice as large as the original web server log file. So it's only a temporary data state in my web server log analysis system. I chose the XML layout shown above both to keep the file size down and for human convenience. In particular, if you view a syntactically correct XML file in an MS Internet Explorer browser, it gives you a display that allows you to expand and contract each node, or level, of markup. The two snapshots below illustrate. The first snapshot shows an XML file displayed in Internet Explorer with all the nodes expanded. In the second snapshot, the first few nodes have been collapsed.

Notice that in the collapsed nodes, you can see at a glance which visitor each <serverlogentry> element represents, and what the date, time and zone was. For this reason, and to reduce file size, I used attributes in <serverlogentry> instead of making each value an XML element. [I support OpenSource software, but sometimes there is functionality in commercial software that isn't found elsewhere.]

Again, don't forget that I'll soon post the Perl code that generates WSML files over at my Perl-Tips blog.
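That Perl code is not reproduced here. As a stand-in, the following Python sketch shows one way a single log line could be turned into a <serverlogentry> element. It is not the author's script: the regular expression, the line_to_entry function, and the sample line (reconstructed from the first entry of the WSML sample, assuming the common "combined" layout with two unused fields after the IP address) are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

# One named group per WSML attribute or child element.
LINE_RE = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ '
    r'\[(?P<date>[^:]+):(?P<time>[^ ]+) (?P<tzone>[+-]\d{4})\] '
    r'"(?P<method>\S+) (?P<requri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<useragent>[^"]*)"'
)

ATTRIBUTES = ("clientip", "date", "time", "tzone")
CHILDREN = ("method", "protocol", "status", "bytes", "requri", "referer", "useragent")

# Reconstructed from entry 1 of the WSML sample; the real log line is not shown in the post.
SAMPLE_LINE = (
    '151.203.201.149 - - [29/Aug/2005:23:44:40 -0400] '
    '"GET /blogs/blogspinner/myblog-gantt-03.jpg HTTP/1.1" 200 73616 '
    '"http://blogspinner.blogspot.com/" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6"'
)


def line_to_entry(line, entry_id):
    """Build a <serverlogentry> element from one log line."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("unrecognized log line: " + line)
    attrs = {"id": str(entry_id)}
    attrs.update((name, match.group(name)) for name in ATTRIBUTES)
    entry = ET.Element("serverlogentry", attrs)
    for name in CHILDREN:
        ET.SubElement(entry, name).text = match.group(name)
    return entry


entry = line_to_entry(SAMPLE_LINE, 1)
print(ET.tostring(entry, encoding="unicode"))
```

ElementTree escapes any special characters in the values when it serializes, which matters for referrer URLs that contain an ampersand.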

5:40 PM

XML-Tips Blog Starts

Many of the Internet-related programming problems that I solve involve either parsing data or massaging it in some way. I find that XML is a very handy means of structuring information to be reused. As I tend to use more than one language (usually Perl and PHP), I need a means of communicating data between these languages. WDDX is one method that I may talk about on one of my tech blogs in the future, but for now I'll leave it alone.

For some posts, this blog will be used jointly with one of my programming blogs (e.g., Perl-Tips or PHP-Tips). Other times, there'll be independent posts here. Most of the time, I'll outline actual case studies that I've used in real-life Internet-related problem solving. As I say on my other programming blogs, if you have a particular question, please feel free to drop me a line at rdash001-at-yahoo-dot-ca (email mangled to confuse spambots). If I can answer your question, I will do so.

5:36 PM
