XML Tips and Techniques

Tuesday, October 18, 2005

New Focus and URL For Tech/Web Programming + Analysis Blogs

This blog is about to be incorporated into a single blog called "WebGuru" that will be available at my new geekSchool/MathGurus Online website (http://www.mathgurusonline.com).

The WebGuru blog will contain posts about webmastering, web programming, and website analysis in general. This includes tips and techniques for Perl, PHP, XML, CSS, MySQL, JavaScript, data mining/net metrics/web analytics, geo-plotting, RSS and more. As such, most of my technical blogs (listed in the right-hand column) are being merged into WebGuru. [Some blogs are moving elsewhere, and will be announced later.]

WebGuru will contain both "prerequisite" and "problem-solving" topics. I do not like the terms "beginner" and "advanced" for categorizing programming tips, as they sometimes scare away people who are unsure. My categorization has nothing to do with your age or your programming skills, just your knowledge of a particular web topic. Basically, if you find that you don't understand one of my "problem-solving" posts, go have a look at some of the "prerequisite" posts to either refresh your memory or learn some basic skills.

For the first few weeks, there'll be an emphasis on "prerequisite" posts so that I can later get into more complex web programming. Just watch the geekSchool website (http://www.mathgurusonline.com) for a link to WebGuru. As soon as the new blog is ready, a "WebGuru" link will appear in the left-hand navigation. Please note that my older web programming/ analysis blogs will stay as is. No more posts will be made, and commenting will be shut off, but the URL will persist.

I have 60-70 posts sketched out and want to complete a few before I go live with WebGuru (hopefully later this week or early next week). For those of you that have been searching for Perl programming/scripting tips, I have over 35 Perl tips sketched out, with more in the works.

See you at geekSchool soon.

raj

Technorati : geek school, web programming, web scripting, webmaster school, webmastering tutorials

7:15 PM | Permalink

Thursday, October 13, 2005

Back From Vacation

Hello everyone. I'm back from vacation. (Hey, I still worked 16-20 hours a day on my blogs and websites, so it wasn't really a vacation). I apologize for the batch posting of this message to all of my blogs, but I'm still madly reorganizing my blogs and this is the fastest way for me to communicate with readers... (The most current links to most of my blogs and website projects can always be found at my main website, http://www.chameleonintegration.com/.)

This is a somewhat lengthy post, but if you read any of my blogs with any frequency, my recommendation is that you read it. Otherwise, just keep visiting the blog(s) you're interested in :D.

I have several new websites, including a social awareness site, that I launched during the last two weeks. Some of them are still being tweaked (design and architecture). I'm also in the process of moving some blogs, amalgamating other blogs, and creating a few new ones. I have nearly 200 blog posts sketched out across all of my blogs, but not all of these posts are in publishable format. So I do have tons of content planned, including some free ebooks, tutorials, and more. I'm just one person doing all of this, so please bear with me while I'm reorganizing.

By the way, I do try to check what people are searching for and then try to write a post relating to such topics (if I don't already have some such posts). I don't consider myself a blog network per se. I'll be straight out honest and say that I want to provide free information about several topics (food, technology, entertainment, and more), and then hope that (legitimate) ad revenue supports my writing and blogging habit. I'm a former print magazine publisher and editor, so blogs are my transition into the digital realm. My experience as a former search engine webmaster and as a programmer rounds my skills out. So blogging and websites are my ideal way to spend the day. So I'm making it my business to write about what you are looking for information on, provided it falls within my areas of interest or expertise. That said, there are a few blogs on my books that I'll be collaborating on with others, including family members, friends, and acquaintances.

So the scope of the "Chameleon Integration Systems" (CIS) blogs is expanding. I just have to keep it manageable so I can increase quality. The blog page templates I'm using will be changing on many of my blogs as I change blogging platforms. For those that are curious, I currently use Blogger.com, WordPress and MovableType. I'll be trying out Mambo, bMachine, and others as well. Why all the platforms? Well, I have close to a decade of experience evaluating very high end ($500,000-$2,000,000) CMSes (Content Management Systems) for many large companies. Now I'm focusing on OpenSource solutions, specifically on software that can help bloggers set up both blogs and regular websites, plus online shopping. My "Chameleon Integration" motto is "Making the Internet Easy". So I'll be writing about my findings, for those that are interested.

Finally, just a note about blog posting schedules. I will not be posting on Sundays (I live in North America, time zone -0500, the same zone as New York and Toronto). Sunday will be the day that I analyze stats, design new web pages, sketch out the next week's worth of posts, and basically unwind. While I am aiming at posting daily to most blogs, I am still doing a lot of infrastructure work, so I won't be up to speed right away. I'll be posting some entries later today, but I probably won't be posting to every blog (new and old) until next week or the week after. So I'll try to keep "current events" information posted at my main website, http://www.chameleonintegration.com/. I hope you'll visit again, and drop off comments about what you'd like to see information on.

cheers,

raj kumar dash

1:17 PM | Permalink

Thursday, September 15, 2005

When To Use Tag Attributes Instead of Child Elements in XML Documents

In my last post, I showed how we could represent an XML file in a more compact format. One of the techniques I employed was to organize "records" under their common key. In my example, I used the IP address of visitors to my website/blogsites as the common key. I took what was originally an XML attribute, clientip, in each <serverlogentry> element, and grouped the entries under a new <visitor> element that carries clientip as its attribute. I also took the date attribute of the <serverlogentry> elements and grouped the entries under a new element, <date>, with an attribute value containing the shared date.

The fact is, XML is a very open standard. I could have rearranged and regrouped the original WSML (Web Server Markup Language) file in numerous different ways, depending on my application requirements. Whether I save a value as an XML tag attribute or as an element is really up to my needs. It's arguable, but I feel that attributes are used primarily for human convenience. It's easier to glance at raw XML and see what a tag's attributes are, especially if you use a tool like Microsoft's Internet Explorer browser to view the XML (see halfway down this linked post). While using attributes instead of elements wherever possible does save file space, the design goals in the XML spec state that terseness in XML markup is of minimal importance.
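To make the trade-off concrete, here is a minimal sketch in Python (not the Perl or PHP code these posts refer to) that stores the same two values once as attributes and once as child elements. The element layout in AS_ELEMENTS and the read_field helper are hypothetical, made up for this illustration only.

```python
import xml.etree.ElementTree as ET

# The same log record in two layouts. The second layout is a hypothetical
# alternative, not something the WSML format actually uses.
AS_ATTRIBUTES = '<serverlogentry clientip="68.96.55.133" date="30/Aug/2005"/>'
AS_ELEMENTS = (
    "<serverlogentry>"
    "<clientip>68.96.55.133</clientip>"
    "<date>30/Aug/2005</date>"
    "</serverlogentry>"
)

def read_field(entry, name):
    """Return a field whether it was stored as an attribute or a child element."""
    if name in entry.attrib:
        return entry.attrib[name]
    return entry.findtext(name)

attr_entry = ET.fromstring(AS_ATTRIBUTES)
elem_entry = ET.fromstring(AS_ELEMENTS)

# Both layouts carry the same information.
for field in ("clientip", "date"):
    print(field, read_field(attr_entry, field), read_field(elem_entry, field))

# The attribute layout is the shorter of the two.
print(len(AS_ATTRIBUTES), len(AS_ELEMENTS))
```

Either layout parses to the same values, so the choice comes down to file size and how readable you want the raw XML to be.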

(c) Copyright 2005-present, Raj Kumar Dash, http://xml-tips.blogspot.com

Technorati : XML, child element, tag attribute

6:31 PM | Permalink

Wednesday, September 14, 2005

Revised Server Log XML Format (PDF Tutorial)

In the last post, I discussed what I'm calling WSML (Web Server Markup Language), which is in XML format. The WSML format is used for temporary XML files that transfer web server log data from the Extended Log Format into a PHP application (which I have yet to write, or to post about on my PHP blog).

The current WSML format is inefficient. As I mentioned in the previous XML Tips post, a WSML file representing the same information as the source web server access log takes up almost twice as much file space. There are data redundancies we could eliminate. In database lingo, we need to "normalize" the WSML-formatted log data. Note that because the WSML format is used for temporary files, normalizing the XML document structure may not be of any benefit. However, I'll do so anyway, as the techniques in this post can be used to normalize any XML file that carries consistent, predictable redundancies. The general principle is to "collapse" the XML document-tree nodes so that common child nodes accumulate under a particular category. This will become clear with an example.

The remainder of this post is in a PDF file [38 Kb]. This is an experiment to see whether a combination of blog posts and PDF tutorials is an efficient way of discussing technical concepts. If you have any comments about this method, please drop me a line at rdash001-at-yahoo-dot-ca (email mangled to fool spambots).
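For readers without the PDF, the "collapse" idea can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the contents of the PDF: the <visitor> and <date> element names come from the September 15 post, the "value" attribute name on <date> is a guess, the normalize function is hypothetical, and the third log entry is invented so that one visitor has two records.

```python
import xml.etree.ElementTree as ET

# Trimmed WSML input. Entry 3 is invented for illustration only.
WSML = """<serverlog_partial domain="chameleonintegration.com">
  <logentries>
    <serverlogentry id="1" clientip="151.203.201.149" date="29/Aug/2005" time="23:44:40" tzone="-0400">
      <status>200</status>
    </serverlogentry>
    <serverlogentry id="2" clientip="68.96.55.133" date="30/Aug/2005" time="01:05:56" tzone="-0400">
      <status>200</status>
    </serverlogentry>
    <serverlogentry id="3" clientip="68.96.55.133" date="30/Aug/2005" time="01:06:02" tzone="-0400">
      <status>404</status>
    </serverlogentry>
  </logentries>
</serverlog_partial>"""

def normalize(root):
    """Collapse entries so each clientip and each date is stored only once."""
    out = ET.Element(root.tag, root.attrib)
    visitors = {}  # clientip -> <visitor> element
    dates = {}     # (clientip, date) -> <date> element
    for entry in root.iter("serverlogentry"):
        attrs = dict(entry.attrib)
        ip = attrs.pop("clientip")
        day = attrs.pop("date")
        if ip not in visitors:
            visitors[ip] = ET.SubElement(out, "visitor", clientip=ip)
        if (ip, day) not in dates:
            # "value" is an assumed attribute name for the grouped date.
            dates[(ip, day)] = ET.SubElement(visitors[ip], "date", value=day)
        slim = ET.SubElement(dates[(ip, day)], "serverlogentry", attrs)
        slim.extend(list(entry))  # carry the child elements over unchanged
    return out

normalized = normalize(ET.fromstring(WSML))
print(ET.tostring(normalized, encoding="unicode"))
```

The more records a visitor has on the same date, the more repeated clientip and date attributes this grouping removes.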

(c) Copyright 2005-present, Raj Kumar Dash, http://xml-tips.blogspot.com

Technorati : XML, XML tips, access log, web server

8:09 PM | Permalink

Saturday, September 03, 2005

Web Server Log Data/ Web Server Markup Language - WSML

Note: This posting is directly related to my Perl-Tips blog. I'll shortly be putting up some Perl scripts there to parse web server log files. Please keep an eye on that blog if you are interested.

In the process of parsing a web server log file to analyze visitor data to my blogs, I found myself using a temporary XML file to transfer information between a command-line Perl script and PHP web scripts. (The how and why of this is at my Perl-Tips blog.) So I came up with a very simple XML-based markup language to describe the records of a web server log. A sample is shown below. Note that the data below is based on an "Extended Format" Log File. This is similar to the NCSA Standard format, but also includes the referring web page and the user agent (type of web browser or other software used to "visit" the page). Microsoft log files follow a slightly different format and are not discussed here.

<?xml version="1.0" ?>
<serverlog_partial domain="chameleonintegration.com">
  <logentries>
    <serverlogentry id="1" clientip="151.203.201.149" date="29/Aug/2005" time="23:44:40" tzone="-0400">
      <method>GET</method>
      <protocol>HTTP/1.1</protocol>
      <status>200</status>
      <bytes>73616</bytes>
      <requri>/blogs/blogspinner/myblog-gantt-03.jpg</requri>
      <referer>http://blogspinner.blogspot.com/</referer>
      <useragent>Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6</useragent>
    </serverlogentry>
    <serverlogentry id="2" clientip="68.96.55.133" date="30/Aug/2005" time="01:05:56" tzone="-0400">
      <method>GET</method>
      <protocol>HTTP/1.1</protocol>
      <status>200</status>
      <bytes>3282</bytes>
      <requri>/blog/closeup-verysm.jpg</requri>
      <referer>-</referer>
      <useragent>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)</useragent>
    </serverlogentry>
  </logentries>
</serverlog_partial>

The root XML element is defined by the <serverlog_partial> tag, which has the single attribute domain. This isn't strictly necessary, but I have multiple web domains that I'll be analyzing in the future. (I like to think ahead for future use of a software component.) Currently, the only child element of serverlog_partial is <logentries>, which has no attributes. The <logentries> element can have one or more <serverlogentry> elements. Each such element has 5 attributes:

id - This is a unique id that will be used in the server log database to identify each record. (My SQL-Tips blog is not live yet. Watch the "My Tech Blogs" links list in the right column.)
clientip - This is the IP address of the visitor. For some web servers, this is actually the hostname, depending on how the server is configured. But that requires a DNS lookup for each record, which is a waste of resources.
date - This is the date of a visit.
time - This is the time of a visit.
tzone - The time zone of your web server machine, relative to GMT, Greenwich Mean Time. For example, I live in zone -0500, but my webserver is in -0400, or East of me by one zone.
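Because date, time and tzone are stored as three separate attributes, a consuming script has to put them back together before it can compare or sort visits. A minimal Python sketch of that step follows. The entry_timestamp function is a hypothetical helper, and the month abbreviation parsing assumes an English locale.

```python
from datetime import datetime, timezone

def entry_timestamp(date, time, tzone):
    """Combine the WSML date, time and tzone attributes into one aware datetime."""
    # %b parses "Aug" under an English (or default C) locale.
    return datetime.strptime(f"{date} {time} {tzone}", "%d/%b/%Y %H:%M:%S %z")

# Attribute values taken from the first sample <serverlogentry>.
stamp = entry_timestamp("29/Aug/2005", "23:44:40", "-0400")
print(stamp.isoformat())                           # 2005-08-29T23:44:40-04:00
print(stamp.astimezone(timezone.utc).isoformat())  # 2005-08-30T03:44:40+00:00
```

Converting to UTC like this makes it easier to merge logs from servers that sit in different time zones.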

The <serverlogentry> element has several child elements, none of which have any attributes:

<method> - This is the method by which the page was requested. It is usually GET or POST, but there are other values I won't discuss here.
<protocol> - This is the protocol and version (for example, HTTP/1.1) used for the request. The only real reason I am saving this value is for posterity. [I'm a data junkie.]
<status> - This is the web server status code of the page request. A 200 means the request was successful. A 404 means the requested page was not found. There are other codes which I'll discuss in the Perl-Tips blog at a later time.
<bytes> - This is the number of bytes sent back in response to a page request, whether it was successful or not.
<requri> - This is the URI of the requested page. Note that a URI, or Uniform Resource Identifier, may be different from a URL, or Uniform Resource Locator. In particular, at least with my web server, the "http://" and domain name portion is missing, and the values of requri are relative to the web server root directory.
<referer> - This is the web page from which the visitor clicked a link to request the current requri value. This value is extremely useful in data mining techniques. It tells you where the visitor came from (i.e., how they found your page). It can also tell you whether different advertising campaigns are successful or not.
<useragent> - This is the web browser or other software that the current visitor used to request this requri. In some cases, the operating system of the visitor's computer is also recorded. The useragent value is also extremely valuable for data mining. For example, it tells you which browser is most popular amongst your visitors. It also tells you which search engines are indexing your site.
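The mapping from a raw log line to the attributes and child elements listed above can be sketched like this in Python. It is only an illustration, not the Perl script promised for the Perl-Tips blog: the regular expression assumes the common "combined" layout of an Extended Format line, the raw line is reconstructed from the second sample entry, and LOG_LINE and to_serverlogentry are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Assumed layout: ip ident user [date:time zone] "method uri protocol"
# status bytes "referer" "useragent"
LOG_LINE = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ '
    r'\[(?P<date>[^:]+):(?P<time>[^ ]+) (?P<tzone>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<requri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<useragent>[^"]*)"'
)

ATTRIBUTES = ("clientip", "date", "time", "tzone")
CHILDREN = ("method", "protocol", "status", "bytes", "requri", "referer", "useragent")

def to_serverlogentry(line, entry_id):
    """Turn one log line into a <serverlogentry> element, or None if it doesn't match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    entry = ET.Element("serverlogentry", id=str(entry_id))
    for name in ATTRIBUTES:
        entry.set(name, fields[name])
    for name in CHILDREN:
        ET.SubElement(entry, name).text = fields[name]
    return entry

# Raw line reconstructed from the second sample entry (an assumption).
sample = ('68.96.55.133 - - [30/Aug/2005:01:05:56 -0400] '
          '"GET /blog/closeup-verysm.jpg HTTP/1.1" 200 3282 "-" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"')
entry = to_serverlogentry(sample, 2)
print(ET.tostring(entry, encoding="unicode"))
```

Lines that don't fit the assumed layout return None, so a caller can count or log them rather than write a broken record.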

One thing to note is that this XML file is nearly twice as large as the original web server log file. So it's only a temporary data state in my web server log analysis system. The XML format shown above is designed both to keep the file size down and for human convenience. In particular, if you view a well-formed XML file in an MS Internet Explorer browser, it gives you a display that allows you to expand and collapse each node, or level, of markup. The two snapshots below illustrate. The first snapshot shows an XML file displayed in Internet Explorer with all the nodes expanded. In the second snapshot, the first few nodes have been collapsed.

Notice that in the collapsed nodes, you can see at a glance which visitor each <serverlogentry> element represents, and what the date, time and zone was. For this reason, and to reduce file size, I used attributes in <serverlogentry> instead of making each value an XML element. [I support OpenSource software, but sometimes there is functionality in commercial software that isn't found elsewhere.]
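If you don't have Internet Explorer handy, a similar at-a-glance view can be approximated with a few lines of Python. This is just a sketch: collapsed_view is a hypothetical helper, and the WSML string is a trimmed copy of the sample above with most child elements left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample WSML; most child elements are omitted.
WSML = """<serverlog_partial domain="chameleonintegration.com">
<logentries>
<serverlogentry id="1" clientip="151.203.201.149" date="29/Aug/2005" time="23:44:40" tzone="-0400">
<status>200</status>
</serverlogentry>
<serverlogentry id="2" clientip="68.96.55.133" date="30/Aug/2005" time="01:05:56" tzone="-0400">
<status>200</status>
</serverlogentry>
</logentries>
</serverlog_partial>"""

def collapsed_view(xml_text):
    """Return one summary line per entry, built only from its attributes."""
    root = ET.fromstring(xml_text)
    lines = []
    for entry in root.iter("serverlogentry"):
        a = entry.attrib
        lines.append(f'{a["id"]}: {a["clientip"]} {a["date"]} {a["time"]} {a["tzone"]}')
    return lines

for line in collapsed_view(WSML):
    print(line)
```

Because the identifying values are attributes, the summary never has to look inside the child elements.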

Again, don't forget that I'll soon post the Perl code that generates WSML files over at my Perl-Tips blog.

(c) Copyright 2005-present, Raj Kumar Dash, http://xml-tips.blogspot.com

5:40 PM | Permalink
