HTML -> XML conversion with XSLT and visual tools

Step 1. Converting input HTML pages into XHTML with Tidy
Step 2. HTML structure investigation with Merlot
Step 3. Checking XPath expressions with The XPath Visualizer
Step 4. Applying XSLT and testing the result

 
 
Fighting chaos, case 1: XPath concat() function
Fighting chaos, case 2: <xsl:choose> element
Fighting chaos, case 3: descendant-or-self (//) axis
 
 
Our bunkhouse is spread across 16 HTML files with a total size of about 500 KB. Page layout is done with tables, which makes the corresponding HTML structure rather regular. However, the files were maintained by different people at different times, and as a result some peculiarities appear at the lower levels of the structure. For example, inside <td> tags the target information is normally placed in a sequence of <a><font> tags, but sometimes the sequence is reversed and we get a <font><a> sequence. XSLT is flexible enough to write expressions and instructions that match both "regular" and "irregular" structures, but investigating such cases by reading raw HTML code would be tedious and boring. Free visual tools supplied by the XML community turn this job into an exciting data hunt.
 
Step 1. Converting input HTML pages into XHTML with Tidy  
To use the full power of XSLT, we have to give the XSLT processor well-formed XML as input. Of course, our input files were far from well-formed, so we first needed to "clean" them. I did this with Tidy, Dave Raggett's free tool, currently hosted by W3C. It reads an HTML file and converts it into a well-formed XHTML document, which can be processed as XML. Let's take a look at a real example:

Input HTML:

<TD rowspan="2">
<a target="amazon" href="http://www.amazon.com/exec/obidos/ASIN/0201616467/electricporkchop">
<IMG height=140 src="images/practical.jpg" width=112 ></A>
</TD>


Problems:
1. The opening tag <a> is written in lower case, but its closing tag </A> is in upper case
2. The <IMG> tag has no matching closing tag
3. The values of the height and width attributes are not enclosed in quotes

Tidy output:

<td rowspan="2">
<a target="amazon" href="http://www.amazon.com/exec/obidos/ASIN/0201616467/electricporkchop">
<img height="140" src="images/practical.jpg" width="112" /></a>
</td>


All problems are magically fixed.
 
Step 2. HTML structure investigation with Merlot
 
Take a look at the source of any bunkhouse page and you will think that it's easier to get lost in the HTML jungle than to say anything meaningful about its structure. We can take advantage of the fact that XML editors can visualize the document structure as a tree. For this purpose I used Merlot, an open source Java™-based XML editor. (Remark: the colorful markup in the picture is not part of Merlot and was added for better visualization.)
Looking at the tree created by Merlot, it's easy to see that our information is placed in table 3, and to write an XPath expression that matches it:

<xsl:template match="html/body/table[3]">

The same kind of information is repeated in every row of the table, so we iterate through the rows using the <xsl:for-each> instruction. Inside the template we have:

<xsl:for-each select="./tr">

Further investigation shows that each book is described by a group of three rows. Information about a book's title and author, for example, is located in the first row. To check whether the current row is the first of a triple, we can use the XPath position() function:

<xsl:if test="position() mod 3 = 1">
 
 
Let's find where our "title-to-be" element is located. I expanded the <tr> tag to investigate its inner structure. A book's title is nested inside the tags <tr>, <td>, <a>, <font>, in that order, and the author inside the tags <tr>, <td>. We can check this by looking at the HTML code:

<!-- Practical Java -->
<tr align="left">
<td colspan="3" height="20">
<a target="amazon"
href="http://www.amazon.com/exec/obidos/ASIN/0201616467/electricporkchop">
<font size="5">Practical Java</font>
<br /></a> by Peter Haggar</td>
</tr>


Our XPath expression for titles looks like this: ./td/a/font/text()
(Here "." refers to the current element, which is <tr>.)
And for authors we have: ./td[1]/text()
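
Put together, the fragments from this step could form a stylesheet along the following lines. This is only a minimal sketch, not the stylesheet actually used for the bunkhouse: the output element names <books>, <book>, <title> and <author> are assumptions chosen for illustration, and the sketch assumes that the Tidy output carries no default namespace (otherwise the element names in the expressions would need a prefix bound to the XHTML namespace).

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Start at the document root and process only table 3, so the
       built-in rules do not copy text from the rest of the page -->
  <xsl:template match="/">
    <books>
      <xsl:apply-templates select="html/body/table[3]"/>
    </books>
  </xsl:template>

  <xsl:template match="html/body/table[3]">
    <xsl:for-each select="./tr">
      <!-- position() counts rows within the for-each selection;
           rows 1, 4, 7 and so on open a new three-row group -->
      <xsl:if test="position() mod 3 = 1">
        <book>
          <title>
            <xsl:value-of select="./td/a/font/text()"/>
          </title>
          <author>
            <xsl:value-of select="./td[1]/text()"/>
          </author>
        </book>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Note that <xsl:value-of> outputs the string value of the first node selected, so the author text is taken as found in the page, including the leading "by".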
   
Step 3. Checking XPath expressions with The XPath Visualizer

Another great visual tool is The XPath Visualizer. It is designed specifically for debugging and experimenting with XPath expressions during XSLT stylesheet development. Load your XML file, enter an expression and step through the elements that match it (they are marked with a yellow background). It is a great help in investigating patterns and tuning matching expressions.



 
Step 4. Applying XSLT and testing the result  
 
Once all the XPath expressions looked good enough, I ran the XSLT stylesheet and loaded the resulting XML file into Merlot for a thorough check. The XPath Visualizer is a big help when working with XML input, but it shows the input code "as is", and in our case that was almost unreadable HTML. It was easy to overlook an element that should have been "caught" by an XPath expression but was missed because the expression was flawed. In contrast, the output in our case was highly regular XML, and Merlot showed it as a neat tree, making it easy to spot missing or incorrect parts. For example, the expression matching the books' authors worked fine for two files, but in the output for the third file some of the <author> elements came out empty.

Fighting chaos, case 1: XPath concat() function


<td colspan="3"><a target="amazon"
href="http://www.amazon.com/exec/obidos/ASIN/
188477749X/electricporkchop">
<font size="5">Java Network Programming</font></a>
_<br />
by by Hughes, Merlin / Shoffner, Michael / Hamner, Derek</td>

Looking at the input, we notice that the problem lies in the <br/> tag, which splits the content of the <td> element into two parts. When used in <xsl:value-of>, the XPath expression ./td/text() yields only the first text node, the blank between the </a> and <br/> tags (marked as "_" here), and leaves out the second part with the actual information. To access the second part of the <td> content, we should use the text()[2] construct. To write a more generic XPath expression, the concat() function can be used:

<xsl:value-of select="concat(./td[1]/text(), ./td[1]/text()[2])"/>

If the content is not split, the text()[2] expression simply gives us an empty string. Finally, to get rid of extraneous whitespace, the normalize-space() function can be used:
<xsl:value-of select="normalize-space(concat(./td[1]/text(), ./td[1]/text()[2]))"/>
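
If a cell could be split into more than two text nodes, the same idea can be generalized. The following is only a sketch of one possible alternative, not the approach used for the bunkhouse pages: it collects every text node that is a direct child of the first <td> into a variable and normalizes the result once. The variable name raw and the <author> wrapper are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Matches every <tr>; a real stylesheet would restrict this to
       the first row of each three-row group -->
  <xsl:template match="tr">
    <!-- Concatenate all text nodes that are direct children of the
         first cell, however many pieces the content is split into -->
    <xsl:variable name="raw">
      <xsl:for-each select="./td[1]/text()">
        <xsl:value-of select="."/>
      </xsl:for-each>
    </xsl:variable>
    <author>
      <!-- Trim the ends and collapse internal runs of whitespace -->
      <xsl:value-of select="normalize-space($raw)"/>
    </author>
  </xsl:template>

</xsl:stylesheet>
```

Text inside child elements such as <a> or <font> is not included, because text() selects only the text nodes directly under <td>.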
 

Fighting chaos, case 2: <xsl:choose> element

Book titles presented another problem: sometimes the target information was the content of a <font> tag enclosed in an <a> tag (variant 1), and sometimes it was the content of an <a> tag enclosed in a <font> tag (variant 2). The <xsl:choose> element came to the rescue and accommodated this diversity.

Variant 1:
<td colspan="3" height="20">
<a target="amazon"
href="http://www.amazon.com/exec/obidos/ASIN/0131103628/electricporkchop">
<font size="5">The C Programming Language</font><br />
</a> by Mark Williams Company</td>

Matching XPath expression: ./td[1]/a/font/text()

Variant 2:
<td colspan="3" height="20">
<font size="5">
<a target="amazon"
href="http://www.amazon.com/exec/obidos/ASIN/0201325829/electricporkchop">
Programming and Deploying Java Mobile Agents with Aglets</a></font><br />
by Danny B. Lange, Mitsuru Oshima</td>

Matching XPath expression: ./td[1]/font/a/text()

Solution: <xsl:choose>
Here we simply test which variant we have encountered and choose the appropriate XSLT instruction.

<xsl:choose>
	<xsl:when test="./td[1]/a/font">
		<title>
			<xsl:value-of select="normalize-space(./td[1]/a/font/text())"/>
		</title>
	</xsl:when>
	<xsl:when test="./td[1]/font/a">
		<title>
			<xsl:value-of select="normalize-space(./td[1]/font/a/text())"/>
		</title>
	</xsl:when>
</xsl:choose>
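
For comparison, the two branches can also be folded into a single expression with the XPath union operator (|). This is a hedged sketch of an alternative, not the stylesheet used here. It relies on the fact that converting a node-set to a string uses its first node in document order, and it takes the string value of the element rather than its text() child, which gives the same result for the two variants shown above.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Matches every <tr>; a real stylesheet would restrict this to
       the first row of each three-row group -->
  <xsl:template match="tr">
    <title>
      <!-- The union holds whichever element exists: font inside a
           (variant 1) or a inside font (variant 2) -->
      <xsl:value-of
          select="normalize-space(./td[1]/a/font | ./td[1]/font/a)"/>
    </title>
  </xsl:template>

</xsl:stylesheet>
```

One difference in behavior: when neither variant is present, this version still writes an empty <title> element, whereas the <xsl:choose> version writes nothing at all.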

Fighting chaos, case 3: descendant-or-self (//) axis
Yet another problem came with the ISBN-to-be element. The target information was located in the "href" attribute of the <a> tag, but the <a> tag itself could occur either as a direct child of the <td> tag (variant 1) or as a child of <font> (variant 2), the same problem that complicated the mining of book titles.

Variant 1
<td colspan="3" height="20">
<a target="amazon"
href="http://www.amazon.com/exec/obidos/ASIN/0131103628/electricporkchop">
<font size="5">The C Programming Language</font><br />
</a> by Mark Williams Company</td>

Matching XPath expression would be ./td[1]/a/@href

Variant 2
<td colspan="3" height="20">
<font size="5">
<a target="amazon" 
href="http://www.amazon.com/exec/obidos/ASIN/0201633469/electricporkchop">
TCP/IP Illustrated, Volume 1</a>
</font><br />
by W. Richard Stevens</td>

Here we have a <font> tag between <td> and <a>. We could write ./td[1]/font/a/@href, but there is a better solution: ./td[1]//a/@href. By putting // between <td> and <a> we are saying "select all <a> tags that are descendants of the <td> tag, regardless of how deeply they are nested". This XPath expression matches both variants above.
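
The href value is a full URL, so one more step is needed to isolate the number itself. The sketch below is an assumption about how this could be done, not a description of the original stylesheet: it treats the path segment that follows /ASIN/ in the URLs shown above as the ISBN, and the element name <isbn> and the variable name href are chosen only for illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Matches every <tr>; a real stylesheet would restrict this to
       the first row of each three-row group -->
  <xsl:template match="tr">
    <!-- href attributes of all <a> elements at any depth below the
         first cell; string functions use the first one found -->
    <xsl:variable name="href" select="./td[1]//a/@href"/>
    <isbn>
      <!-- Keep what lies between "/ASIN/" and the next "/" -->
      <xsl:value-of
          select="substring-before(substring-after($href, '/ASIN/'), '/')"/>
    </isbn>
  </xsl:template>

</xsl:stylesheet>
```

For the variant 2 URL above, substring-after() leaves 0201633469/electricporkchop and substring-before() then cuts it down to 0201633469.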