HTML

Extracting useful data from HTML pages with XQuery

When building in-house solutions or mobile enterprise applications, you often have to deal with legacy systems and data. In some ancient systems, the data might only be available as CSV files; in other cases it might be locked in arcane fixed-length text reporting formats. But if the legacy system is less than 20 years old, chances are pretty good that someone built an HTML front-end, so the data is available through a browser interface that renders it in poorly formatted HTML that only loosely follows the standard. Very likely you will also find the data intermixed with formatting and other information, so extracting the useful data is usually not as easy as it sounds.

In addition, when you are building mobile solutions, you may sometimes need government data that is not yet available in XML or another structured format, so you are again faced with having to extract that information from HTML pages.

Common approaches to extracting data from HTML pages, such as screen-scraping and tagging, are cumbersome to implement and very susceptible to changes in the underlying HTML.

In this video demo I want to show you a better way of extracting useful and reusable data from HTML pages. In less than 15 minutes we will build a mobile solution that, as an example, takes Consumer Price Index data from the US Bureau of Labor Statistics, parses and normalizes the HTML page, and then uses an XQuery expression to build nicely structured XML data from the HTML table, which can then be reused to build a CPI chart. I will walk you through the creation of the XQuery expression step by step so that you can easily apply this method to similar problems of HTML data extraction:



As you can see in the above video, it was fairly easy to create nicely structured XML data from a table in the HTML page and to create a first simple chart that plots the CPI data over time.
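For readers who cannot watch the video, the same pipeline (parse the sloppy HTML, normalize the table into rows, then emit structured XML) can be sketched outside MobileTogether. The snippet below is only a rough Python analogue and not the XQuery expression built in the demo: the sample HTML, the index values in it, the `cpi` and `value` element names, and the helper names `TableRows` and `table_to_xml` are all made up for illustration and do not reflect the actual layout of the BLS page.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical, deliberately sloppy HTML (unquoted attribute, unclosed cells).
# The numbers are invented placeholders, not real CPI data.
SAMPLE_HTML = """
<table id=cpi>
  <tr><th>Year<th>Jan<th>Feb<th>Mar
  <tr><th>2014<td>100.0<td>100.5<td>101.0
  <tr><th>2015<td>101.5<td>102.0<td>102.5
</table>
"""


class TableRows(HTMLParser):
    """Collect the cell text of every table row, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def _end_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()  # a new row implicitly closes the previous one
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()  # a new cell implicitly closes the previous one
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_xml(html):
    """Turn a year-by-month HTML table into <cpi><value .../></cpi> XML."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    months = header[1:]  # first header cell is the "Year" label
    root = ET.Element("cpi")
    for year, *values in body:
        for month, value in zip(months, values):
            if value:  # skip empty cells, e.g. months not yet published
                ET.SubElement(root, "value", year=year, month=month).text = value
    return root


print(ET.tostring(table_to_xml(SAMPLE_HTML), encoding="unicode"))
```

The point of the sketch is the separation of concerns: once the table has been normalized into one `value` element per year and month, any later step (charting, filtering, calculations) works against clean XML rather than against the page markup.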

But the true power of this approach is that you have much more flexible charting capabilities in MobileTogether and the XML data is now reusable, so you can calculate annual inflation rates directly from the underlying CPI data and plot them as well.
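The calculation itself is simple arithmetic: the annual inflation rate is the percentage change of the index relative to the previous year. As a minimal sketch in Python (the demo uses XPath inside MobileTogether instead), and assuming one representative index value per year, the derivation could look like the following. The function name `annual_inflation`, the year-range parameters, and the sample values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def annual_inflation(cpi_by_year, first_year, last_year):
    """Return {year: percent change versus the previous year} for a year range.

    Years whose own value or previous-year value is missing are skipped.
    """
    rates = {}
    for year in range(first_year, last_year + 1):
        previous = cpi_by_year.get(year - 1)
        current = cpi_by_year.get(year)
        if previous and current:
            rates[year] = (current / previous - 1.0) * 100.0
    return rates


# Invented placeholder values, not real CPI data.
CPI = {2012: 100.0, 2013: 102.0, 2014: 105.06}

for year, rate in annual_inflation(CPI, 2012, 2014).items():
    print(f"{year}: {rate:.1f}%")
```

Note that the first year in the data never gets a rate because there is no earlier value to compare it with, which is why an overlay series derived this way starts one year later than the underlying index series.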

In this next video demo I want to show you how to do just that in less than 10 minutes. We will add a year-range selector to our chart so that we can define which years to plot, and we will add an overlay chart that derives the annual inflation rate from the underlying CPI data using XPath calculations and then plots that data:



Using this technique, you can not only extract data from individual HTML pages, but also easily build a modern mobile front-end experience for the many legacy systems that currently offer only an HTML-based browser interface. This will make your workforce a lot more productive and efficient, as they can now use a friendly mobile app to access your system rather than having to deal with a couple of HTML pages and forms in a browser on their tiny smartphone screens.

Altova StyleVision In-Depth Review

Dave Gash published an in-depth review of Altova StyleVision 2010 on the WritersUA website this past week and says in his introduction:

Altova calls StyleVision a "stylesheet designer," but that technically accurate designation doesn't really do the software justice. They could have called it a "schema-based WYSIWYG drag-and-drop XML / XBRL / database visual page editor and XSLT / XSL-FO / HTML / RTF / PDF / Word / e-forms generator," but I'm guessing that wouldn't have made it past the suits in Marketing.

I like that new product description. It’s a bit of a mouthful, but it certainly gets to the point. Really, we couldn’t have said it any better…

Dave follows this introduction with a detailed review of the design method, user interface, formatting, and output options, and covers all the exciting new capabilities of version 2010, such as the new blueprint capability.

After going over all the relevant features, Dave comes to the following conclusion:

StyleVision is one of the most interesting software applications I've seen in years. Without question, it offers a new and unique approach to XSLT transform authoring, a skill formerly reserved for beanie-wearing, pocket-protector using, syntax-obsessing code jockeys such as your humble reviewer. It allows more of the tech pubs workforce than ever to transform raw data into aesthetic, useful pages.

While some coders might lament the loss of a previously proprietary skill set to non-programmers, the fact is that spreading knowledge around is a good thing. Make no mistake: as more people use a technology, the better that technology becomes, and StyleVision's application of the WYSIWYG concept to XSLT is a shining example.

We are delighted to hear that! Please check out Dave’s review and then download a free 30-day eval version to see for yourself.