Monday, October 6, 2014

Extracting useful data from HTML pages with XQuery

When building in-house solutions or mobile enterprise applications, you are often faced with having to deal with legacy systems and data. In some ancient systems, the data might only be available as CSV files, in other cases it might be arcane fixed-length text reporting formats, but if the legacy system is less than 20 years old, chances are pretty good that someone built and HTML front-end and so the data is available through a browser interface that renders it in some poorly formatted HTML code that loosely follows the standard. And very likely you will find the data intermixed with formatting and other information, so extracting the useful data is usually not as easy as it sounds.

In addition, when you are building mobile solutions, you may sometimes need some government data that is not yet available in XML or another structured format, so you again are faced with having to extract that information from HTML pages.

Common approaches to extracting data from HTML pages, such as screen-scraping and tagging are cumbersome to implement and very susceptible to changes in the underlying HTML.

In this video demo I want to show you a better way of extracting useful and reusable data from HTML pages. In less than 15 minutes we will build a mobile solution that - as an example - takes Consumer Price Index data from the US Bureau of Labor Statistics, parses and normalizes the HTML page, and then uses an XQuery expression to build nicely structured XML data from the HTML table that can then be reused to build a CPI chart. I will walk you through the creation of the XQuery expression step-by-step so that you can easily apply this method to similar problems of HTML data extraction:

As you can see in the above video, it was fairly easy to create nicely structured XML data from a table in the HTML page and to create a first simple chart that plots the CPI data over time.

But the true power of this approach is that you have much more flexible charting capabilities in MobileTogether and the XML data is now reusable, so you can calculate annual inflation rates directly from the underlying CPI data and plot it as well.

In this next video demo I want to show you just how to do that in less than 10 minutes. We will add a year-range selector to our chart where we can define which years to plot, and we will add an overlay chart that derives the annual inflation rate based on the underlying CPI data using XPath calculations and the plots that data:

Using this technique, you can not only extract data from singular HTML pages, but easily build a modern mobile front-end experience for many legacy systems that just offer an HTML-based browser interface at present. This will enable you to make your workforce a lot more productive and efficient, as they can now use a friendly mobile app experience to access your system rather than having to deal with a couple of HTML pages and forms in a browser on their tiny smartphone screens.

No comments: