HTML5

Extracting useful data from HTML pages with XQuery

When building in-house solutions or mobile enterprise applications, you often have to deal with legacy systems and data. In some ancient systems, the data might only be available as CSV files; in other cases it might come in arcane fixed-length text reporting formats. But if the legacy system is less than 20 years old, chances are pretty good that someone built an HTML front-end, so the data is available through a browser interface that renders it in poorly formatted HTML that only loosely follows the standard. Very likely you will find the data intermixed with formatting and other information, so extracting the useful data is usually not as easy as it sounds.

In addition, when you are building mobile solutions, you may sometimes need government data that is not yet available in XML or another structured format, so you are again faced with extracting that information from HTML pages.

Common approaches to extracting data from HTML pages, such as screen-scraping and tagging, are cumbersome to implement and very susceptible to changes in the underlying HTML.

In this video demo I want to show you a better way of extracting useful and reusable data from HTML pages. In less than 15 minutes we will build a mobile solution that, as an example, takes Consumer Price Index data from the US Bureau of Labor Statistics, parses and normalizes the HTML page, and then uses an XQuery expression to turn the HTML table into nicely structured XML data that can be reused to build a CPI chart. I will walk you through the creation of the XQuery expression step by step so that you can easily apply this method to similar HTML data extraction problems:



As you can see in the above video, it was fairly easy to create nicely structured XML data from a table in the HTML page and to create a first simple chart that plots the CPI data over time.
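For readers who prefer text to video, here is a minimal sketch of what such an XQuery expression can look like. It is not the exact expression from the demo. The inline `$html` variable is a hypothetical stand-in for the parsed and normalized page, the table `id`, the output element names (`CPI`, `Year`, `Month`) and the numbers are all made up for illustration, and the sketch assumes the normalized HTML is not in a namespace.

```xquery
xquery version "3.0";

(: Hypothetical stand-in for the parsed, normalized HTML page.
   The values are invented for illustration; they are not real CPI figures. :)
declare variable $html :=
  <html>
    <body>
      <table id="cpi">
        <tr><th>Year</th><th>Jan</th><th>Feb</th><th>Mar</th></tr>
        <tr><td>2001</td><td>100.0</td><td>100.5</td><td>101.0</td></tr>
        <tr><td>2002</td><td>102.0</td><td>102.5</td><td>103.0</td></tr>
      </table>
    </body>
  </html>;

(: Every header cell after the first one supplies a month name. :)
declare variable $months :=
  for $th in $html//table[@id = "cpi"]//tr[th]/th[position() gt 1]
  return normalize-space($th);

<CPI>{
  (: Only rows that contain td cells carry data; the header row uses th. :)
  for $row in $html//table[@id = "cpi"]//tr[td]
  return
    <Year value="{normalize-space($row/td[1])}">{
      (: Pair each remaining cell with the header at the same position. :)
      for $cell at $i in $row/td[position() gt 1]
      return <Month name="{$months[$i]}">{normalize-space($cell)}</Month>
    }</Year>
}</CPI>
```

The result is one `Year` element per data row with one `Month` child per value. Because the expression keys on the table structure rather than on character offsets in the page, it tends to survive cosmetic changes to the HTML better than classic screen-scraping does.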

But the true power of this approach is that the XML data is now reusable and MobileTogether gives you much more flexible charting capabilities, so you can calculate annual inflation rates directly from the underlying CPI data and plot them as well.
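As a rough sketch of that calculation, the following expression derives a year-over-year rate from structured CPI data. It assumes the annual rate is defined as the percentage change of the yearly average CPI, which may differ from the definition used in the demo (a December-to-December comparison would be an equally reasonable choice). The inline `$cpi` data, its element names and its values are hypothetical.

```xquery
xquery version "3.0";

(: Hypothetical structured CPI data; the values are invented for illustration. :)
declare variable $cpi :=
  <CPI>
    <Year value="2001"><Month name="Jan">100.0</Month><Month name="Feb">101.0</Month></Year>
    <Year value="2002"><Month name="Jan">102.0</Month><Month name="Feb">104.0</Month></Year>
    <Year value="2003"><Month name="Jan">105.0</Month><Month name="Feb">106.0</Month></Year>
  </CPI>;

<Inflation>{
  for $year in $cpi/Year
  (: Find the entry for the preceding calendar year, if there is one. :)
  let $prev := $cpi/Year[xs:integer(@value) eq xs:integer($year/@value) - 1]
  let $cur-avg := avg(for $m in $year/Month return xs:decimal($m))
  let $prev-avg := avg(for $m in $prev/Month return xs:decimal($m))
  (: The first year has nothing to compare against, so it is skipped. :)
  where exists($prev)
  return
    <Rate year="{$year/@value}">{
      (: Percentage change of the yearly average, rounded to two decimals. :)
      round-half-to-even(($cur-avg - $prev-avg) div $prev-avg * 100, 2)
    }</Rate>
}</Inflation>
```

The same arithmetic can be written as a plain XPath expression over two adjacent years, which is closer to how a chart overlay would consume it.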

In this next video demo I want to show you how to do just that in less than 10 minutes. We will add a year-range selector to our chart so we can define which years to plot, and we will add an overlay chart that derives the annual inflation rate from the underlying CPI data using XPath calculations and then plots that data:



Using this technique, you can not only extract data from individual HTML pages, but also easily build a modern mobile front-end for the many legacy systems that currently offer only an HTML-based browser interface. This will make your workforce a lot more productive and efficient, as they can use a friendly mobile app to access your system rather than having to deal with a couple of HTML pages and forms in a browser on their tiny smartphone screens.

Tablet computers, video, HTML5, and the great Flash debate

Even if you are not always plugged into tech blogs or the latest social media networks, I have a short reading list for you for this weekend. There’s just a fascinating combination of interesting stories all happening in the same 48h period:

  1. HP drops the Slate project (=tablet PC running Windows 7 that was announced at CES earlier this year by Steve Ballmer)
    http://techcrunch.com/2010/04/29/hewlett-packard-to-kill-windows-7-tablet-project/
  2. Microsoft drops the Courier tablet project (=innovative folding screen tablet computer with both hand and pen input)
    http://gizmodo.com/5527442/microsoft-cancels-innovative-courier-tablet-project
  3. HP buys Palm and is rumored to be working on a tablet computer running Palm’s WebOS
    http://www.hp.com/hpinfo/newsroom/press/2010/100428xa.html
  4. Apple’s CEO Steve Jobs attacks Flash in an open letter on the Apple website and clearly speaks out in support of HTML5 and the H.264 video standard
    http://www.apple.com/hotnews/thoughts-on-flash/
  5. Adobe’s CEO Shantanu Narayen (who?) responds to the Steve Jobs letter in a TV interview with the Wall Street Journal (and offers very weak responses only – mostly cookie cutter style)
    http://blogs.wsj.com/digits/2010/04/29/live-blogging-the-journals-interview-with-adobe-ceo/
  6. Microsoft responds to the Apple-Adobe debate on the Internet Explorer Blog and also expresses support for HTML5 and H.264, but – in an attempt to not take sides – also states that “Flash remains an important part of delivering a good consumer experience on today’s web”.
    http://blogs.msdn.com/ie/archive/2010/04/29/html5-video.aspx
  7. Apple starts shipping the 3G version of the iPad in the US today
    http://www.apple.com/ipad/

To see all these things unfold in such a short period of time is quite fascinating, and thus far Apple and the iPad are the clear winners here…

Speaking of which: according to FedEx my two WiFi+3G iPads are already on the delivery truck today and should arrive at my house before 3pm.

Also, if you are interested in following more of these tech stories as they unfold in real time, check out http://techmeme.com/