Wednesday, January 30, 2008

Content reuse with Open XML and XSLT

While Open XML may not yet be an ISO standard, it is already standardized by ECMA and - even more important - all documents created by Office 2007 are already stored in Open XML by default, so there is an abundance of documents whose content you can now reuse much more easily and productively than ever before. So instead of waiting for the ISO vote or paying too much attention to all the political battles being fought around it, I want to show you how you can already take advantage of Open XML (sometimes also called OOXML or Office Open XML) today.

This is the first article in a series of blog postings that I plan to write about practical Open XML tips & tricks, so I encourage you to subscribe to my XML Aficionado blog (via RSS or via e-mail), if you haven't already done so. This will ensure that you get future articles from this series automatically as soon as I post them.

So let's look at an Open XML document in our favorite XML Editor. For this example I am going to use a WordprocessingML document (.docx) that I have created with Microsoft Office Word 2007. When I open the .docx file in XMLSpy, I immediately get to see the contents of the package file, which is structured according to the Open Packaging Conventions (OPC).

That's a fancy way of saying that it is a ZIP file that contains specific files and directories that make up the content, structure, styles, relationships, and other parts of the document. Using XMLSpy's built-in capability to open any ZIP-formatted archive, I can directly browse any directory structures inside the ZIP package, add new files to the package, or open any existing XML file contained in the package:


For the purpose of reusing the content from this WordprocessingML example file, I am going to open the 'document.xml' file, which contains the content of the document.

As soon as I double-click the file in the ZIP archive, the XML is displayed in a separate window just like any other XML document and I can use the powerful grid view or text view features of XMLSpy to view or edit the XML data (sometimes it may be useful to invoke the pretty-print function in text view to make the file more easily readable):


This is, of course, a live editing view, so you can not only view the Open XML data, but also make changes to the XML and save it back into the package file.

But now let's look at how we can easily reuse content from this Open XML document using XSLT. XMLSpy ships with a few Open XML example documents as well as example XSLT stylesheets for just that purpose. Let's look at the 'docx2html.xslt' stylesheet, which takes a WordprocessingML document and extracts all paragraphs to turn them into HTML. This example stylesheet is by no means intended to be a full-featured conversion tool from .docx to HTML. Instead it serves as a blueprint of how to reuse content from a .docx file and hopefully will serve as a starting point for your stylesheet development efforts.

At the core of that XSLT stylesheet we need an <xsl:for-each> loop to iterate over all the <docx:p> elements and turn each of them into a simple HTML <p> paragraph. The text inside the paragraphs is grouped into runs of characters that share common attributes, and so we need an inner <xsl:for-each> loop to iterate over those <docx:r> elements and extract the text from their <docx:t> child elements. Thus the most primitive content reuse that only extracts the text of all paragraphs looks like this:
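As a rough sketch of the two nested loops (this is not the docx2html.xslt file that ships with XMLSpy), a stylesheet along the following lines should do the job. It assumes that the input is the 'document.xml' part of the package and that the prefix docx is bound to the WordprocessingML main namespace:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:docx="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    exclude-result-prefixes="docx">

  <xsl:output method="html" indent="yes"/>

  <xsl:template match="/">
    <html>
      <body>
        <!-- Outer loop: one HTML <p> per WordprocessingML paragraph -->
        <xsl:for-each select="//docx:p">
          <p>
            <!-- Inner loop: every run contributes the text of its <docx:t> children -->
            <xsl:for-each select="docx:r">
              <xsl:value-of select="docx:t" separator=""/>
            </xsl:for-each>
          </p>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

The separator attribute requires an XSLT 2.0 processor; it keeps the text of a run together if the run happens to contain more than one <docx:t> element.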


Once we have constructed those loops, we can start to think about perhaps extracting and reusing some style information. To do that, we now emit a <span> HTML element for every <docx:r> run of characters and give it a style attribute, whose value will depend on the <docx:rPr> element, so we use <xsl:apply-templates> to decide what HTML style we want to apply to the <span> elements:
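A minimal sketch of that step, again under the assumption that the input is 'document.xml' and that docx is bound to the WordprocessingML main namespace, could look like the following. The style templates themselves are deliberately left out here, so on its own this stylesheet produces empty style attributes:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:docx="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    exclude-result-prefixes="docx">

  <xsl:output method="html" indent="no"/>

  <xsl:template match="/">
    <html>
      <body>
        <xsl:for-each select="//docx:p">
          <p>
            <xsl:for-each select="docx:r">
              <!-- One <span> per run of characters -->
              <span>
                <!-- The style value is whatever the templates matching
                     the children of <docx:rPr> emit as text -->
                <xsl:attribute name="style">
                  <xsl:apply-templates select="docx:rPr"/>
                </xsl:attribute>
                <xsl:value-of select="docx:t" separator=""/>
              </span>
            </xsl:for-each>
          </p>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

Because the children of <docx:rPr> are empty elements, the built-in template rules emit nothing for run properties that have no matching template, so unknown properties are simply ignored.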


The corresponding templates for the three most common styles (bold, italic, underline) are trivially easy to construct and look like this:
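Here is a sketch of three such templates, written as a small stylesheet module that can be merged into (or included from) a stylesheet that calls <xsl:apply-templates> on <docx:rPr> inside a style attribute. The CSS declarations are my own choice for illustration, and the sketch assumes that the mere presence of <docx:b>, <docx:i>, or <docx:u> means the style is switched on:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:docx="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    exclude-result-prefixes="docx">

  <!-- Each template emits one CSS declaration as plain text.
       Simplification: a val attribute that switches the property
       off again (e.g. "false" or "none") is not checked here. -->
  <xsl:template match="docx:rPr/docx:b">font-weight: bold; </xsl:template>

  <xsl:template match="docx:rPr/docx:i">font-style: italic; </xsl:template>

  <xsl:template match="docx:rPr/docx:u">text-decoration: underline; </xsl:template>

</xsl:stylesheet>
```

A run that is both bold and italic then ends up with a style value such as "font-weight: bold; font-style: italic; ", because the output of all matching templates is concatenated.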


With just a few lines of XSLT and a few templates we have already written a stylesheet that extracts the basic paragraphs and most important styles from a WordprocessingML document and turns them into HTML that can be viewed in the browser view - here is the result produced from running the above XSLT stylesheet on the example WordprocessingML document that you can find in the XMLSpy examples directory:


Similarly, it is quite easy to extend the stylesheet to extract meta information, other styles, or image information from the WordprocessingML document and reuse the content for any modern application scenario, from web publishing via HTML, RSS, or social media formats to mobile web applications and beyond.

"But wait! How can I apply an XSLT stylesheet to an XML document that is stored within a ZIP file?", you might ask.

You can, of course, extract all the XML files using a regular ZIP expander, but there is a much better solution: when you use the document() function in XSLT 2.0 within XMLSpy or with our royalty-free XSLT engine AltovaXML, you can directly access files contained in a ZIP archive by using the "|zip" pipe operator within the filename, e.g. "MyDocument.docx|zip\_rels\.rels" will address the Relationship file ".rels" in the archive directory "\_rels" inside the ZIP package with the file named "MyDocument.docx".
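To illustrate, here is a small sketch that lists the relationships stored in the package. The path string is copied verbatim from the example above, and the "|zip" syntax is specific to the Altova engines, so do not expect this to work in other XSLT processors. The file name MyDocument.docx is just a placeholder, and the stylesheet can be run against any input document since it only reads from the package:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rel="http://schemas.openxmlformats.org/package/2006/relationships"
    exclude-result-prefixes="rel">

  <xsl:output method="text"/>

  <!-- Altova-specific: address a file inside the ZIP package directly.
       The path is taken as-is from the text; adjust it to your document. -->
  <xsl:variable name="rels"
      select="document('MyDocument.docx|zip\_rels\.rels')"/>

  <xsl:template match="/">
    <!-- Print the type and target of every package relationship -->
    <xsl:for-each select="$rels/rel:Relationships/rel:Relationship">
      <xsl:value-of select="concat(@Type, ' -> ', @Target, '&#10;')"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```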

The benefits of using XSLT to reuse content from Open XML documents are obvious: because XSLT is a cornerstone of the core set of XML standards from the W3C, you can apply all your existing XML, XPath, and XSLT know-how and you can use the excellent tools support that is available for these standards. For example, you can easily develop and debug your XSLT stylesheet using the powerful XSLT debugger in XMLSpy, which allows you to single-step through the transformation, set breakpoints on XSLT instructions or even on data nodes in your Open XML document, view the partially generated output, and inspect the state of the XSLT processor in detail as the output document is constructed:


Using the XSLT Debugger eliminates a lot of the pain that is normally associated with XSLT stylesheet development and allows for a very iterative approach to creating and improving stylesheets that facilitate content reuse and repurposing.

To sum it up, reusing content from Open XML documents for a variety of web applications, mobile scenarios, or social media and Web 2.0 contexts is very easy and can be achieved with standard XML-related technologies, such as XSLT.

For additional information on Open XML and how to take advantage of all the content that is now already available in that format, please refer to the following sites:

Tuesday, January 29, 2008

Tesla Roadster takes shortcut when it comes to safety


I love technology. And I love cars. So when cool technology is used to build a cool car, such as the Tesla Roadster, I get excited. I've been following the Tesla story for the past couple of years, and the idea of an all-electric car, with the performance advantages and torque that come with it, is just impressive. I can't wait to see the first units hit the road - it's supposedly going to happen in Q1 this year and all 2008 production units are sold out.

However, Engadget reports today that the Tesla Roadster will not need to meet the advanced airbag requirements that are now common for gasoline-based vehicles. In Engadget's post Paul Miller writes:

"Apparently when you're saving the planet with an all-electric car, there's no need to kill yourself over safety. The Tesla Roadster has been granted a waiver in regards to advanced air bags by the NHTSA, since the 'public interest is served by encouraging the development of fuel-efficient and alternative-fueled vehicles.' Standard air bags are still included..."

That's a bit of a disappointment, since I don't like to see safety take a back seat, but I guess the reality of creating an electric car company from scratch is that there are some financial and regulatory hurdles in your way.

Let's hope that they will fix this issue for the 2009 model year - they are already taking wait-list reservations...

Monday, January 28, 2008

Using XML Catalogs in XMLSpy

Jerry Sheehan has recently posted a useful article "A simple way to re-direct schema locations in XMLSpy using XML Catalogs" on his XML Scoop blog. In it he describes how to extend the CustomCatalog.xml file in your installation directory to redirect PUBLIC or SYSTEM identifiers in DTDs as well as URI references in XML Schemas to local copies of frequently-used schemas or DTDs to reduce loading time. XML Catalog support in XMLSpy is based on OASIS XML Catalogs.
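For readers who have not worked with OASIS XML Catalogs before, this is roughly what such redirection entries look like. It is only a sketch: the identifiers and local file paths are made-up placeholders, and Jerry's article remains the reference for how the entries fit into XMLSpy's CustomCatalog.xml:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <!-- Redirect a PUBLIC identifier in a DOCTYPE to a local DTD copy
       (identifier and path are hypothetical) -->
  <public publicId="-//Example//DTD Order 1.0//EN"
          uri="file:///C:/schemas/order.dtd"/>

  <!-- Redirect a SYSTEM identifier to the same local copy -->
  <system systemId="http://www.example.com/dtd/order.dtd"
          uri="file:///C:/schemas/order.dtd"/>

  <!-- Redirect a URI reference, e.g. an XML Schema location -->
  <uri name="http://www.example.com/schemas/order.xsd"
       uri="file:///C:/schemas/order.xsd"/>

</catalog>
```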

Thursday, January 17, 2008

Binary and OOXML office formats - interesting news

Brian Jones has a nice post on (a) making the documentation for binary Office file formats available more easily and (b) mapping from binary formats to OOXML:

The binary formats documentation will be available publicly by February 15, 2008 and the file formats will also be subject to the Microsoft Open Specification Promise.

Regarding mapping from binary to OOXML formats (and back), Microsoft promised to start an open source effort on SourceForge for that purpose.

This also made it onto TechMeme today...

Monday, January 14, 2008

iPhone browser traffic disproportionate to market share

I've said it all along - the iPhone's UI and especially the Safari browser on the iPhone are a quantum leap over other existing smartphone technologies (e.g. Windows Mobile, Symbian, BlackBerry).

The NY Times has an article today on iPhone traffic on Google and confirms this by stating that despite a market share of only 2% (compared to 63% for Symbian and 11% for Windows Mobile) the majority of mobile browsing traffic on Google over Christmas came from iPhones - that is simply astounding: more than 50% of the traffic came from iPhones, which have only a 2% market share!

The article also cites an analyst opinion:

"The iPhone has taken the frustration out of browsing on a mobile phone," said Charles Wolf, an analyst with Needham & Company.

Related discussions and other blog links can be found on TechMeme, as well as in previous iPhone-related posts on this blog...

Saturday, January 12, 2008

OOXML and ODF report from Burton Group: "What's Up, .DOC?"

Burton Group has released a new report "What’s Up, .DOC? ODF, OOXML, and the Revolutionary Implications of XML in Productivity Applications" this week. Written by Peter O'Kelly (blog) and Guy Creese (blog), the report provides a deep and insightful analysis of the current state of ODF, OOXML, and other document formats (W3C, PDF), the history of those formats, and then continues to parse through the FUD and standardization games to arrive at a set of projections regarding the success of OOXML and ODF, as well as a set of practical recommendations.

I've read the report already, but am not going to spoil the fun for you and reveal all the conclusions here - the report is well worth your time and you should read it yourself! However, I will say this much: the report validates some of my thinking on the subject that I have expressed in various previous blog posts on OOXML here.

There is a quick overview of the report on the Burton Group's Collaboration and Content Strategies blog, and you can download the entire report for free after filling out a registration form.

Other early blog reactions to the new report are here:

And I am sure more blog reactions will follow next week.

As always, if you are interested in working with Office Open XML (OOXML) files, a great place to get started is to look at the OOXML support in XMLSpy and to download a free evaluation version of Altova's XML Editor and give it a try.

Thursday, January 10, 2008

Semantic Web Killer App - where art thou?

Alex Iskold has a nice blog post today about the simple, but elusive question "Semantic Web: What is the Killer App?"

He first looks at the holy grail of classic A.I. applications, such as natural language understanding and what he calls the "genie in a bottle", and then proceeds to look at more realistic apps, such as semantic knowledge databases, semantic search, social graphs, and shortcuts.

One area that he misses is in the enterprise integration space, where it is sometimes called semantic mapping, ontology mapping, or semantic integration.

Other useful discussions of possible application areas and interesting adoptions are here:

So it seems that we still need to wait a bit for the killer app of the Semantic Web. One thing is clear, though: whoever creates that killer app will probably be using Altova SemanticWorks for their RDF and OWL editing and ontology creation in the process. While the killer app is not here yet, the perfect developer tool to create that killer app already exists... :)