Wednesday, October 24, 2007

Relational Data as XML

Everything is fodder for argument with XML. Note these 50 pages of point-counterpoint that discuss nearly every aspect of XML's use and usability - it's a pretty quick read [http://www.zapthink.com/report.html?id=ZT-XMLPROCON].

The short of this entry is that storing data on the server-side as XML documents is a very flexible, readable, maintainable, and most importantly scalable option for a Web application. Document size issues that tend to accompany XML-based storage can be avoided by balancing how much goes into each document against how much is kept as external contextual data (i.e. indexing information). Performance and scalability for even the most heavily used applications can match those of any highly-trafficked Web site, if you keep in mind the metaphor of serving your XML documents as Web pages.

The traditional rule of thumb is to put "data" in a database and store documents on a file system. But breaking up a document's worth of data to shove into a database is not a difficult task, and composing a table or two into a document of information is also not difficult. Where things become complicated is when mounds of contextual metadata are generated to cope with relational data. Separating contextual data from content can give you immense power over what you serve, when you serve it, and how you manage it.

-----

There are many time-sinks and concerns when using XML to describe your data, a few of which are engineering the documents' structures, handling data serialization, performance issues, and information bloat. Before you can massage your data into XML, you need to know what structure the document should have - this is sometimes not trivial, especially when representing relational data, and will take time. Binary serialization is brittle and not always platform/hardware independent, so the data usually has to be serialized as String representations instead, which may also take time; complex objects need to be broken up into primitives to be represented nicely.

Once this is finished, one might immediately notice that some XML representations are thousands of lines long (I made this wonderful mistake more than once), slowing your parser and increasing the transfer time towards O(eternity); the more complex the object, the more contextual descriptions and metadata will be needed to describe it. Ugh.

Since there is (obviously) a connection between the size of your XML document and the time it takes to parse its content, a good way to increase performance is to break your data into finer levels of granularity. For instance, if you have a document representing the furniture in your house by room, you can easily break the document up into many documents, each representing one room.
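As a rough sketch of that split, assuming a made-up house document (the element and attribute names here are mine, purely for illustration), Python's standard xml.etree.ElementTree can pull each room out into its own standalone document:

```python
import xml.etree.ElementTree as ET

# Hypothetical house document; the element and attribute names are illustrative only.
HOUSE_XML = """
<house id="FooHouse">
  <room name="kitchen">
    <furniture type="table"/>
    <furniture type="chair"/>
  </room>
  <room name="bedroom">
    <furniture type="bed"/>
  </room>
</house>
"""


def split_by_room(house_xml):
    """Return a dict mapping each room's name to a standalone XML string for that room."""
    house = ET.fromstring(house_xml.strip())
    return {
        room.get("name"): ET.tostring(room, encoding="unicode").strip()
        for room in house.findall("room")
    }


room_docs = split_by_room(HOUSE_XML)
```

Each value in room_docs can then be written to its own file or served on its own, so a client that only cares about the kitchen never has to parse the bedroom.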

With each room separated out, a natural idea would be to add an additional tag under the "room" parent node (or as an attribute) that identifies which house it belongs to. If we break things down even further, and separate out each piece of furniture into its own document, we would need to add a tag that identifies which room each piece of furniture belongs in.
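A minimal sketch of the attribute variant, assuming a "house" attribute on the room element (that name is my own choice, not a standard):

```python
import xml.etree.ElementTree as ET


def tag_with_parent(room_xml, house_id):
    """Return the room document with a 'house' attribute naming its parent house."""
    room = ET.fromstring(room_xml)
    room.set("house", house_id)  # 'house' is a hypothetical attribute name
    return ET.tostring(room, encoding="unicode")


def parent_of(room_xml):
    """Read the parent house back out; note that this requires parsing the document."""
    return ET.fromstring(room_xml).get("house")


tagged_room = tag_with_parent('<room name="kitchen"/>', "FooHouse")
```

Notice that parent_of has to parse the whole room document just to answer a question about its context.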

This method of "contextualizing" XML documents has a problem. Since the information is internal to the document, the document must be parsed and searched each time its context needs to be determined. Additionally, this method only allows the "child" document to know who its "parent" is - a house would not know what furniture it has, only the furniture would know which house it belongs to. There are at least two ways to solve this, and the first is obvious - after breaking up a large document, build a "virtual" document that has references to its pieces and parts. This is not a terrible way of doing things - it can, however, lead to an enormous number of files.
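The "virtual" document could look something like the sketch below. The "roomref" element, the "href" attribute and the file paths are all assumptions of mine; any reference scheme would do.

```python
import xml.etree.ElementTree as ET


def build_virtual_house(house_id, room_files):
    """Build a small house document holding only references to its room documents."""
    house = ET.Element("house", {"id": house_id})
    for name, path in room_files.items():
        # 'roomref' and 'href' are hypothetical names for the reference element.
        ET.SubElement(house, "roomref", {"name": name, "href": path})
    return ET.tostring(house, encoding="unicode")


virtual_house = build_virtual_house(
    "FooHouse",
    {"kitchen": "rooms/foo-kitchen.xml", "bedroom": "rooms/foo-bedroom.xml"},
)
```

The house now knows its rooms without containing them, at the cost of one extra file per house (and per room, if furniture is split out the same way).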

The alternative is to externalize all "virtual resources" into one document, namely a large indexing file. So if multiple house documents - "FooHouse" and "BarHouse" - each have their rooms stored as separate files, a master document will list the identifiers of each house's room documents under that house's identifier. When a user requests the resource "FooHouse", the master document (which is presumably kept in memory for quick traversal) will be used either to assemble the full document, or - assuming a screen could not show every room's contents all at once - simply to serve each room document when requested.
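Here is one way the master document might be held in memory, sketched under the assumption of an index format I invented for the purpose ("resource" and "part" elements with "href" locations):

```python
import xml.etree.ElementTree as ET

# Hypothetical master index; the format and the paths are illustrative only.
MASTER_INDEX_XML = """
<index>
  <resource id="FooHouse">
    <part name="kitchen" href="rooms/foo-kitchen.xml"/>
    <part name="bedroom" href="rooms/foo-bedroom.xml"/>
  </resource>
  <resource id="BarHouse">
    <part name="attic" href="rooms/bar-attic.xml"/>
  </resource>
</index>
"""


def load_index(index_xml):
    """Parse the index once into {resource id: {part name: location}}."""
    root = ET.fromstring(index_xml.strip())
    return {
        res.get("id"): {p.get("name"): p.get("href") for p in res.findall("part")}
        for res in root.findall("resource")
    }


INDEX = load_index(MASTER_INDEX_XML)  # kept in memory for quick traversal


def locate(resource_id, part_name=None):
    """Return the locations for a whole resource, or for a single named part."""
    parts = INDEX[resource_id]
    if part_name is None:
        return list(parts.values())
    return [parts[part_name]]
```

A request for "FooHouse" becomes a dictionary lookup rather than a parse of every room document, and the locations could just as easily be database keys or URLs as file paths.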

This solution can scale well, requires no particular storage model (the master document could refer to tables in a database, files on a disk or even a URL) and allows a conceptual resource to be as complex as needed. It also allows for performance tuning, as the client can receive only the portions of the resource pertinent to their particular task.

Tuesday, October 23, 2007

It's the Internet, son

If you're reading this, you most likely know what TCP is. You also might know what UDP is, and what the differences between the two are. If you don't, read a little of the Wikipedia entry on the Internet [http://en.wikipedia.org/wiki/Internet], and skim the links about these network protocols. The Internet is the embodiment of these standards, and this is where we will start.

Web applications that we are concerned with are first and foremost applications, layers of software that provide some sort of service to a person or group of people. A Web application is only special insofar as it is deployed on the Internet and is accessible by contacting a hosting server via some sort of network browser.

"Right thing" [http://www.jwz.org/doc/worse-is-better.html] advocates would say the goal of a Web application is to provide the client the illusion that the application is native to their machine, providing seamless access to data and functionality that is, in actuality, housed on a set of machines somewhere far across the Internets. This is admirable, and should be the goal of all Web developers. They might continue on to say the interface should be simple, consistent, and complete at all costs, even if the implementation of the application suffers from complexity.

Of course, there are many open questions when we descend from our 10,000 ft. goal - if our data set is large, how do we serve that across millions of miles of cable without the user waiting for it? If the user has a high-latency satellite connection to the Internet, how do we get around her experiencing round-trip times in the seconds? How do we provide uniform access to resources across disparately performant physical media? ("Right thing" advocates are probably too busy maintaining, debugging and securing their RPC stub generators to answer these questions, so I'll do it for them.)

The obvious answer is that you can probably spend all of the world's software contractors' budgets combined and never be able to do things the "right way." The interface will never be simple enough to completely hide the fact that a Web application is deployed on the Internet. The less obvious answer is that we can still do things simply, chiefly because the Internet already has mechanisms in place to do this for us - remember TCP and UDP? TCP is the one HTTP used to serve this very Web page, this (hopefully X)HTML document. Together these protocols are the Postal Service of the Internet - with the power of the OSI model layers [http://en.wikipedia.org/wiki/OSI_model] combined, your documents can be delivered to and fetched from your clients the best way the Internet knows how - HTTP's GETs, POSTs and PUTs.

So why shouldn't your Web application make use of this document delivery service?

If your data is already in or is easily convertible to a Web-friendly XML or JSON format, you are in business. Boot up a Web server to serve up those documents, create a client-side application that can communicate via GETs, POSTs and PUTs (GWT [http://code.google.com/webtoolkit/] is a good, AJAX-y option) and you have an architecture that lies flat against the original design principles of the Internet.
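To make the server half concrete, here is a bare-bones sketch using Python's standard http.server module. The in-memory DOCUMENTS store, the URL paths and the make_server helper are my own assumptions, and a real deployment would want persistence, authentication and caching headers on top of this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory document store keyed by URL path.
DOCUMENTS = {"/houses/FooHouse/kitchen": b'<room name="kitchen"/>'}


class DocumentHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        """Serve a stored document, or 404 if the path is unknown."""
        body = DOCUMENTS.get(self.path)
        if body is None:
            self.send_error(404, "No such document")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_PUT(self):
        """Store the request body at the requested path (201 if new, 204 if replaced)."""
        length = int(self.headers.get("Content-Length", 0))
        created = self.path not in DOCUMENTS
        DOCUMENTS[self.path] = self.rfile.read(length)
        self.send_response(201 if created else 204)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; remove this to get request logging back


def make_server(port=0):
    """Create a server bound to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), DocumentHandler)
```

Calling make_server(8000).serve_forever() would then let any HTTP client GET or PUT room documents by URL, which is all the client-side application needs.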

Of course, you can optimize the Web server by designing your own file system to version, cache, and pre-fetch your documents. Or if you're less adventurous, you can download a lightweight servlet framework that can do much of this for you [http://www.restlet.org/].

This discussion is far from over, of course. The next post will discuss the benefits and drawbacks of maintaining data as XML documents, and how fine-tuning the granularity of contextual metadata will determine the performance of your application.