Large XML processing


I was reading a bit about the options available for processing large XML documents. To constrain the use case: what is the best way to process a large XML document (over 1 GB in size) that is basically an XML representation of a collection of records?

If the “process” involves reading each record, doing something with it and then discarding it, SAX should be a good option. The SAX parser fires an event for every XML node it encounters, and the event-handler code can construct each record as it is read and hand it over for processing.
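To make the SAX flow concrete, here is a minimal sketch using the JDK's built-in parser. The element names (`records`, `record`, `id`, `name`) and the class name `SaxRecordReader` are my own assumptions for illustration, and it assumes each record is flat (child elements containing only text).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxRecordReader {

    /** Streams the document and hands each completed <record> to the consumer. */
    static void forEachRecord(InputStream xml, Consumer<Map<String, String>> consumer)
            throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private Map<String, String> current;                  // record being built, or null
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("record")) {
                    current = new HashMap<>();
                }
                text.setLength(0);                                // start collecting fresh text
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);                   // may arrive in several chunks
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("record")) {
                    consumer.accept(current);                     // hand over, then drop the reference
                    current = null;
                } else if (current != null) {
                    current.put(qName, text.toString().trim());   // a field of the current record
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<records><record><id>1</id><name>a</name></record>"
                   + "<record><id>2</id><name>b</name></record></records>";
        forEachRecord(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                record -> System.out.println(record.get("id") + " " + record.get("name")));
    }
}
```

Only one record is held at a time, so memory use should stay roughly constant regardless of how many records the file contains.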

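If the “process” involves reading each record, doing something with it and then modifying some content in the file (say, a flag), StAX would be my API of choice, since SAX can’t write and DOM loads the whole document into memory. With StAX the document is streamed from a reader to a writer, so the modified content ends up in a new output rather than being edited in place.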
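A rough sketch of that read-and-write pass with the JDK's StAX event API follows. The `flag` element name and the `StaxFlagRewriter` class are assumptions of mine; the code copies every event to the output and swaps only the text inside `<flag>`.

```java
import java.io.Reader;
import java.io.Writer;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class StaxFlagRewriter {

    /** Copies the input to the output, replacing the text of every <flag> element. */
    static void rewriteFlags(Reader in, Writer out, String newValue) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();
        boolean insideFlag = false;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals("flag")) {
                insideFlag = true;
                writer.add(event);
                writer.add(events.createCharacters(newValue));   // the new flag value
            } else if (event.isEndElement()
                    && event.asEndElement().getName().getLocalPart().equals("flag")) {
                insideFlag = false;
                writer.add(event);
            } else if (!insideFlag) {
                writer.add(event);                               // copied through unchanged
            }
            // Events inside <flag> (the old value) are skipped.
        }
        writer.flush();
        writer.close();
        reader.close();
    }
}
```

For a real file, the `Reader` and `Writer` would wrap the source file and a separate target file; the original could then be replaced by the target once the pass completes.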

StAX also differs from SAX in that it is a pull-based model: the application code loops through the document looking for nodes that match certain criteria. (In SAX, by contrast, the parser takes control of the program flow and calls the application’s handlers.)
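The pull model is easiest to see with the cursor API, where the application owns the loop and can stop whenever it likes. This is a small sketch under my own assumptions: records carry hypothetical `id` and `status` attributes, and `StaxPullSearch` is just an illustrative name.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxPullSearch {

    /** Returns the id of the first <record> whose status attribute matches, or null. */
    static String firstIdWithStatus(Reader xml, String status) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                int event = reader.next();                        // the application asks for the next event
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("record")
                        && status.equals(reader.getAttributeValue(null, "status"))) {
                    return reader.getAttributeValue(null, "id");  // stop here; the rest is never read
                }
            }
            return null;                                          // no matching record
        } finally {
            reader.close();
        }
    }
}
```

Because the loop returns as soon as it finds a match, the remainder of the document is not parsed at all, which is harder to arrange with SAX callbacks.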

If the “process” requires multiple searches on the document based on various search criteria, plus occasional modifications to parts of the document, StAX does not work very well, because the entire document needs to be scanned for every search and update.

Option 1: DOM. But this causes a memory footprint of around 5 times the document size. Also, if the document updates need to be saved, the entire DOM tree has to be written back to disk; DOM cannot rewrite just a part of the XML file.
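For comparison, this is roughly what the DOM route looks like with the JDK's own parser, XPath and serializer. The document layout (`/records/record` with `id` and `flag` children) and the `DomSearchAndUpdate` class are assumptions for the sake of the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomSearchAndUpdate {

    /** Sets <flag> to newValue on every record with the given id, then serializes the whole tree. */
    static String updateFlag(String xml, String id, String newValue) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));   // the whole tree is now in memory

        // The id is spliced into the XPath for brevity; do not do this with untrusted input.
        NodeList flags = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("/records/record[id='" + id + "']/flag", doc, XPathConstants.NODESET);
        for (int i = 0; i < flags.getLength(); i++) {
            flags.item(i).setTextContent(newValue);
        }

        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));   // full rewrite, not a patch
        return out.toString();
    }
}
```

Searching is convenient here, but both costs mentioned above are visible: `parse` builds the full tree before anything else happens, and saving one changed flag means serializing every node again.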

Option 2: VTD-XML (http://vtd-xml.sourceforge.net/userGuide/0.html). A relatively new API (well, it has apparently been around since 2004) that allows indexed read-write access to the XML document. It uses a non-extractive style of parsing, which means that instead of extracting the content of each element into separate in-memory objects, it keeps the original document bytes as they are and records the offset and length of each XML token it encounters in an index. A good article on non-extractive parsing is here: http://www.xml.com/pub/a/2004/05/19/parsing.html
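To illustrate the non-extractive idea only, here is a toy sketch in plain Java. It is not VTD-XML's API or its actual record layout: `OffsetIndexSketch` and `Token` are names I made up, and the scanner ignores comments, CDATA and processing instructions. It records where each start-tag name sits in the source text and builds a string only when asked.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetIndexSketch {

    /** One index entry: where a token starts in the source and how long it is. */
    static final class Token {
        final int offset;
        final int length;

        Token(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Records the position of every start-tag name; no substring is copied. */
    static List<Token> indexElementNames(String xml) {
        List<Token> index = new ArrayList<>();
        for (int i = 0; i < xml.length() - 1; i++) {
            if (xml.charAt(i) == '<' && Character.isLetter(xml.charAt(i + 1))) {
                int start = i + 1;
                int end = start;
                // The name ends at whitespace, '/' or '>'.
                while (end < xml.length() && "/> \t\r\n".indexOf(xml.charAt(end)) < 0) {
                    end++;
                }
                index.add(new Token(start, end - start));
                i = end - 1;                                      // resume scanning after the name
            }
        }
        return index;
    }

    /** Materializes a token's text only when the caller actually needs it. */
    static String text(String xml, Token token) {
        return xml.substring(token.offset, token.offset + token.length);
    }

    public static void main(String[] args) {
        String xml = "<records><record id=\"1\"><flag>N</flag></record></records>";
        for (Token token : indexElementNames(xml)) {
            System.out.println(token.offset + "+" + token.length + " " + text(xml, token));
        }
    }
}
```

The index holds two integers per token instead of a string or node object per token, which is where the memory saving over a DOM-style tree comes from.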

Once the document is parsed, this index is ready to be traversed, either by top-down descent, by XPath, or by index lookup. The index itself can be persisted to a file and reused for subsequent processing of the document without parsing the document again.

Writes are also efficient, since only the contents of the modified element need to be written, without re-parsing the entire document.
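Again as a toy illustration rather than VTD-XML's own API: once the offset and length of a token are known, an update can be produced by copying the untouched bytes verbatim and substituting only the changed range. The `OffsetPatchSketch` class and the hard-coded offset are my assumptions; a real index would supply the position.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class OffsetPatchSketch {

    /** Replaces the bytes at [offset, offset + length) and copies everything else verbatim. */
    static byte[] replaceRange(byte[] doc, int offset, int length, byte[] replacement) {
        ByteArrayOutputStream out =
                new ByteArrayOutputStream(doc.length - length + replacement.length);
        out.write(doc, 0, offset);                                       // untouched prefix
        out.write(replacement, 0, replacement.length);                   // new content
        out.write(doc, offset + length, doc.length - offset - length);   // untouched suffix
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] doc = "<record><flag>N</flag></record>".getBytes(StandardCharsets.UTF_8);
        // Offset 14, length 1 locates the text "N" inside <flag>.
        byte[] patched = replaceRange(doc, 14, 1, "Y".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(patched, StandardCharsets.UTF_8));
    }
}
```

No parsing happens during the update; the cost is a byte copy plus the size of the replacement.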

A note on how VTD-XML compares with other XML APIs: http://vtd-xml.sourceforge.net/userGuide/5.html

[I have not yet written any code using VTD-XML; I’m planning to get my hands dirty sometime this week and will upload the snippets.]

About saratnathb

Building SOA solutions using Oracle Fusion Middleware technology stack.