Let’s suppose for a minute that …
- You’re a Pythonista wanting to process the content in some really big XML documents (10s of Mb or more)
- Most of the content isn’t important, so you can just drop it, but …
- There are ‘islands’ of XML content that are both the heart of your task, and structurally complex
(Consider that introduction fair warning that there’s heavy programmer talk ahead …)
If you’ve done XML parsing before, #1 and #2 will probably make you think of SAX, the Simple API for XML. Since reading an XML document into memory can take up to 10x the size of the original document, the alternative approach to SAX (DOM processing) isn’t feasible for really big files. And using SAX would be okay, but …
#3 should make you think of XPath, since that’s a much clearer (and more declarative) way to express the semantics of what you want out of that complex XML. However, XPath processing requires you to have the full fragment of interest in memory.
What you’d really like as a general approach is to process the document with SAX until you find an island of interest, then capture that whole fragment in memory so you can do something structural with it. After that, you can go back to SAX parsing again until another interesting island arises.
How do you get the benefits of both approaches?
Well, if you’re like me, you
- scratch your head a bit
- think through, and perhaps try, some home-brewed approaches
- get frustrated that they’re not general enough
- do some more head-scratching
- think to yourself “somebody, somewhere, must have already solved this!”
- spend some time looking around
- finally find a Better Way (or two)
The first Better Way is Python’s standard pulldom module (Uche Ogbuji has a nice illustration of its usage here). But, as he himself points out here, the pulldom approach doesn’t offer much to aid the clarity of your code.
The Even Better Way, particularly if you’re already using 4Suite‘s XML tools (and you should really consider it if you’re not), is to use their Sax.DomBuilder() method. That’s mentioned in their documentation, but with only a single sentence. Here’s a little more detail on how you might use such an approach. I assume you already know how SAX parsing works (there are plenty of other resources out there if you don’t: you might start with this recipe).
Sax.DomBuilder() is meant to essentially mirror the structure of a normal SAX processor. When the usual SAX events fire (like the start of an element), Sax.Dombuilder() turns these into their corresponding activities (like starting a new child element) to build a fresh document from scratch.
So the conceptual challenge is figuring out how to switch mid-stream from one handler to another. It’s not good enough to set up two handlers, and simply switch at a beginning of an island of interest in the stream of SAX events. For my application (and probably most others), you also have to switch back once you’re done (and perhaps repeat the cycle). That means you need an additional level of control above either of the handlers (the “normal” SAX callbacks and the DomBuilder variant).
If you’re thinking in state-machine terms (SAX tends to do that to you), you might add conditionals or flags to your code. I think the approach below is even more better, though. The standard SAX callbacks are defined with a layer of indirection, so the “real” definition for starting an element is in _startElementNS
(note the leading underscore). Then when you find the beginning of the content of interest (in this example, the details
element and its children), you simply switch the callback definitions for the DomBuilder ones.
The key reason to use this indirection is so you can detect when you’re done and switch back.
If you’ll run this example (which is laced with some print statements to make the execution flow clear), you’ll see that the “normal” callbacks fall silent once the details
element is passed. Also, note that you have to test for the entry condition at the beginning of the startElementNS
, but at the end of endElementNS
, to make sure the right things do (and don’t happen). In this example, i only collect one island of interest and print it out at the end: in a real application, you’ll probably update some data structure as you go.
Note that Uche’s Amara toolkit offers even more capabilities, and i suspect Amara’s Bindery may make this kind of task even more straightforward. But i haven’t crossed that bridge yet.
(If you’re a Perl programmer, this whole story should make you think of XML::Twig. That’s a fair answer, though Twig’s API is a bit non-standard. However, you may already be working in Python for plenty of other good reasons – I’m not trying to start a language war, though, honest! – which reason enough to use this approach.)
Amara does in fact have the very powerful pushdom feature that lets you open a file and get list of nodes back based on an xpath expression. Here is a quick example:
import amara
prefixes = {u'a': u'http://www.w3.org/2005/Atom'}
entries = amara.pushbind('feed.atom', '/a:feed/a:entry', prefixes=prefixes)
I personally have used it for quickly grabbing bits of XML files for things like indexes. Good article!