XML Schema with Optional, Unbounded, Unordered Elements

This is so obscure i hesitate to blog about it, except that it took me so long to figure out that i’d love to save somebody else the trouble. You won’t care unless:

  • You’re designing an XML Schema definition (.xsd) to validate an XML file
  • You’re defining an element to contain regular text, or multiple elements, in any order, from zero to many times

Here’s an example: suppose you have a plain text description of events that includes people, places, and Bible references.

Jesus heals Simon’s mother-in-law (Matt 8:14-17; Mark 1:29-34; Luke 4:38-41)

You want to link person references with a Link element, Bible references with a Reference element, and otherwise leave the plain text as is. The result looks something like this (using square brackets, since otherwise WordPress gets confused):

[Link]Jesus[/Link] heals [Link]Simon[/Link]’s mother-in-law ([Reference]Matt 8:14-17[/Reference]; [Reference]Mark 1:29-34[/Reference]; [Reference]Luke 4:38-41[/Reference])

Now imagine several of these in the same element, so potentially you can have any arbitrary sequence of Links, References, and plain text, in any order, any number of times. Describing this with a BNF grammar is trivial:

LinkRef ::= Link | Reference
TextItem ::= ( text | LinkRef )+

A cursory reading of the XML Schema description (which i’d never actually done before, instead depending on XMLSpy, which generally lets me avoid thinking that hard) might make you think grouping models like sequence, choice, and all, in conjunction with attributes like minOccurs and maxOccurs, would do what you need. But there’s a surprisingly complex set of interactions between these (which i still don’t really understand), so what seemed simple proved unexpectedly hard. Here are a few examples of what i tried that XMLSpy’s validation model for XSD files (which i’m assuming is correct) wouldn’t allow:

  • while all is for an unordered group of elements, it’s restricted to maxOccurs=1, so it doesn’t handle unbounded occurrence (though it does allow minOccurs=0, i.e. optionality). Furthermore, it can’t be nested inside other model groups like sequence.
  • choice groupings can be neither optional nor unbounded.
  • trying to specify multiple occurrences of both Link and Reference, each both optional and unbounded, is flagged as an ambiguous model.

The solution i finally discovered (after embarrassingly many other permutations, more by trial and error than anything else):

  • define a LinkRef group that allows a sequence of either Link or Reference, both optional and unbounded (zero to many occurrences)
  • the TextItem (enclosing parent) element allows an optional and unbounded sequence of LinkRef groups.

For the more visually oriented, here’s how it looks in XMLSpy:

TextItem and LinkRef Grouping
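
In schema code, the shape of the solution is roughly this (a sketch reconstructed from the description above, not copied from the actual .xsd: the element types are omitted, i’ve rendered the group as a choice since that’s the standard unambiguous way to say “either Link or Reference”, and mixed="true" is what permits the plain text between the elements):

<xs:group name="LinkRef">
  <xs:choice>
    <xs:element name="Link"/>
    <xs:element name="Reference"/>
  </xs:choice>
</xs:group>

<xs:element name="TextItem">
  <xs:complexType mixed="true">
    <xs:group ref="LinkRef" minOccurs="0" maxOccurs="unbounded"/>
  </xs:complexType>
</xs:element>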

Addressability Matters

Ever since Adam named the beasts (Gen 2:19-20), labels have mattered to humanity: it’s pretty hard to hold a conversation if you have to start with “you know that really big gray beast with the cute little ears that sits in the river all day with just its eyes showing?”, instead of just “hippopotamus”.

Information on the web works the same way. Most (but not all!) web pages have the equivalent of a name, their Uniform Resource Locator (URL), which tells your browser how to bring up the page. But too many conversations about web pages are still like the hippopotamus conversation: “just go to www.frooble.com, then type ‘shebang’ in the search box, and look about half-way down the page on the left side …”. In other words, that little tidbit of information isn’t addressable: i can’t give you a name for it, i can only tell you to travel over the river, through the woods, and then turn left at the 3rd oak tree.

Though there’s usually no good technical reason for it, this is still all too often true of our web-enabled world. For example, i admit to my chagrin that i only just now figured out the URL for my Facebook profile, even though i had looked for it (half-heartedly) several times before. (I happened to stumble over somebody else’s, saw the pattern, and then plugged my own name and ID into the URL instead.) Having a URL that’s both explicit and understandable enables this kind of URL hacking, which is a really powerful technique.

Here’s a small example (combined with a shameless plug). The HTML designers for the upcoming Bible Tech conference have added page targets for speakers to the Speakers page. So even though there’s one long list, you can get to just the right spot on the list by following the link to my talk. And if i show you the URL

http://www.bibletechconference.com/speakers.htm#SeanBoisen-2009

and explain the schema ([baseURL]#[Firstname][Lastname]-[year]), you can get to my talks from last year too. That’s a nice bit of design, and part of a much larger and important architectural practice called Representational State Transfer or REST. As another example, you can probably figure out how to change this URL

http://bible.logos.com/passage/NIV/Ge 2.19-20

to get you to Mark 4.1-12 in the ESV instead (though you might stumble if you use a colon instead of a period to separate chapter and verse).

A lot of important things only become possible once you start to provide names for your resources. That’s a big part of the justification for the complex tangle of ideas called the Semantic Web, or if that’s too high-falutin’ for you, just call it smarter web design for information integration.

PS: i realized later it wasn’t just that i couldn’t figure out how to construct a Facebook URL: you have to make a badge first to get an addressable URL, which seems pretty non-obvious!

XSLT and Namespace Matching

I have to admit to something of a love-hate relationship with programming in Extensible Stylesheet Language Transformations (XSLT). On the love side, it’s a very powerful language, and its structure-transforming orientation makes certain tasks easy that would be really hard in a more procedural approach. So when i started getting involved with lots of XML data a number of years ago, it wasn’t long until a basic knowledge of XSLT became an essential part of my toolkit.

However, some of its power comes from the fact that there’s a fair amount of implicit processing going on behind the scenes. Of course, when that does just what i need, that’s great: but when it doesn’t, and i don’t understand why, i quickly slide into “this language drives me nuts”. That problem is aggravated by the fact that i don’t use it all the time: so i don’t get really good at it, my understanding of the processing model is just deep enough to accomplish my current task, and the little tricks that are an essential part of using any language well recede too quickly into the dim mists of my brain.

My latest love-hate experience comes from transforming some XAML code (the details of why we need to do this are too painful to recount, but are all too typical of commercial data-slinging environments like Logos). Here’s a simplified fragment of the input XML, where the basic task is recomputing the height and width:

XML fragment
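
(WordPress ate the actual markup, hence the image; see the PS at the end of this post. Reconstructed from the discussion below, the fragment looked something like this, where the attribute values are invented but the default namespace declaration is the part that matters:)

<Viewbox xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
         Height="150" Width="150">
  <!-- drawing content elided -->
</Viewbox>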
You might naively think that a matching statement (the heart of a lot of XSLT processing) like xsl:template match="/Viewbox" is the way to get started with this. I thought so too, but then spent an embarrassingly long time getting no output whatsoever because it didn’t match. I could find the first element with tricks like match="/*", but couldn’t find it directly.

Those of you who are a little more XSLT-savvy than I are now shaking your heads and tsk-tsking at my obvious mistake: the XML document (and hence the Viewbox element) has its own default namespace (xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"). It’s a small comfort that i lingered in my error longer than i might have because i tested my XPath expressions first in XMLSpy (as i often do), where the expression /Viewbox matches just like you’d expect. But XMLSpy’s XPath evaluation honors the document’s default namespace, while XSLT matching doesn’t (unless you tell it to). So first i had to realize that one tool wasn’t quite telling me the same truth as the other.

But even after guessing it was a namespace problem, it still took me far longer than it should have to figure out a solution. I tried a few stab-in-the-dark approaches, like putting namespace declarations in the stylesheet, and looked (in vain!) in my XSLT book for a clear explanation of matching and namespaces. I’m sure it’s there somewhere, but it’s a big book, and i often have a hard time finding the right information in it (maybe the second edition does better with this; i only have the first edition handy).

This post provided one solution: ignore the namespace altogether with a matching expression like match="/*[local-name()='Viewbox']". Though perhaps a little clunky, that does the job. I found a slightly better approach here, which is what i finally adopted: define a namespace prefix in the stylesheet (not the same as just copying the namespace declarations!), and then use that prefix in your match patterns. Specifically, i added xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml/presentation" to the namespace declarations on the stylesheet (the prefix is different than in the XML source, which declares it as the default namespace, but that’s ok: it’s the namespace URI that has to match, not the prefix), and then used match="/x:Viewbox" in the xsl:template. After this epiphany, the rest worked as i knew it would all along :-/.
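
Putting that together, the skeleton of the stylesheet looks like this (a sketch: the real template body does the height and width arithmetic):

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml/presentation">

  <!-- matches the root Viewbox even though the source document declares
       this namespace as its default (no prefix): what has to match is
       the namespace URI, not the prefix -->
  <xsl:template match="/x:Viewbox">
    <!-- recompute Height and Width here -->
  </xsl:template>

</xsl:stylesheet>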

Notes to self:

  • when things don’t work the way you think they should, stop just trying different approaches you don’t understand, and figure out the underlying problem
  • if you can’t find what you need in one reference book, try another
  • repeat out loud as needed: namespaces are a pain, but they’re good for me …
  • go to Google sooner, since it knows all (if you can just find it!)

PS: does anybody else (besides me and this guy) have the problem that WordPress always eats your XML code? I haven’t yet figured out a way to get it past the editor and posting process, which is why it’s just an image (!) above.

A RefTagger Hack

Like Logos’ RefTagger? Me too, so much so that i’m starting to feel ripped off in contexts where i don’t have easy access to the text behind a Scripture reference. If you’ve only got one reference, no big deal: you just look it up in Libronix or some other Bible software, or one of the many excellent sites on the web. But if you’ve got a whole string of them, and you’re just scanning rather than doing in-depth study, it’s a pain to have to look them up one by one.

Since i’m spending a lot of time right now reviewing Scripture references for entities in the Bible Knowledgebase, i wrote a quick CGI hack i call the RefTaggerizer to fill the gap. It couldn’t be much simpler: there’s an HTML form that accepts a block of text as input, and then redisplays itself on the resulting page (a ‘self-posting form’, in the jargon i learned today). The magic is the embedded RefTagger code: any Bible references in the text that you submit get turned into dynamic links. There’s no real novelty here: it’s just a convenient way to transform a string containing references into hyperlinks (without the bother of creating a full-up HTML page on a server).
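
For the curious, the skeleton of such a script looks something like this (a sketch, not the actual reftaggerize.py: the form field name is invented, and you’d paste in the current RefTagger embed snippet from Logos where indicated):

#!/usr/bin/env python
# a minimal self-posting CGI form: it redisplays whatever text you submit,
# and the RefTagger script then turns any Bible references into links
import cgi

def main():
    form = cgi.FieldStorage()
    text = form.getvalue('text', '')    # empty on the first display
    print 'Content-Type: text/html\n'
    print '<html><head><title>RefTaggerizer</title>'
    print '<!-- paste the RefTagger embed snippet from Logos here -->'
    print '</head><body>'
    print '<form method="post">'
    print '<textarea name="text" rows="10" cols="60"></textarea>'
    print '<br/><input type="submit" value="Taggerize"/>'
    print '</form>'
    if text:
        # escape the text: the references are plain text, so RefTagger
        # still finds them after escaping
        print '<div>%s</div>' % cgi.escape(text)
    print '</body></html>'

if __name__ == '__main__':
    main()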

To use this, you need to:

  • have a web server running on your local system that supports CGI (i use Apache’s HTTP Server, which is a breeze to download and install: IIS ought to work just as easily).
  • configure the server to allow execution of CGI files (the CGI Tutorial for Apache 2.2 is here). Plain CGI shouldn’t strictly require mod_python, just a Python interpreter the server can find.
  • download the Python code reftaggerize.py and put it in your CGI directory (For Apache 2.2 on Windows, that’s probably C:\Program Files\Apache Software Foundation\Apache2.2\cgi-bin)

This is a really simple script, and if you’re a Perl-monger instead of a Pythonista it should be easy to rewrite. Alas, i tried installing this on my web site (at http://asamasa.net/cgi-bin/reftaggerize.py), but i haven’t done any CGI here before and it doesn’t work: it just displays the code in the browser rather than executing it. I don’t know if this is a configuration problem with python, or what. So i can’t give you a snazzy demonstration (but if somebody out there sets it up and gets it working, let me know and i’ll post the URL).

LinuxFest Northwest — Next Weekend

If you’re anywhere near Fourth Corner next weekend (that’s Bellingham/Whatcom County in Washington State, for you outlanders), you may want to consider attending Linuxfest Northwest (LFNW), a 2-day showcase of open source and Linux technology at Bellingham Technical College.

Yours truly will be presenting Saturday from 11-12:30 in Haskell 204 on Natural Language Processing in Python using NLTK. Basically, i’ll be explaining and showing how some really great (and sophisticated) tools for processing language that used to be the exclusive province of PhDs in Computational Linguistics are now within reach of ordinary mortals (as long as those mortals know some Python: otherwise you’ll mostly be hearing about why you should learn Python!). I’m hoping it will also be a good introduction to what natural language processing is all about, and why programmers might care.

If NLP isn’t your thing, don’t tell me and burst my bubble, but there are still dozens of other really interesting-sounding talks (including a couple in my timeslot that i’d go to if i weren’t otherwise occupied!). I’m really looking forward to the weekend: come find me if you’re there! Al Castle also has a LFNW post on his blog (with a bit more background).

(i expect to post slides from my talk on the web afterwards, so even if you can’t come you can get some of the information.)

Software Idioms for Data Exploration in Python

This is mostly a working note for myself of a sequence of steps i frequently go through in looking at a dataset and trying to understand its characteristics. Jon Udell gets credit (in my consciousness, at least) for the notion of a “software idiom” (a series of repeated steps to accomplish some purpose), and also for encouraging the narration of the work we do as a means of transmitting knowledge to others. I happen to use Python in the examples that follow, though they would work equally well in Perl or any number of other languages. Python works especially well, though, because the interpreter makes the process highly interactive: the meta-process for Python development i tend to use is

  • play with the individual steps until you get them right
  • tie them together into small bundles (this is made easier when you can retrieve them with a history mechanism in your IDE)
  • if you might use the whole package again, tidy it up and persist it as code in your library.

I haven’t turned the data exploration idioms here into code because they’re always just a little different, and they always seem simple enough that it’s not worth trying to generalize them (though doubtless many others have, and i’ve just never stumbled over their code). But i want to narrate my process with this post, if only to help me remember it (and if it helps you too, so much the better).

The Data

In the basic data exploration scenario, you’ve got a list of items, a variable-length list of attributes for each item, and you’d like to summarize it in some fashion. For this discussion, i’ll use my most recent experience with our Biblical People data: there are about 3000 people, with anywhere from one to hundreds of references to them in the Biblical text. But this is a very simple and general data model that fits lots of different cases. For this example, the data looks something like this:

Person ID    References
Aaron        Ex 4:14, Ex 4:27, Ex 4:28, Ex 4:29, (and many many more)
Abagtha.1    Esth 1:10
Abda.1       1Kgs 4:6
Abda.2       Neh 11:17

and so forth (the numbers on the end of the Person IDs are for convenience in referring to different people who share the same name: but they could just as easily be bp1, bp2, etc.).

By far the most flexible way to represent this kind of data is in a Python dictionary whose keys are the person IDs (this assumes they’re unique), and whose values are the lists of references (assume for this purpose those are strings like “Ex 4:14”, though other representations are possible). The rest of my discussion assumes you’ve got such a dictionary, named allrefs (you’re mostly on your own for constructing it, but see the sketch below).
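
For instance, assuming the data lives in a tab-delimited file laid out like the table above, with the references comma-separated (the file name and format here are assumptions), construction might look like this:

allrefs = {}
for line in open('biblical-people.txt'):
    # one person per line: ID, a tab, then a comma-separated reference list
    (person, refs) = line.rstrip('\n').split('\t')
    allrefs[person] = refs.split(', ')

Already you’ve got one useful summary of the data in the dictionary itself: the length of the dictionary is the number of keys (which, because of the way dictionaries work, are all distinct), 2986 in all.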

>>> len(allrefs)
2986

Idiom #1: Find Singletons

A number of scholars (notably Pareto and Zipf) have contributed to the general observation that many events in the human sphere, rather than being normally distributed (the traditional bell curve), have a power law distribution, with a few very high frequency items, and then (to use Chris Anderson’s term) a long tail of very low frequency items. A rule of thumb is that roughly half the items occur just once (“singletons”) in such distributions. Identifying the singletons (and counting them) is often an important part of exploring the data.

For this task, the singletons are those with only a single reference, e.g. those keys for which the length of the value list is 1.

>>> singletons = filter(lambda (k,v): len(v) == 1, allrefs.items())
>>> len(singletons)
1635

Each item of singletons is now a (key, value) tuple, but only those with a single value, and we find there about as many as our rule of thumb would predict (1635 is roughly half of the total of 2986). This is an important subclass of the dataset: they dominate the data in quantity (more than half), though each individual item has little to add (only a single reference).

Idiom #2: Invert the Dictionary

allrefs is organized by people IDs: so it’s easy to ask questions about a given ID, but you can’t explore what’s true of a given value (reference). So a typical next step would be to turn the data around (invert it) and organize it by reference instead. First i define a utility function extendDictToList for populating a dictionary whose values are a list rather than a single item, without having to separately initialize the empty list first. This can be done directly in one line of Python, but somehow i find my rephrasing of it here easier to grasp and use correctly.

>>> def extendDictToList(dict, key, value):
...     """Extend DICT as necessary to include KEY, adding VALUE to the
...     list of values (so KEY must be hashable)."""
...     dict.setdefault(key, []).append(value)
...
>>> foo = {}
>>> extendDictToList(foo, 'bar', 'value 1')
>>> extendDictToList(foo, 'bar', 'value 2')
>>> foo
{'bar': ['value 1', 'value 2']}

This makes it easy to accumulate several values under the same key. With this in hand, we can invert allrefs by individual references:

>>> allrefsinv = {}
>>> for (person, references) in allrefs.items():
...     for ref in references:
...         extendDictToList(allrefsinv, ref, person)
...

So now allrefsinv is a dictionary inverting allrefs, with references as keys, and a list of the people mentioned in that verse as the value for each key. This data has different dimensions from allrefs, of course: there are 9617 references, each including one or more people. You could go back to idiom #1 at this point and find the singletons in this dictionary (that is, references to verses that only mention one person) if that were useful.
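
For instance, counting the references that mention exactly one person (not coincidentally, this is the same count that will show up as fd[1] in the next idiom):

>>> len(filter(lambda (k,v): len(v) == 1, allrefsinv.items()))
5459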

Idiom #3: Bin by Frequency

By looking at singletons, we’ve started down the road of examining the frequency of data items, and how the frequencies are distributed (noting that the special case of keys with only a single value accounts for roughly half of all the keys). While the individual singletons themselves might be interesting, it’s also useful to see how the sizes of all the values are distributed. To do this, we construct a frequency distribution from the original allrefs dictionary.

While you could do this directly using dictionaries, i find it more convenient to just use the FreqDist class in the probability module of the Natural Language Toolkit (NLTK). Digression: NLTK is a very useful package for a wide variety of language-related processing tasks (in fact, i’ll be giving a presentation about it April 26th – 27th, 2008 at LinuxFest Northwest (Bellingham Technical College, Bellingham WA)). Though it’s clearly overkill for this small problem, reusing well-defined code is a good practice. If you don’t want to mess with the whole package, just look at the source for probability.py: it looks like you could just pull out FreqDist, though i haven’t tried.

Let’s bin the inverted dictionary allrefsinv, to get a sense of how often verses refer to one or more people. Once you’ve got the FreqDist class defined, the steps are

  1. create an empty instance of a frequency distribution (fd)
  2. go through the value list (the people) for each key (the reference) and count how many values
  3. increment the bin count for that number

>>> from nltk.probability import FreqDist
>>> fd = FreqDist()
>>> for (ref, people) in allrefsinv.items():
...     fd.inc(len(people))
...
>>> len(fd)
15

It’s easy to get confused about this point: the keys in fd (which is just a dictionary with some additional methods) aren’t individual data items, but bins of a given size (there happen to be 15 of them). The counts for each bin represent the number of keys that have a value whose length is the bin size. So all the keys for which len(people) == 1 are counted by fd[1], the values with length 2 are in fd[2], etc. In this sample there aren’t any cases with binned size 0 (though in other data there might be: if we had that data here, it would be the verses that reference no people at all).

Now we can look at the counts for individual bins (fd[1] is the (inverted) singletons, references for verses that only refer to a single person), look at the maximum value for the entire distribution (it happens to be the singleton bin), and in fact print out the entire distribution.

>>> fd.samples()
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16]
>>> fd[1]
5459
>>> for sample in fd.samples():
...     print "%d\t%d" % (sample, fd[sample])
...
1 5459
2 2341
3 949
4 424
5 209
6 102
7 60
8 40
9 12
10 7
11 7
12 2
13 2
14 2
16 1

Note there’s a gap in the sequence: there didn’t happen to be any references to 15 people, so you’d have to use a different iteration construct if you want to show those zeros (there’s a sketch after the output below). Since this is aggregate data, you can’t directly find which verse is the whopper with 16 names in it, but filtering allrefsinv tells you it’s 1Chr 25:4.

>>> from pprint import pprint
>>> pprint(filter(lambda (k,v): len(v) == 16, allrefsinv.items()))
[('1Chr 25:4',
  ['Heman.2',
   'Bukkiah.1',
   'Mallothi.1',
   'Mahazioth.1',
   'Hanani.2',
   'Eliathah.1',
   'Hothir.1',
   'Uzziel.4',
   'Romamtiezer.1',
   'Giddalti.1',
   'Mattaniah.1',
   'Joshbekashah.1',
   'Hananiah2.3',
   'Shebuel.2',
   'Jerimoth.3',
   'Jerimoth.5'])]
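
About that different iteration construct: since fd is (as noted above) just a dictionary with some additional methods, one sketch is to walk the full range of bin sizes rather than just the observed samples, so the empty bins show up as explicit zeros:

>>> for size in range(1, max(fd.samples()) + 1):
...     print "%d\t%d" % (size, fd.get(size, 0))
...

This prints a 0 for bin 15 along with the counts above.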

Of course, this is only the starting point: there’s a lot more you could do beyond this to characterize the data, and maybe your head is already swimming from the data permutations. But i’ve found this 3-step process very helpful on lots of occasions where i want to understand a data set better. Gathering these few characteristics provides a very helpful overview to get started.

PS: Jon Udell’s been encouraging the accumulation of what he terms “unexamined software idioms” under the del.icio.us tag unexamined-software-idioms. I’m not completely sure this qualifies, but i’m tagging it that way just in case.

PPS: if you’ve been following the Bibleref meme, you might notice i didn’t annotate the references above. That’s because here i’m using them as data items, not real references. But thanks for noticing (if you did) …

Google Charts API

Google announced today their API for embedding charts in webpages, using nothing more than a standard HTTP URL with parameters. If you like showing data, you’re going to love this feature. You’ve probably already seen these charts on various Google properties like Google finance. While this won’t replace Excel, it includes the basic chart types: line charts, bar charts, pie charts, scatter plots, even Venn diagrams. Here’s their API documentation.

Here’s a simple example:

Google Chart example Venn diagram

Here’s a walkthrough of the link text that generated this (split onto multiple lines for exposition and annotated with comments: what you put in a browser’s address box obviously needs to go on a single line, without the comments):

http://chart.apis.google.com/chart?                        the base URL
cht=v&                                                     chart type: v for Venn
chs=400x200&                                               chart size: 400 by 200 pixels
chd=t:119,96,43,67,22,38,16&                               chart data: the 3 circle sizes and the intersection areas (see the API for details)
chdl=Redundant|Doesn't-start-pericope|Has-Meta-comment&    chart legend
chf=bg,s,efefef                                            chart fill: background = light grey
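
Reassembled onto a single line (what actually goes in the address box), that’s:

http://chart.apis.google.com/chart?cht=v&chs=400x200&chd=t:119,96,43,67,22,38,16&chdl=Redundant|Doesn't-start-pericope|Has-Meta-comment&chf=bg,s,efefef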

This is really a killer feature when you consider the amount of work it would take to create a diagram like the one above yourself. You’d have to

  • Make a chart in some other application (like Excel) and paste in a picture of it
  • Fool around with SVG or some other complex graphics language (and still have the problem of people not having SVG-capable browsers to see your chart!)

(The best general-purpose and web-friendly charting tool i’ve found is IBM’s ManyEyes site, which has a much richer palette of chart types: but that requires you to upload data tables and do some other configuration work.)

For data geeks like me, another key benefit that may not be obvious is that all the data is exposed here. If i paste in a picture of a chart, the viewer can only get the underlying data (and re-use it, or re-chart it) if i decide to supply it separately, or annotate my picture with it somehow. Here, the data is in the chart reference itself. That also means you can easily generate such charts programmatically.
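
For instance, a few lines of Python (a sketch: it just pastes the pieces together the same way the walkthrough above does, and the parameter values are the ones from the example; for real use you’d want to URL-encode the values):

params = [
    ('cht', 'v'),                          # chart type: Venn
    ('chs', '400x200'),                    # size in pixels
    ('chd', 't:119,96,43,67,22,38,16'),    # circle sizes and intersections
    ('chdl', "Redundant|Doesn't-start-pericope|Has-Meta-comment"),
    ('chf', 'bg,s,efefef'),                # background fill
]
url = ('http://chart.apis.google.com/chart?' +
       '&'.join('%s=%s' % (k, v) for (k, v) in params))
print url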

Since we live in a data-rich world, i’m a big fan of using visualizations to make data easier to understand. Tools like Google Chart and ManyEyes take away one excuse, by making it easier to show what your data means.


About the chart above: you’d have to look at my talk with Steve Runge at the recent SBL meeting to make sense of this, but it shows the following characterizations, for 159 occurrences of vocatives (or nominatives of address) in the New Testament epistles:

  • Redundant: many vocatives in the epistles, like “brothers” or “beloved”, don’t distinguish who’s being addressed (i.e. they’re not a typical “vocative of address” case). 119 of the 159 vocatives (75%) are like this.
  • Doesn’t start pericope: some have posited that vocatives function to signal textual transitions like pericope breaks. While that’s sometimes true, the 96 shown here (of 159, or 60%) do not.
  • Meta-comment: vocatives often co-occur with meta-comments like “I want you to know that …” (43 of 159, or 27%). See Steve’s work (and forthcoming resources from Logos) for more about this discourse function.

Fun with XML Parsing in Python

Let’s suppose for a minute that …

  1. You’re a Pythonista wanting to process the content in some really big XML documents (tens of megabytes or more)
  2. Most of the content isn’t important, so you can just drop it, but …
  3. There are ‘islands’ of XML content that are both the heart of your task, and structurally complex

(Consider that introduction fair warning that there’s heavy programmer talk ahead …)

If you’ve done XML parsing before, #1 and #2 will probably make you think of SAX, the Simple API for XML. Since an XML document read into memory as a DOM can occupy up to 10x the size of the original file, the alternative to SAX (DOM processing) isn’t feasible for really big files. And using SAX would be okay, but …

#3 should make you think of XPath, since that’s a much clearer (and more declarative) way to express the semantics of what you want out of that complex XML. However, XPath processing requires you to have the full fragment of interest in memory.

What you’d really like as a general approach is to process the document with SAX until you find an island of interest, then capture that whole fragment in memory so you can do something structural with it. After that, you can go back to SAX parsing again until another interesting island arises.

How do you get the benefits of both approaches?

Well, if you’re like me, you

  • scratch your head a bit
  • think through, and perhaps try, some home-brewed approaches
  • get frustrated that they’re not general enough
  • do some more head-scratching
  • think to yourself “somebody, somewhere, must have already solved this!”
  • spend some time looking around
  • finally find a Better Way (or two)

The first Better Way is Python’s standard pulldom module (Uche Ogbuji has a nice illustration of its usage here). But, as he himself points out here, the pulldom approach doesn’t offer much to aid the clarity of your code.
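
For reference, the basic pulldom pattern looks something like this (a minimal sketch, using a details element as the island of interest, the same one that shows up in the example below):

from xml.dom import pulldom

events = pulldom.parse('big-document.xml')
for (event, node) in events:
    if event == pulldom.START_ELEMENT and node.localName == 'details':
        # expand this element (and all its children) into a full DOM subtree
        events.expandNode(node)
        # ... now do something structural with node ...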

The Even Better Way, particularly if you’re already using 4Suite‘s XML tools (and you should really consider it if you’re not), is to use their Sax.DomBuilder() method. That’s mentioned in their documentation, but with only a single sentence. Here’s a little more detail on how you might use such an approach. I assume you already know how SAX parsing works (there are plenty of other resources out there if you don’t: you might start with this recipe).

Sax.DomBuilder() is meant to essentially mirror the structure of a normal SAX processor. When the usual SAX events fire (like the start of an element), Sax.DomBuilder() turns these into their corresponding activities (like starting a new child element) to build a fresh document from scratch.

So the conceptual challenge is figuring out how to switch mid-stream from one handler to another. It’s not good enough to set up two handlers and simply switch at the beginning of an island of interest in the stream of SAX events. For my application (and probably most others), you also have to switch back once you’re done (and perhaps repeat the cycle). That means you need an additional level of control above either of the handlers (the “normal” SAX callbacks and the DomBuilder variant).

If you’re thinking in state-machine terms (SAX tends to do that to you), you might add conditionals or flags to your code. I think the approach below is even better, though. The standard SAX callbacks are defined with a layer of indirection, so the “real” definition for starting an element is in _startElementNS (note the leading underscore). Then when you find the beginning of the content of interest (in this example, the details element and its children), you simply switch the callback definitions for the DomBuilder ones.

The key reason to use this indirection is so you can detect when you’re done and switch back.
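
Here’s the shape of it in a sketch (this isn’t the downloadable code: it assumes, per the description above, that DomBuilder exposes the standard SAX callback names, and it glosses over parser setup and nested details elements):

from Ft.Xml import Sax

class IslandHandler:
    """SAX callbacks that switch to a DomBuilder inside islands of interest."""

    ISLAND = 'details'      # local name of the island element

    def __init__(self):
        self.builder = None   # the current Sax.DomBuilder, if inside an island
        # the indirection: the public callbacks start out pointing at the
        # "normal" (underscored) definitions
        self.startElementNS = self._startElementNS
        self.characters = self._characters

    def _startElementNS(self, name, qname, attributes):
        print 'normal start:', name[1]
        if name[1] == self.ISLAND:
            # entering an island: swap in the builder's callbacks so all
            # subsequent events construct a fresh DOM
            self.builder = Sax.DomBuilder()
            self.startElementNS = self.builder.startElementNS
            self.characters = self.builder.characters
            self.builder.startElementNS(name, qname, attributes)

    def _characters(self, data):
        pass    # text outside islands is ignored

    # endElementNS stays ours throughout, so we can spot the island's end
    def endElementNS(self, name, qname):
        if self.builder:
            self.builder.endElementNS(name, qname)
            if name[1] == self.ISLAND:
                # leaving the island: do something structural with the DOM
                # the builder accumulated, then switch back to normal
                self.handleIsland(self.builder)
                self.builder = None
                self.startElementNS = self._startElementNS
                self.characters = self._characters

    def handleIsland(self, builder):
        pass    # XPath over the fragment, update data structures, etc.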

(Download the code.)

If you run this example (which is laced with some print statements to make the execution flow clear), you’ll see that the “normal” callbacks fall silent once the details element is passed. Also, note that you have to test for the entry condition at the beginning of startElementNS, but at the end of endElementNS, to make sure the right things do (and don’t) happen. In this example, i only collect one island of interest and print it out at the end: in a real application, you’d probably update some data structure as you go.

Note that Uche’s Amara toolkit offers even more capabilities, and i suspect Amara’s Bindery may make this kind of task even more straightforward. But i haven’t crossed that bridge yet.

(If you’re a Perl programmer, this whole story should make you think of XML::Twig. That’s a fair answer, though Twig’s API is a bit non-standard. However, you may already be working in Python for plenty of other good reasons – I’m not trying to start a language war, honest! – which is reason enough to use this approach.)

Fun With the Libronix Object Model

Most users only interact with Logos Bible Software as an ordinary desktop application. But underneath the hood is a powerful API called the Libronix Digital Library System Object Model (LDLSOM) that programmers can use to do all kinds of fancy tricks (you can download the documentation for free). Lately i’ve been working on automated processing of Bible dictionary content, and consequently i’ve been learning how to program against the LDLSOM. The climb is a little steep in the early part of the learning curve, but once you’ve ascended far enough there are some neat things you can do with it. And since everything is exposed as COM objects, any language with a COM interface (JavaScript, Perl, Python, even (gasp) Java) can be used with it.

Here’s a simple Python example, using an approach my colleague Rick Brannan developed, for extracting the ISBN numbers of resources in your Libronix library. Why? So you can upload them to LibraryThing. Here’s the deal: LibraryThing is a great social bookmarking service for bibliophiles, where you can find out who else has the same books you do and use that to discover common interests, popular titles, reviews, etc. A lifetime membership for LibraryThing is only $25, which seems like a bargain. But the other cost of admission is entering your library information. For dead tree books, you can use a CueCat scanner with the barcode to get a list of ISBN numbers (which you cue up so LibraryThing can look up and import the other metadata). But for electronic books like those in Libronix, how do you get the ISBNs? The best Libronix packages like Scholar’s Gold and Silver have thousands of resources, and looking each one up by hand is a daunting task.

LDLSOM to the rescue! The following snippet of Python code goes through each resource in your library, extracts the metadata (you can view this in the application with Help > About This Resource), and prints out the ISBN numbers for resources that have them (that’s a couple hundred for the version i have here at work). This assumes you have a standard Python installation (i use the excellent one from ActiveState) and one additional package for XML parsing: see the comments in the code. Unlike Rick’s version, there’s no user interface or other niceties: it just prints out the list. While i don’t want to provide basic Python tutorials (there are plenty of great ones out there already), if you know Python but can’t get this working, let me know (sean at logos here’s-the-dot com) and i’ll try to help. (WordPress loves to mangle indentation, which Python cares about: if the listing below gets garbled, download the code here.)

If you use this to load your LibraryThing catalog, let me suggest you tag your books with “libronix”, which seems to be the most popular LibraryThing tag at present for Libronix titles (though “ldls” and “Libronix ebook” are also in use).

import win32com.client
# requires the 4suite.org XML processing library, available from
# http://4suite.org/?xslt=downloadproduct.xslt&show=http%3A%2F%2F4suite.org%2Fdata%2Fsoftware%2F4Suite or
# http://sourceforge.net/project/showfiles.php?group_id=39954
from Ft.Xml import Parse

def toascii(str):
    """Silly hack for producing a mostly correct ASCII version of a
    Unicode string."""
    return str.encode('ascii', 'xmlcharrefreplace')

def main():
    # get a handle for the application, starting it if necessary
    ldls = win32com.client.Dispatch('LibronixDLS.LbxApplication')
    # for each resource, get the DC metadata in XML, parse it, and extract
    # the ISBN number (if any)
    for res in ldls.Librarian.Information.Resources:
        metadata = ldls.Librarian.OpenResourceMetadata(res.ResourceID)
        dcm = metadata.XML.selectSingleNode("dc-metadata").xml
        # presumably any unicode characters won't be in the ISBN
        dcmobj = Parse(toascii(dcm))
        isbnnode = dcmobj.xpath(u"/dc-metadata/dc-element[@name='dc.identifier:isbn']/text()")
        if isbnnode:
            # print the ISBN, a tab, then the quoted resource title
            print '%s\t"%s"' % (isbnnode[0].nodeValue,
                                toascii(ldls.Librarian.GetResourceTitle(res.ResourceID)))

if __name__ == "__main__":
    main()