God’s Word | our words
meaning, communication, & technology
following Jesus, the Word made flesh
December 11th, 2008

XSLT and Namespace Matching

I have to admit to something of a love-hate relationship with programming in Extensible Stylesheet Language Transformations (XSLT). On the love side, it’s a very powerful language, and its structure-transforming orientation makes certain tasks easy that would be really hard in a more procedural approach. So when i started getting involved with lots of XML data a number of years ago, it wasn’t long until a basic knowledge of XSLT became an essential part of my toolkit.

However, some of its power comes from the fact that there’s a fair amount of implicit processing going on behind the scenes. Of course, when that does just what i need, that’s great: but when it doesn’t, and i don’t understand why, i quickly slide into “this language drives me nuts”. That problem is aggravated by the fact that i don’t use it all the time: so i don’t get really good at it, my understanding of the processing model is just deep enough to accomplish my current task, and the little tricks that are an essential part of using any language well recede too quickly into the dim mists of my brain.

My latest love-hate experience comes from transforming some XAML code (the details of why we need to do this are too painful to recount, but are all too typical of commercial data-slinging environments like Logos). Here’s a simplified fragment of the input XML, where the basic task is recomputing the height and width:

[Image: XML fragment]

You might naively think that a matching statement (the heart of a lot of XSLT processing) like xsl:template match="/Viewbox" is the way to get started with this. I thought so too, but then spent an embarrassingly long time getting no output whatsoever because it didn’t match. I could find the first element with tricks like match="/*", but couldn’t find it directly.

Those of you who are a little more XSLT-savvy than I are now shaking your heads and tsk-tsk’ing at my obvious mistake: the XML document (and hence the Viewbox element) has its own namespace (xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"). It’s a small comfort to me that i lingered longer in my error than i might have because, testing my XPath expressions first in XMLSpy (as i often do), the expression /Viewbox matches just like you’d expect. But XMLSpy’s XPath evaluation evidently applies the document’s default namespace to unprefixed names, while in XSLT (1.0 at least) an unprefixed name in a match pattern only matches elements in no namespace, unless you tell it otherwise with a prefix. So first i had to realize that one tool wasn’t quite telling me the same truth as the other.

But even after guessing it was a namespace problem, it still took me far longer than it should have to figure out a solution. I tried a few stab-in-the-dark approaches like putting namespace declarations in the stylesheet, and looked (in vain!) in my XSLT book trying to find a clear explanation of matching and namespaces. I’m sure it’s there somewhere, but it’s a big book, and i often have a hard time finding the right information in it (maybe the second edition does better with this: i only have the first edition handy).

This post provided one solution: ignore the namespace altogether with a matching expression like match="/*[local-name()='Viewbox']". Though perhaps a little clunky, that does the job. I found a slightly better approach here, which is what i finally adopted: define a namespace prefix in the stylesheet (not the same as just copying the namespace declarations!), and then put that prefix in your matches. Specifically, i added xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml/presentation" to the namespaces for the stylesheet (the XML source declares it as the default namespace with no prefix, but that’s ok: only the URI has to match), and then used match="/x:Viewbox" in the xsl:template. After this epiphany, the rest worked as i knew it would all along :-/.
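
The same trap shows up outside XSLT, so here is a small sketch of it in Python's standard xml.etree.ElementTree (an analogy, not the stylesheet itself). The input is a made-up stand-in for the XAML fragment: the Height and Width values and the Canvas child are assumptions for illustration only. It shows the unprefixed name failing, the prefix-bound name succeeding, and a rough equivalent of the local-name() workaround.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the XAML input; only the namespace URI is from the post.
XAML = (
    '<Viewbox xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation" '
    'Height="100" Width="200"><Canvas/></Viewbox>'
)

# Our own prefix bound to the same URI; the prefix name "x" is arbitrary.
NS = {"x": "http://schemas.microsoft.com/winfx/2006/xaml/presentation"}

root = ET.fromstring(XAML)

# The element's real name includes the namespace URI.
print(root.tag)

# An unprefixed name means "no namespace", so this finds nothing.
print(root.find("Canvas"))

# Binding a prefix to the same URI makes the match succeed.
print(root.find("x:Canvas", NS).tag)

# Rough equivalent of local-name(): drop the namespace part and compare names.
print([el.tag.split("}")[-1] for el in root.iter()])
```

As in the stylesheet fix, the prefix itself doesn't matter: what has to agree between the query and the document is the namespace URI.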

Notes to self:

  • when things don’t work the way you think they should, stop just trying different approaches you don’t understand, and figure out the underlying problem
  • if you can’t find what you need in one reference book, try another
  • repeat out loud as needed: namespaces are a pain, but they’re good for me …
  • go to Google sooner, since it knows all (if you can just find it!)

PS: does anybody else (besides me and this guy) have the problem that WordPress always eats your XML code? I haven’t yet figured out a way to get it past the editor and posting process, which is why it’s just an image (!) above.

December 9th, 2008

A RefTagger Hack

Like Logos’ RefTagger? Me too, so much so that i’m starting to feel ripped off in contexts where i don’t have easy access to the text behind a Scripture reference. If you’ve only got one reference, no big deal: you just look it up in Libronix or some other Bible software, or one of the many excellent sites on the web. But if you’ve got a whole string of them, and you’re just scanning rather than doing in-depth study, it’s a pain to have to look them up one by one.

Since i’m spending a lot of time right now reviewing Scripture references for entities in the Bible Knowledgebase, i wrote a quick CGI hack i call the RefTaggerizer to fill the gap. It couldn’t be much simpler: there’s an HTML form that accepts a block of text as input, and then redisplays itself on the resulting page (a ‘self-posting form’, in the jargon i learned today). The magic is the embedded RefTagger code: any Bible references in the text that you submit get turned into dynamic links. There’s no real novelty here: it’s just a convenient way to transform a string containing references into hyperlinks (without the bother of creating a full-up HTML page on a server).

To use this, you need to:

  • have a web server running on your local system that supports CGI (i use Apache’s HTTP Server, which is a breeze to download and install: IIS ought to work just as easily).
  • configure the server to allow execution of CGI files (the CGI Tutorial for Apache 2.2 is here). You need Python itself installed so the server can run the script; plain CGI shouldn’t require mod_python.
  • download the Python code reftaggerize.py and put it in your CGI directory (For Apache 2.2 on Windows, that’s probably C:\Program Files\Apache Software Foundation\Apache2.2\cgi-bin)

This is a really simple script, and if you’re a Perl-monger instead of a Pythonista it should be easy to rewrite. Alas, i tried installing this on my web site (at http://asamasa.net/cgi-bin/reftaggerize.py), but i haven’t done any CGI here before and it doesn’t work: it just displays the code in the browser rather than executing it. I don’t know if this is a configuration problem with Python, or what. So i can’t give you a snazzy demonstration (but if somebody out there sets it up and gets it working, let me know and i’ll post the URL).
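
For anyone curious about the shape of a self-posting form, here is a minimal sketch of the idea. It is not the original reftaggerize.py: the function names and the form field name "text" are hypothetical, and REFTAGGER_SNIPPET is a placeholder where the embed code from the RefTagger site would go. It sticks to the standard library and avoids the older cgi module.

```python
#!/usr/bin/env python3
# Sketch of a self-posting CGI form; adjust the line above to your Python path.
import html
import os
import sys
from urllib.parse import parse_qs

# Placeholder (assumption): paste the real RefTagger embed code here.
REFTAGGER_SNIPPET = "<!-- RefTagger embed code goes here -->"


def read_submitted_text(environ, stdin):
    """Return the 'text' field of a POSTed urlencoded form, or '' otherwise."""
    if environ.get("REQUEST_METHOD") != "POST":
        return ""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    fields = parse_qs(stdin.read(length))
    return fields.get("text", [""])[0]


def render_page(text):
    """Build the page: the submitted text (escaped) followed by the same form."""
    safe = html.escape(text)
    return (
        "<html><body>\n"
        f"<div>{safe}</div>\n"
        # No action attribute: the form posts back to this same URL.
        '<form method="post">\n'
        f'<textarea name="text" rows="10" cols="60">{safe}</textarea>\n'
        '<input type="submit" value="Tag references">\n'
        "</form>\n"
        f"{REFTAGGER_SNIPPET}\n"
        "</body></html>"
    )


if __name__ == "__main__":
    page = render_page(read_submitted_text(os.environ, sys.stdin))
    # CGI response: headers, a blank line, then the body.
    print("Content-Type: text/html; charset=utf-8")
    print()
    print(page)
```

The script only echoes the escaped text back into the page; turning the references into links is left entirely to the embedded RefTagger code running in the browser.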

December 1st, 2008

Automated Sermon Transcription

Don’t get your hopes up: i’m not about to reveal some secret wisdom that will let you painlessly get nice sermon transcripts from your recordings. But i got an email from someone using Dragon Naturally Speaking 10 to automatically transcribe sermons from an audio recording. He asked some reasonable questions, and others might be interested in my answers.

Here’s a sample transcription he provided:

Children this morning I had intended on staying all the things that Bruce about five minutes ago regarding last week, which tells you I was not stepped in Wednesday morning. I but I do want to add just one thing as one share with you one comment from one guest last week as person came a little late and as result of that had to sit somewhere in the back of no exactly where she sat but it was in the back. And during our time of worship, she told me afterwards that at one point she looked up at the guys serving in the sound and much to her amazement expecting to see three or four heads this kind of buried in buttons.

Though i haven’t heard the original audio for comparison, this transcription is not too edifying, not to mention barely comprehensible! Unfortunately, while commercial speech-to-text systems have made a lot of progress, the current state of the art is too often a source of amusement rather than of usable transcriptions.

He comments:

As you can see, much work needs to be done to massage this transcript into final form. Some of the initial work is:

  • Correct incorrectly transcribed words/phrases.
  • Correct punctuation/sentence breaks.
  • Define paragraph breaks.

My question is: could the resources available in the NLTK be used to automate some of the editing? Perhaps you have suggestions as to how one could most efficiently arrive at a final product, given the attached input.

Though i haven’t used Dragon’s system, i’ve spent a fair amount of my professional career working with speech-to-text systems, and unfortunately, i don’t know of any easy solution to this problem.

As commercial systems go, Dragon’s is probably about as good as you can get. While there are better performing systems in the research labs (my former colleagues at BBN Technologies have one of the best), they’re focused on customers with different requirements and much larger budgets than pastors. You should definitely spend the time to provide training samples of your speech (under the same acoustic conditions): that should pay off in better results. You might also get slightly better results with careful microphone placement: though our ears are very forgiving (and our interpreting brains very good at guessing), that’s not true of speech-to-text systems. In general, a close-talking mike at a constant distance will work better than one fixed to a podium.

There is another approach: CastingWords uses Amazon’s Mechanical Turk system to have human transcribers do the work. Their advertised budget transcription rate (if you’re not in a rush) is $0.75 per minute: so for $15, you could get the transcript for a typical 20-minute sermon (i’m sure you don’t go longer!). That’s likely to be more cost-effective than having somebody clean up transcripts as poor as the one above. You can even provide them with the URL for an audio file and they’ll take it from there. Disclaimer: i’ve never used their service, but i’ve heard others say they’re happy with the results.

While there is a lot of capability in the Python-based Natural Language Toolkit (NLTK), which i highly recommend for programmers interested in natural language processing, it doesn’t provide any silver bullets that i’m aware of.