God’s Word | our words
meaning, communication, & technology
following Jesus, the Word made flesh
December 11th, 2008

XSLT and Namespace Matching

I have to admit to something of a love-hate relationship with programming in Extensible Stylesheet Language Transformations (XSLT). On the love side, it’s a very powerful language, and its structure-transforming orientation makes certain tasks easy that would be really hard in a more procedural approach. So when i started getting involved with lots of XML data a number of years ago, it wasn’t long until a basic knowledge of XSLT became an essential part of my toolkit.

However, some of its power comes from the fact that there’s a fair amount of implicit processing going on behind the scenes. Of course, when that does just what i need, that’s great: but when it doesn’t, and i don’t understand why, i quickly slide into “this language drives me nuts”. That problem is aggravated by the fact that i don’t use it all the time: so i don’t get really good at it, my understanding of the processing model is just deep enough to accomplish my current task, and the little tricks that are an essential part of using any language well recede too quickly into the dim mists of my brain.

My latest love-hate experience comes from transforming some XAML code (the details of why we need to do this are too painful to recount, but are all too typical of commercial data-slinging environments like Logos). Here’s a simplified fragment of the input XML, where the basic task is recomputing the height and width:

[Image: XML fragment]

You might naively think that a matching statement (the heart of a lot of XSLT processing) like xsl:template match="/Viewbox" is the way to get started with this. I thought so too, but then spent an embarrassingly long time getting no output whatsoever because it didn’t match. I could find the first element with tricks like match="/*", but couldn’t find it directly.

Those of you who are a little more XSLT-savvy than i am are now shaking your heads and tsk-tsk’ing at my obvious mistake: the XML document (and hence the Viewbox element) has its own namespace (xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"). It’s a small comfort to me that i lingered longer in my error than i might have because, testing my XPath expressions first in XMLSpy (as i often do), the expression /Viewbox matches just like you’d expect. But XMLSpy’s XPath evaluation uses the document’s default namespace, while the XSLT processor treats an unprefixed name in a match pattern as being in no namespace at all (unless you tell it otherwise). So first i had to realize that one tool wasn’t quite telling me the same truth as the other.

But even after guessing it was a namespace problem, it still took me far longer than it should have to figure out a solution. I tried a few stab-in-the-dark approaches like putting namespace declarations in the stylesheet, and looked (in vain!) in my XSLT book trying to find a clear explanation of matching and namespaces. I’m sure it’s there somewhere, but it’s a big book, and i often have a hard time finding the right information in it (maybe the second edition does better with this: i only have the first edition handy).

This post provided one solution: ignore the namespace altogether with a matching expression like match="/*[local-name()='Viewbox']". Though perhaps a little clunky, that does the job. I found a slightly better approach here, which is what i finally adopted: define a namespace prefix in the stylesheet (not the same as just copying the namespace declarations!), and then put that prefix in your matches. Specifically, i added xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml/presentation" to the namespaces for the stylesheet (the XML source declares the same URI as its default namespace with no prefix, but that’s ok: only the URI has to agree), and then used match="/x:Viewbox" in the xsl:template. After this epiphany, the rest worked as i knew it would all along :-/.
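
The same trap shows up outside XSLT. Here is a minimal Python sketch (standard-library ElementTree, not an XSLT processor) of the two workarounds described above. The SAMPLE document, its attribute values, and the Canvas child are made up for illustration: the real input fragment isn’t reproduced here. Only the namespace URI comes from the post.

```python
import xml.etree.ElementTree as ET

XAML_NS = "http://schemas.microsoft.com/winfx/2006/xaml/presentation"

# Hypothetical stand-in for the input document (assumption, not the real XAML).
SAMPLE = (
    f'<Viewbox xmlns="{XAML_NS}" Width="100" Height="50">'
    "<Canvas/>"
    "</Viewbox>"
)

root = ET.fromstring(SAMPLE)

# An unprefixed name means "Canvas in no namespace", so nothing matches,
# even though the document visibly contains a <Canvas/> element.
no_namespace = root.find("Canvas")

# Workaround 1: bind a prefix of our own choosing to the same URI.
# The prefix ("x") is arbitrary; only the URI has to agree with the source.
with_prefix = root.find("x:Canvas", {"x": XAML_NS})


def local_name(tag):
    """Strip the '{uri}' part that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


# Workaround 2: ignore the namespace and compare local names only,
# the moral equivalent of local-name()='Canvas'.
by_local_name = [el for el in root.iter() if local_name(el.tag) == "Canvas"]

print(root.tag)
print(no_namespace is None, with_prefix is not None, len(by_local_name))
```

The prefix version is the tidier of the two here as well, for the same reason: it says which namespace you mean instead of pretending there isn’t one.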

Notes to self:

  • when things don’t work the way you think they should, stop just trying different approaches you don’t understand, and figure out the underlying problem
  • if you can’t find what you need in one reference book, try another
  • repeat out loud as needed: namespaces are a pain, but they’re good for me …
  • go to Google sooner, since it knows all (if you can just find it!)

PS: does anybody else (besides me and this guy) have the problem that WordPress always eats your XML code? I haven’t yet figured out a way to get it past the editor and posting process, which is why it’s just an image (!) above.

December 9th, 2008

A RefTagger Hack

Like Logos’ RefTagger? Me too, so much so that i’m starting to feel ripped off in contexts where i don’t have easy access to the text behind a Scripture reference. If you’ve only got one reference, no big deal: you just look it up in Libronix or some other Bible software, or one of the many excellent sites on the web. But if you’ve got a whole string of them, and you’re just scanning rather than doing in-depth study, it’s a pain to have to look them up one by one.

Since i’m spending a lot of time right now reviewing Scripture references for entities in the Bible Knowledgebase, i wrote a quick CGI hack i call the RefTaggerizer to fill the gap. It couldn’t be much simpler: there’s an HTML form that accepts a block of text as input, and then redisplays itself on the resulting page (a ‘self-posting form’, in the jargon i learned today). The magic is the embedded RefTagger code: any Bible references in the text that you submit get turned into dynamic links. There’s no real novelty here: it’s just a convenient way to transform a string containing references into hyperlinks (without the bother of creating a full-up HTML page on a server).
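
For the curious, here is a rough sketch of the self-posting idea. It is not the actual reftaggerize.py: render_page, the field name "text", and the page layout are all assumptions made for illustration, and the RefTagger embed code is left as a placeholder rather than guessed at.

```python
import html
from urllib.parse import parse_qs

# Placeholder only: the real embed snippet comes from the RefTagger page
# and is deliberately not reproduced here.
REFTAGGER_SNIPPET = "<!-- paste the RefTagger embed code here -->"

PAGE = """<html><body>
<form method="post" action="">
<textarea name="text" rows="10" cols="60">{escaped}</textarea><br>
<input type="submit" value="Taggerize">
</form>
<div id="output">{escaped}</div>
{snippet}
</body></html>"""


def render_page(form_body=""):
    """Return the page for a request whose url-encoded POST body is form_body.

    The empty action attribute makes the form post back to its own URL,
    so one function serves both the blank form and the redisplayed text.
    """
    fields = parse_qs(form_body)
    text = fields.get("text", [""])[0]
    # Escape the submitted text so it shows up as text, not as markup.
    escaped = html.escape(text)
    return PAGE.format(escaped=escaped, snippet=REFTAGGER_SNIPPET)


print(render_page("text=See+John+3%3A16+and+Psalm+119%3A60"))
```

A CGI wrapper would read the POST body from standard input, print a Content-Type header and a blank line, and then print whatever render_page returns.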

To use this, you need to:

  • have a web server running on your local system that supports CGI (i use Apache’s HTTP Server, which is a breeze to download and install: IIS ought to work just as easily).
  • configure the server to allow execution of CGI files (the CGI Tutorial for Apache 2.2 is here). You shouldn’t need mod_python for plain CGI: the server just has to be able to find a working Python interpreter.
  • download the Python code reftaggerize.py and put it in your CGI directory (For Apache 2.2 on Windows, that’s probably C:\Program Files\Apache Software Foundation\Apache2.2\cgi-bin)

This is a really simple script, and if you’re a Perl-monger instead of a Pythonista it should be easy to rewrite. Alas, i tried installing this on my web site (at http://asamasa.net/cgi-bin/reftaggerize.py), but i haven’t done any CGI here before and it doesn’t work: it just displays the code in the browser rather than executing it. I don’t know if this is a configuration problem with python, or what. So i can’t give you a snazzy demonstration (but if somebody out there sets it up and gets it working, let me know and i’ll post the URL).

December 1st, 2008

Automated Sermon Transcription

Don’t get your hopes up: i’m not about to reveal some secret wisdom that will let you painlessly get nice sermon transcripts from your recordings. But i got an email from someone using Dragon Naturally Speaking 10 to automatically transcribe sermons from an audio recording. He asked some reasonable questions, and others might be interested in my answers.

Here’s a sample transcription he provided:

Children this morning I had intended on staying all the things that Bruce about five minutes ago regarding last week, which tells you I was not stepped in Wednesday morning. I but I do want to add just one thing as one share with you one comment from one guest last week as person came a little late and as result of that had to sit somewhere in the back of no exactly where she sat but it was in the back. And during our time of worship, she told me afterwards that at one point she looked up at the guys serving in the sound and much to her amazement expecting to see three or four heads this kind of buried in buttons.

Though i haven’t heard the original audio for comparison, this transcription is not too edifying, not to mention barely comprehensible! Unfortunately, while commercial speech-to-text systems have made a lot of progress, the current state-of-the-art is too often a source of amusement rather than usable transcriptions.

He comments:

As you can see, much work needs to be done to massage this transcript into final form. Some of the initial work is:

  • Correct incorrectly transcribed words/phrases.
  • Correct punctuation/sentence breaks.
  • Define paragraph breaks.

My question is: could the resources available in the NLTK be used to automate some of the editing? Perhaps you have suggestions as to how one could most efficiently arrive at a final product, given the attached input.

Though i haven’t used Dragon’s system, i’ve spent a fair amount of my professional career working with speech-to-text systems, and unfortunately, i don’t know of any easy solution to this problem.

As commercial systems go, Dragon’s is probably about as good as you can get. While there are better performing systems in the research labs (my former colleagues at BBN Technologies have one of the best), they’re focused on customers with different requirements and much larger budgets than pastors. You should definitely spend the time to provide training samples of your speech (under the same acoustic conditions): that should pay off in better results. You might also get slightly better results with careful microphone placement: though our ears are very forgiving (and our interpreting brains very good at guessing), that’s not true of speech-to-text systems. In general, a close-talking mike at a constant distance will work better than one fixed to a podium.

There is another approach: CastingWords uses Amazon’s Mechanical Turk system to engage human transcribers in transcription. Their advertised budget transcription rate (if you’re not in a rush) is $0.75 per minute: so for $15, you could get the transcript for a typical 20-minute sermon (i’m sure you don’t go longer!). That’s likely to be more cost-effective than having somebody clean up transcripts as poor as the one above. You can even provide them with the URL for an audio file and they’ll take it from there. Disclaimer: i’ve never used their service, but i’ve heard others say they’re happy with the results.

While there is a lot of capability in the Python-based Natural Language Toolkit (NLTK), which i highly recommend for programmers interested in natural language processing, it doesn’t provide any silver bullets that i’m aware of.

November 5th, 2008

Me and Barry

Given all the election hoopla, you could be forgiven if you missed an important detail about Barack Obama’s successful path to the presidency. No, i don’t mean his massive fundraising, or his compelling oratory about change. As you’ll learn from his Wikipedia page, though he graduated from Columbia University, he started his education with two years at Occidental College, a small liberal arts school in southern California. His earliest interest in public service developed during his two years there, 1979-1981. What you won’t learn from Wikipedia, though, is who else was there during those formative years: me.

That’s right, Barry (that’s what people called him back then) and i were classmates at Oxy (that’s what we alumni call Occidental), where i was a student from 1976-1980. I was a senior during his freshman year, and i was also around for his sophomore year (after graduating, i worked at Oxy for three additional years as a campus minister with InterVarsity Christian Fellowship).

Barry started out as a basketball player — something i was never any good at — but, as he said in a May 18 speech at Wesleyan University “I began to notice a world beyond myself … I became active in the movement to oppose the apartheid regime of South Africa” (citation). In 1977, just two years previously, i had withdrawn my funds from the local Bank of America in protest, after learning about divestment as an economic tool for fighting against apartheid. While little has been said publicly about my actions and their possible impact on Barry, do you think it’s just coincidence?

As a senior, i was putting in my best efforts as a student, but that wasn’t true initially for Barry. As detailed in a Newsweek story about his Occidental years, he met in the Cooler (an on-campus cafe that i frequented as well) with Roger Boesche, professor of politics, to complain about a poor grade. As Boesche reported to Newsweek: “I told him he was really smart, but he wasn’t working hard enough”.

Obama confirmed that he transferred to Columbia in 1981 partly “because Occidental was so small, I felt that I had gotten what I needed out of it and the idea of being in New York was very appealing.” But another reported reason was that he had many older friends who were graduating: like i had done, the previous year.

I can’t take credit for everything. For example, i started out as a diplomacy and world affairs major, but later changed to linguistics. Barry went the opposite direction: after leaving Oxy for Columbia, Barry decided to focus on political science. And it was his own idea to go from Barry back to Barack: as recorded in his autobiography, “It was when I made a conscious decision: I want to grow up.” And of course, much of what i knew about presidential politics came from an earlier model: Jack Kemp, Occidental class of ’57 and Republican vice-presidential nominee in 1996.

Like i always say, give credit where due.

November 4th, 2008

More BibleTech 2009 Topics

Like a presidential candidate, i’m down to the wire for deciding what BibleTech talks to propose (happily, unlike them, i haven’t been campaigning for my ideas for months now!). The two i posted about last week — Bible Knowledgebase and Libronix Controlled Vocabulary — are the strongest contenders. But here’s a grab bag of some additional topics i’ve thought about, for you to cheer for, sneer at, or go off and implement yourself (so i don’t have to!). Let me know what you think.

Web Search for Bible References

At BibleTech:2008 i gave a talk on Bibleref: a Microformat for Bible References, as one approach to the problem of how content providers (web site authors, bloggers, etc.) can identify Bible references in what they create. Reftagger provides a different, more automated approach to the same problem.

But as i pointed out last January, that’s really only part of the problem — in fact, the smallest part. Because for every one blogger who adopts bibleref markup or installs Reftagger,  there will be 1000 more who’ve never heard of either one. In the earliest days of the web, you had to add keywords to your HTML to make it easy for search engines to find you: now, Google finds most plain text without any special work on your part. How do we accomplish the same thing for Bible references out on the web, so they can be reliably found regardless of whether the author took special care to identify them, despite differences in abbreviations or punctuation style, and being smart about verse ranges?
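
To make the problem concrete, here is a toy Python sketch of the recognition step only: find references in plain text, tolerate a few abbreviation and punctuation styles, and turn a verse range into explicit start and end verses. The BOOKS table, the REF pattern, and find_refs are all illustrative assumptions, nowhere near complete, and not how RefTagger or any search engine actually does it.

```python
import re

# Tiny illustrative table: a real one would cover every book and many
# more spellings and abbreviations.
BOOKS = {
    "psalm": "Psalms", "psalms": "Psalms", "ps": "Psalms",
    "john": "John", "jn": "John",
    "genesis": "Genesis", "gen": "Genesis",
}

REF = re.compile(
    r"\b(?P<book>[A-Za-z]+)\.?\s+"           # book name, optional period
    r"(?P<chapter>\d+)[:.](?P<verse>\d+)"     # 119:60 or 119.60
    r"(?:\s*[-\u2013]\s*(?P<end>\d+))?"       # optional range: -18 or en dash
)


def find_refs(text):
    """Return (book, chapter, first_verse, last_verse) for each reference."""
    refs = []
    for match in REF.finditer(text):
        book = BOOKS.get(match.group("book").lower())
        if book is None:
            continue  # something like "at 10:30" is not a Bible reference
        first = int(match.group("verse"))
        last = int(match.group("end") or first)
        refs.append((book, int(match.group("chapter")), first, last))
    return refs


print(find_refs("See Ps. 119:60, Jn 3.16-18 and Genesis 1:1."))
```

Normalizing to (book, chapter, first verse, last verse) is what would let a search for John 3:17 find a page that only mentions “Jn 3.16-18”. The hard parts this sketch skips are the ones that matter at web scale: numbered books (1 John), chapter-only and cross-chapter ranges, other languages, and false positives.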

Making Bible2.0 Work

For years now, multiple sites on the web have offered Bible texts (in case you didn’t notice, Logos launched a beta version of their own, bible.logos.com, recently). In the last few years, several sites have gone beyond that to a Web 2.0 style that i call “Bible 2.0”, by allowing users to contribute their own content through tags, external links, personal comments, etc. When Web 2.0 first became cool, several such sites arose quickly and then withered (like xpound.org), while others are still out there. YouVersion is perhaps the best developed (Blogos post); Bibleserver is another.

While the concept is still fairly new, there’s enough out there to begin to evaluate:

  • what works well about these sites? what doesn’t work so well?
  • what’s missing to get these kinds of sites to have the same value as more popular Web 2.0 sites like del.icio.us, flickr, etc.?
  • what are some new ways in which Bible2.0 could support Bible study in small groups, informal web communities, etc.?

Audience Choice

If you follow Blogos, what would you suggest i talk about? Any ideas for a visualization you’d love to see, or some other algorithmic/data topic?

October 30th, 2008

Logos Conference Specials at ETS/SBL

Alas, i won’t be going to this year’s Society of Biblical Literature or Evangelical Theological Society conferences, which is even sadder because this year’s SBL is in Boston, where i used to live. I enjoyed the three previous SBL conferences i attended (where i made presentations, all of which you can find here), but this year i’ve got my nose to the data grindstone churning out new projects.

But if you’re going, be aware that the folks at Logos are planning a dozen “conference special” bundles with very attractive pricing, covering some important titles in Greek and Hebrew, NT studies, theology, apologetics, etc. If you’re interested in more details, you can call Academic Sales, or just go by the booth at the conference.

Disclaimer: i work for Logos, so i’m not a disinterested party. But i don’t work on commission :-)

October 30th, 2008

Logos RefTagger

Maybe, like me:

  • you’re a WordPress blogger and you include Bible references like Psalm 119:60 in your posts
  • you’ve been saying to yourself “yeah, i really should install RefTagger, i’ll get around to that one of these days …”
  • you haven’t actually gotten around to it yet

Well, today was the day i finally got around to it: Blogos is now powered by RefTagger. It took me all of 5 minutes to:

  1. Download the WordPress plugin from this page.
  2. Unzip it and upload the folder and its contents to my hosting service (i use and recommend FileZilla for stuff like this), typically (myblogdir)/wp-content/plugins/.
  3. Go to the WordPress admin page at (myblogURL)/wp-admin/plugins.php and activate it.
  4. Go to the WordPress admin options page at (myblogURL)/wp-admin/options-general.php?page=reftagger/RefTagger.php and do any customization (though none is required)

You won’t see RefTagger links on old posts that i linked by hand (by design, it doesn’t override those), but you should see it clearly on future references. I’ve added Libronix links to my configuration (the little L icon) for you Libronix users.

If you’ve been following the Bibleref thread, you may wonder how these two approaches interact. RefTagger will pick up anything marked as class="bibleref", so if you’ve been diligently using Bibleref markup, you’re good (actually, you’re better, since Bibleref allows you to mark some less common types of references that RefTagger can’t pick up). I plan to keep using Bibleref markup myself: but i also recognize it’s a lot easier for blog authors to not bother, and just let RefTagger do the work. And neither approach solves the big problem of searching for Bible references across the web, as i’ll discuss in a future post. Also, i haven’t yet added RefTagger to SemanticBible, my main site: that’s not quite as easy as WordPress, so i’ll have to get around to that one of these days.

October 28th, 2008

BibleTech 2009 Topic: the Libronix Controlled Vocabulary

Next in my experiment to gather feedback on possible BibleTech 2009 topics: the Libronix Controlled Vocabulary. This is the second of my two major activities over the last year (the other was described in my previous post), and therefore a pretty strong contender for a BibleTech presentation.

Unlike the Bible Knowledgebase, which is about real-world entities in the Biblical text, the Libronix Controlled Vocabulary (LCV) organizes terminology from the field of Biblical studies, principally Bible dictionaries, encyclopedias, and other kinds of subject-oriented reference works. A controlled vocabulary identifies, organizes, and systematizes a specific set of terms for indexing content, capturing inter-term relationships, and expressing term hierarchies. Like other kinds of metadata, this infrastructure then supports applications in search, discovery, and general knowledge management. The initial version of the LCV was built by merging content from 7 of the most important Bible dictionaries in Libronix, and currently comprises some 11k terms: i expect it will eventually grow to 15k or perhaps more.

One interesting aspect of working in the specific domain of Biblical studies is that there is a core set of subjects that are common to many or most Bible dictionaries. This includes named individuals and places in the Bible, but also subjects like Heaven or Heresy. But while one dictionary has an article on Heresy (NBD [Libronix link], or Eastons [Libronix link]), another might have one entitled “Heresy and Orthodoxy in the NT” (Anchor [Libronix link]). These articles may have both common content but also significant differences, stemming from their intended audiences (scholarly vs. popular), theological orientation, comprehensiveness, etc. The LCV provides a way to capture some of these similarities, as well as enabling some interesting new capabilities for machine learning from existing prose content. For example:

  • what are the prototypical Bible references, names, or phrases used to discuss a topic?
  • can we learn anything about the importance of topics by looking at how much is written about them, how many dictionaries cover them, and other kinds of automated analysis?
  • what knowledge can be gleaned from the topology of terminology linkage (what links to what)?
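
As a toy illustration of the idea (and only that: the actual LCV data model isn’t described here), the following Python sketch maps differently titled articles onto one preferred term and then counts how many dictionaries cover each term. ARTICLES, ALT_LABELS, and coverage are hypothetical names, and the article list is sample data apart from the Heresy titles mentioned above.

```python
from collections import defaultdict

# (dictionary, article title) pairs. The Heresy rows follow the example in
# the text; the Heaven row is made-up sample data.
ARTICLES = [
    ("NBD", "Heresy"),
    ("Eastons", "Heresy"),
    ("Anchor", "Heresy and Orthodoxy in the NT"),
    ("NBD", "Heaven"),
]

# Alternate labels pointing at the preferred term in the vocabulary.
ALT_LABELS = {"Heresy and Orthodoxy in the NT": "Heresy"}


def coverage(articles, alt_labels):
    """Map each preferred term to the sorted list of dictionaries covering it."""
    covered = defaultdict(set)
    for dictionary, title in articles:
        term = alt_labels.get(title, title)  # fall back to the title itself
        covered[term].add(dictionary)
    return {term: sorted(names) for term, names in covered.items()}


for term, dictionaries in coverage(ARTICLES, ALT_LABELS).items():
    print(term, len(dictionaries), dictionaries)
```

Once titles resolve to shared terms, questions like “how many dictionaries cover this topic?” become simple counting, which is the kind of automated analysis the second bullet has in mind.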

I’m not sure i’ve provided enough information here to give a clear sense of what might be covered in such a talk, but i welcome any feedback from potential BibleTech attendees (or others) as to whether this sounds interesting, and which aspects of it you’d most like to learn about.

October 28th, 2008

BibleTech 2009 Topic: the Bible Knowledgebase

My most significant activity at Logos over the last year and a half has been building a database of people, places, and things i call the Bible Knowledgebase (BK). I’ve posted on numerous aspects of this project before (collected in this category), and thanks to lots of hard work by a number of individuals, we’re closing in on a relatively complete internal version. This won’t be released until the next major version of Logos software, so its public debut is still some ways off.

So one strong candidate for a BibleTech talk is a review of the BK, a machine-readable knowledge base of semantically-organized Bible data that is linked to Biblical texts to support search, navigation, and visualization. The thousands of entities in the BK (people, places, and things, along with their names) have a variety of attributes that are appropriate to their type: people have family relationships, places have geo-coordinates, etc. Relationships between entities support discovery and exploration.

Unlike knowledge expressed in prose (like Bible dictionaries), BK data provides reusable content that can serve a variety of purposes. It also provides an important integration framework for Libronix resources, in the general spirit of Tim Berners-Lee’s Linked Data idea.
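
To give a feel for what “entities with typed attributes and relationships” might look like, here is a minimal Python sketch. It is not the BK schema: the Entity and Relationship classes, the identifiers, the father_of predicate, and the approximate coordinates are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    id: str
    names: list          # an entity can go by more than one name
    kind: str            # "person", "place", or "thing"
    attributes: dict = field(default_factory=dict)


@dataclass
class Relationship:
    subject: str         # entity id
    predicate: str
    object: str          # entity id


# Sample data only; coordinates are rough.
ENTITIES = {
    "abraham": Entity("abraham", ["Abraham", "Abram"], "person"),
    "isaac": Entity("isaac", ["Isaac"], "person"),
    "bethlehem": Entity("bethlehem", ["Bethlehem"], "place",
                        {"lat": 31.7, "lon": 35.2}),
}

RELATIONSHIPS = [Relationship("abraham", "father_of", "isaac")]


def related(entity_id):
    """Follow outgoing relationships from one entity."""
    return [(r.predicate, ENTITIES[r.object].names[0])
            for r in RELATIONSHIPS if r.subject == entity_id]


print(related("abraham"))
```

Because relationships are stored as subject, predicate, object triples, the same data can feed a family tree, a map, or a graph visualization without being rewritten for each one, which is the “reusable content” point above.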

Some other topics the talk might address:

  • visualizing and learning from the graph of relationships
  • BK as an information architecture for other Libronix resources
  • challenges in building and using BK
  • some specific tools that have proved useful in managing BK development
  • a possible future for community participation in BK extension

So now, the audience participation portion of our program:

  • would you be interested in hearing a talk like this at BibleTech 2009?
  • what aspects are most/least interesting to you?

I’d encourage you to post a comment with your responses.

October 24th, 2008

BibleTech 2009

Things have been silent at Blogos for several months now: i needed to take a break and focus more intensely on moving along some of our major data projects at Logos (like the Bible Knowledgebase).

But i’m ready to get back to a more regular blogging schedule, and nothing gets the creative juices flowing like the prospect of another BibleTech conference! The first BibleTech (this past January) was one of the highlights of my year: here’s a list of 2008 speakers, including two presentations by me (you can find links to the slides here, and there’s an MP3 for the Zoomable Bible talk here, though be warned that it’s 150MB and non-streaming). So i’m really looking forward to the next one, March 28-29 in Seattle.

The call for presentations has gone out, and so i face the dilemma of choosing among lots of different ideas and topics, and deciding what to propose. So many smart people attended the last conference that i’d love to just sit around and talk tech for several days straight, but i probably have to focus on just one or two topics.

So here’s your chance to give me some feedback (and for me to learn whether anybody’s still listening!). I’m planning to blog about some of my presentation ideas in subsequent posts, and i’d love to hear your comments about them. Does the topic make sense? Would you want to hear about it? Is it compelling, relevant, important, “cool”? Is it too obscure, too far out there, too geeky? What can i improve from last year (if you attended one of my talks)? It would really help me to have some feedback on these questions, especially from those who attended last year and therefore have a good feel for what the conference is all about (but i’ll take any comments i can get).


If you’re on Facebook, please join the BibleTech group.

Maybe you should be presenting at BibleTech 2009 too! The call for participation is open until Nov 3, and describes what we’re looking for, so get those abstracts in. And if i happen to mention a topic that you’re interested in presenting on, let me know and then go for it! There’s no shortage of things i’d like to talk about …
