God’s Word | our words
meaning, communication, & technology
following Jesus, the Word made flesh
February 11th, 2010

Bookmarklets Redux

Time spent on the web can be oh-so tedious if you’re constantly cutting things from one page and pasting them elsewhere just to get to another, related page. Someday Linked Data may make this all better, but until then, we all get by with helpful tricks.

Bookmarklets are one essential weapon in the arsenal of the web-info-warrior. Usually they’re little JavaScript programs stored as a bookmark in Firefox, providing one-click access to some simple functionality like looking things up elsewhere, resizing your window, etc. I’ve blogged previously about bookmarklets to find local library sources for a book on an Amazon page (or PaperBackSwap).

I dusted off my bookmarklet skills this past week and came up with some nifty tools that i wanted to share.

First off, imagine you’re looking at a website with Bible references whose benighted author somehow failed to include RefTagger. So rather than a nice pop-up with the text of the reference, or even a helpful link to that text on some Bible site, you’re just looking at an inanimate, unlinked string: boo. The Bible Reference Bookmarklet to the rescue! Simply select the text of the reference, click the bookmarklet, and you’ll be whisked off to that reference at Bible.Logos.com. If you haven’t selected any text first, you get a dialogue box asking for it.

To get this goodie in Firefox, first make sure the Bookmarks Toolbar is showing (View > Toolbars > Bookmarks Toolbar must be checked). I’d love to give you a link to just drag onto the toolbar, but i don’t seem to be able to get the code past WordPress. So go to Bookmarks > Organize Bookmarks, and select Organize > New Bookmark. Give it a useful name like “Bible Reference Lookup”, and paste the code below in the Location field.

javascript:(function(){%20function%20getSearchString%20(promptString)%20{%20s%20=%20null;
if%20(document.selection%20&&%20document.selection.createRange)%20{%20s%20=document.selection.createRange().text;%20}%20
else%20if%20(document.getSelection)%20{%20s=%20document.getSelection();%20}%20
if%20(!%20(s%20&&%20s.length))%20{%20s%20=prompt(promptString,'');%20}
%20return%20s;%20}%20searchString%20=%20getSearchString('Bible%20Reference%20to%20look%20up%20:');%20
if%20(searchString%20!=%20null)%20{%20if(searchString.length)%20{%20location%20='http://bible.logos.com/#ref='+escape(searchString);%20}%20
else%20{%20location%20='http://bible.logos.com/';%20}%20}%20%20})();

After you’ve clicked ok, you should see it on your toolbar.

You can do similar tricks for a wide variety of strings that you just want to look up elsewhere (i discovered one here while writing this post that lets you look up articles on Wikipedia). This isn’t fundamentally different from copying the string into a search box: but sometimes it’s more convenient.

Descending into more esoteric purposes (to give you ideas for your own bookmarklets): as part of an earlier post on Tools for Personal Knowledge Management, i mentioned my use of TiddlyWiki for quick organization of hyperlinked notes. Like other wiki software, TiddlyWiki has its own link syntax, that looks like

[[Link text | URL]]

When linking to lots of other web pages, i was getting tired of copying the URL, pasting that in, then typing the square brackets, link text, vertical bar, and more square brackets, all in the right format. Wouldn’t it be more convenient to just construct this expression from the title of the page and its URL, rather than having to type it myself? YES! and the TiddlyWiki Page Link bookmarklet does just that, putting the result in a little pop-up window where a triple-click selects the whole thing, ready to copy and paste into your tiddlywiki (and tailor as desired: the title isn’t always what you want, but it’s often easier to edit and throw things out rather than type afresh). This one you can just drag to your bookmarks toolbar and use right away.

TiddlyWiki Page Link
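The string this bookmarklet builds is simple enough to sketch. What follows is a minimal illustration in Python, not the bookmarklet’s actual JavaScript (which isn’t reproduced here). The function name and the sample title and URL are hypothetical, and the sketch assumes the spaces around the vertical bar in the pattern above are only there for readability.

```python
def tiddly_link(title, url):
    """Build a TiddlyWiki-style link expression from a page title and its URL."""
    # Collapse runs of whitespace so a multi-line page title becomes one line of link text.
    text = " ".join(title.split())
    return "[[%s|%s]]" % (text, url)


# Hypothetical page title and URL, for illustration only.
print(tiddly_link("  Example   Page\n", "http://example.com/page"))
```

As with the bookmarklet, the result is a starting point: the page title often needs trimming before it makes good link text.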

Also, i’ve switched to a much better library lookup bookmarklet (and a service to help you create one for your local library) from WorldCat. Among other things, it generates the list of all the different ISBNs that might exist for a title (which can be very long indeed), and when there are many, it provides links for alternate searches in case the first group comes up empty handed.

Some other cool bookmarklets in my collection include:

  • CiteULike Popup Post and kin to make it easy to add (certain kinds of) articles to your reading list management. Adds more value for sources whose structure it understands.
  • Show del.icio.us citations of the current URL (you can find it there)
  • Resize your browser window to 1024 x 768 (if you want to see how a page will look on a smaller monitor or projector): the bookmarklet follows, just drag to your toolbar. 1024 x 768
  • A CSS validator for the current page: see Pete Freitag’s page.

Hat tips:

November 1st, 2009

Bible Chatbots

Suppose you had a database listing authorship and reported speech in the Bible, so that, for each set of words, you know who said or wrote them (the ESV folks did this using Amazon’s Mechanical Turk a few years back, and Jim Albright’s Dramatizer has similar data embedded in it). I assume the speakers have standardized identifiers.

Now imagine a matching algorithm (there are lots of candidates out there) that, when provided with either a question or a list of words, and optionally a speaker, retrieves passages that best match the input.

Example: “why does God allow evil?” might return

  • Eliphaz the Temanite: Job 15:14-16
  • the woman of Tekoa: 2 Sam 14:14
  • the apostle John: 1 John 3:11-17

Querying about “what about God and evil?” with speaker=Jesus might (in the best case) give answers like

  • Matt 5:45
  • Matt 12:35

Apart from how accurate such answers might be (that depends on the sophistication of the matching algorithm), you’ve now got the engine for a chatbot that gives Biblical “answers”. Aside from perhaps being an interesting hack, would this be useful? Lazyweb, are you listening?
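To make the idea concrete, here is a toy sketch of such an engine in Python. Everything in it is an assumption for illustration: the speaker labels, the hand-picked keyword sets (they are not verse text), the stopword list, and the crude word-overlap score, which stands in for whatever real matching algorithm you might prefer.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    ref: str
    speaker: str
    keywords: frozenset  # hand-picked keywords, NOT the verse text


# Hypothetical toy database: speakers and keywords are illustrative only.
PASSAGES = [
    Passage("Job 15:14-16", "Eliphaz", frozenset({"man", "pure", "evil", "god"})),
    Passage("Matt 5:45", "Jesus", frozenset({"sun", "rain", "evil", "good", "father"})),
    Passage("Matt 12:35", "Jesus", frozenset({"good", "evil", "treasure", "person"})),
    Passage("John 11:35", "narrator", frozenset({"jesus", "wept"})),
]

STOPWORDS = frozenset({"what", "about", "and", "the", "why", "does", "a", "of"})


def tokenize(text):
    """Lowercase the query, keep alphabetic words, and drop stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def best_matches(query, speaker=None, limit=3):
    """Return references ranked by how many query words overlap their keywords."""
    query_words = tokenize(query)
    scored = []
    for passage in PASSAGES:
        if speaker is not None and passage.speaker != speaker:
            continue  # optional speaker filter
        overlap = len(query_words & passage.keywords)
        if overlap:
            scored.append((overlap, passage.ref))
    # Stable sort: passages with equal overlap keep their database order.
    scored.sort(key=lambda pair: -pair[0])
    return [ref for _, ref in scored[:limit]]


print(best_matches("what about God and evil?", speaker="Jesus"))
print(best_matches("why does God allow evil?"))
```

A real system would match against the passage text itself, with stemming or concept matching, but the speaker filter would work the same way.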

June 16th, 2009

http://ref.ly for Bible References

My colleagues at Logos have launched http://ref.ly, a URL shortening service for Bible references: see this blog post. It provides the convenience of TinyURL (turning long unreadable URLs into something much more manageable), but unlike that service also provides readable, understandable content. Once you get past the prefix, you won’t have any trouble figuring out what verse http://ref.ly/Mk4.9 is referring to.

If you’re a Twitter person trying to shoehorn your message into 140-character tweets, you’ll like the fact that this gives you a brief and unambiguous way to both specify a Bible reference and link to the content behind it (the references resolve to the actual verse text at bible.logos.com). Since addressability matters, this is a good thing.

But it has precisely the same utility even if you’re not a Twitterhead (i’m not):

  • it clearly marks a string of characters as a Bible reference
  • it also normalizes the reference into a form that can be automatically processed

While it’s not quite a microformat, it’s really only a small step away from things like bibleref. In particular, if lots of people start using ref.ly references, it will be possible to process that content and understand things like what verses are most popular.
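As a rough sketch of what that processing could look like, the Python below pulls ref.ly references out of free text and tallies them. The regular expression is an assumption based only on the single example above (http://ref.ly/Mk4.9), and the second reference form (Jn3.16) is a guess by analogy, not documented ref.ly syntax.

```python
import re
from collections import Counter

# Assumed shape of a ref.ly URL, generalized from the one example in the post.
REFLY_PATTERN = re.compile(r"http://ref\.ly/([A-Za-z0-9.\-]+)")


def count_refly(text):
    """Count how often each ref.ly reference appears in a block of text."""
    counts = Counter()
    for match in REFLY_PATTERN.finditer(text):
        # Drop sentence punctuation that got swept up at the end of the URL.
        ref = match.group(1).rstrip(".-")
        if ref:
            counts[ref] += 1
    return counts


sample = (
    "Listen! http://ref.ly/Mk4.9 is short. "
    "So is http://ref.ly/Jn3.16. Once more: http://ref.ly/Mk4.9"
)
print(count_refly(sample).most_common())
```

Because the prefix marks the string unambiguously as a Bible reference, there is no need to guess whether “Mk4.9” in running text is a verse or a version number.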

What’s more, editors that recognize and automatically link URLs (like MS Outlook for HTML-based email, and MS Word) will now automatically make Bible links for you (like RefTagger does for blog posts), as long as you’re willing to tack on “http://ref.ly/” and live with the slightly non-traditional format. You don’t need to know anything about how to make a hyperlink in HTML: just a little extra syntax (14 characters, to be precise) moves these references toward much greater usefulness.

June 12th, 2009

Reading Tab-Delimited Data in Python with csv

I had a head-slapper this morning when i realized i’d been using custom code for a long time to do something that’s in a standard Python module. Here’s the sorry tale, in hopes of saving others from a similar fate.

I regularly use tab-delimited files for data wrangling: it’s a nice, lightweight format for table-structured data, and Excel makes a good enough editor for non-programmers to change things without messing up the format. Here’s a simple example, with a set of identifiers in the first column: a typical use case would be that somebody is editing the second column so you can map old identifiers to new ones.

Old New
Aphek1 AphekOfAsher
Aphek2 AphekOfSharon
Aphek3 AphekOfAram

It’s also very easy to read and write this kind of data in Python:

for row in open('somefile.txt', 'rb'):
    # strip the line ending first, or it ends up attached to the last column
    old, new = row.rstrip('\r\n').split('\t')
    # do something useful here

So i have a little utility reader module doing only a little more than this, stripping out comment lines, returning a list or a dict, etc., and i use this code all over the place. Then i recently needed to read some CSV (comma separated values) files, and stopped to ask The Question, which every programmer should ask before writing new code:

Hasn’t somebody else solved this problem already?

In the case of reading and writing CSV files, the answer was a quick and clear “yes”: there’s a standard Python module called csv that does just that, and nicely. So, reformatting the earlier data example as CSV would look like this:

"Old", "New"
"Aphek1", "AphekOfAsher"
"Aphek2", "AphekOfSharon"
"Aphek3", "AphekOfAram"

and there’s a nice DictReader class that (assuming your columns are unique and your first row identifies them) makes working with this data even easier.

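import csv
# skipinitialspace lets the reader cope with a space after each comma
reader = csv.DictReader(open('somefile.csv', 'rb'), skipinitialspace=True)
for row in reader:
    # do something more useful here
    # keys are the column headers exactly as written: 'New', not 'new'
    print row.get('New')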

If the first row doesn’t contain column headers, you can supply them to DictReader. This looks like overkill for this simple problem, but once you have multiple columns, need to check values or map them onto something else, or add other logic and processing, life is just much easier with a dictionary structure (for one thing, you get rid of meaningless mystery indexes and stop asking “what the heck is in row[1]?”).

Now comes the embarrassing part: i quickly breezed through the documentation, accomplished my immediate task, and moved on, missing one important detail that i just now (a month later!) figured out. Tab-delimited files are just a special case of a CSV file. My original, tab-delimited file works just the same way, once i construct the reader with tabs (rather than the default of commas) as the delimiter.

import csv
reader = csv.DictReader(open('somefile.txt', 'rb'), delimiter='\t')
for row in reader:
    # do something more useful here
    # keys are the column headers exactly as written: 'New', not 'new'
    print row.get('New')

There are a few other gotchas, the most important of which for me is that csv doesn’t handle Unicode. So if you have to read Unicode data, you’re back to reading the data directly, splitting lines on tabs, etc.
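For readers on Python 3, here is a hedged, self-contained sketch of the same idea. The snippets above use Python 2 syntax, and the Unicode limitation applies to that version: in Python 3 the csv module reads text (str) rather than bytes. The read_mapping helper and its name are illustrative assumptions; the sample data is the table from earlier in this post, held in memory instead of a file.

```python
import csv
import io

# The sample table from above, tab-delimited, in memory rather than on disk.
SAMPLE = (
    "Old\tNew\n"
    "Aphek1\tAphekOfAsher\n"
    "Aphek2\tAphekOfSharon\n"
    "Aphek3\tAphekOfAram\n"
)


def read_mapping(stream, fieldnames=None):
    """Map old identifiers to new ones from tab-delimited text.

    If fieldnames is None, the first row is taken as the column headers.
    """
    reader = csv.DictReader(stream, fieldnames=fieldnames, delimiter="\t")
    return {row["Old"]: row["New"] for row in reader}


mapping = read_mapping(io.StringIO(SAMPLE))
print(mapping["Aphek2"])

# Without a header row, supply the column names yourself.
headerless = io.StringIO("Aphek1\tAphekOfAsher\n")
print(read_mapping(headerless, fieldnames=["Old", "New"]))
```

With a real file, opening it as open(path, newline="", encoding="utf-8") in place of the io.StringIO object is the usual approach in Python 3.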

The best code is usually the code you didn’t write and don’t have to maintain. No matter how many times i stop and ask The Question, i still don’t do it enough.

May 22nd, 2009

The Most Important Verses? It Depends What You Mean

The title of this post is a deliberate take-off from a recent post at OpenBible.info entitled “What Are the Most Popular Verses in the Bible? It Depends Whom You Ask”. That post combines data from an earlier ESV analysis of search results, TopVerses.com, a BibleGateway (internal) study, and OpenBible data to present a list of 278 verses, all of which occur in the top hundred of at least one source’s “top 100” list. It’s interesting to see both how much disparity there is (only 13% occur in at least three of the four lists), but also how uneven the distribution is. As one commenter points out, it’s somewhat surprising that there are no verses from Revelation, and Old Testament narrative in particular is largely absent except for Genesis. John’s gospel has about as many popular verses as all the other gospels combined: there are only four verses from Mark (two of them from the often-questioned ending). Less surprisingly, perhaps, there are none from the shortest NT books (Philemon, Jude, 2-3 John). Altogether it’s an interesting study.

The larger question this raises for me is how we might come up with a comprehensive, global score for verses to indicate their importance for a variety of purposes. As the OpenBible post suggests, this depends both on what the source of the data is, but also on what your purpose is and what you mean by “important” (which is certainly different from “popular”, though not completely unrelated).

One useful purpose is ranking verses to present them in response to searches: TopVerses.com is explicitly organized this way, as indicated in this news article about the site. They don’t go into much detail about how they gathered their data, though the scope (37M references scoured from the web) is impressive. But there’s a subtle disparity here: their data is based on counting mentions (citations) in published web pages, but their use case is prioritizing search results, and these may be out of sync. The fact that a given verse is frequently published on the web doesn’t necessarily mean it’s the one you want at the top of the list when you’re doing a word-based search, for example. The other three sources seem perhaps better matched to ranking search results, since they’re derived from searches themselves.

Another key hitch in these endeavors is how to handle range references, both in processing source data and (for search purposes) in handling queries. For example, many Bible dictionaries frequently reference ranges of verses, sometimes extensive, multi-chapter ones. If you’re going to count these, you need to think carefully about how you do the counting so you don’t introduce bias (or, better, you select the bias that’s best suited to your purposes).

For example, in the TopVerses.com ranking John 3.1 is #26, despite the rather plain descriptive content with little obvious spiritual impact.

Now there was a Pharisee, a man named Nicodemus who was a member of the Jewish ruling council. (John 3.1, NIV)

While i can’t be sure, i strongly suspect this high rank is an unintended consequence of dis-aggregating ranges and whole chapter references from John 3. In fact, scanning top verses by chapter from John, the first verse in each chapter is very often the highest or second-highest ranked, and nearly always among the top ten. This probably says more about the counting methodology than the significance of those verses in particular. The Bible Gateway study focuses on ranges of no more than three verses to explicitly mitigate this problem.
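A small sketch shows how much the counting scheme matters. The citation data below is invented for illustration, and neither scheme is claimed to be what TopVerses.com or Bible Gateway actually does. One scheme gives every verse in a cited range a full count; the other splits a single count evenly across the range.

```python
from collections import Counter


def count_citations(citations, fractional=False):
    """Tally per-verse counts from (chapter, first_verse, last_verse) citations.

    With fractional=False every verse in a range gets a full count;
    with fractional=True one count is shared evenly across the range.
    """
    counts = Counter()
    for chapter, first, last in citations:
        verses = range(first, last + 1)
        weight = 1.0 / len(verses) if fractional else 1.0
        for verse in verses:
            counts[(chapter, verse)] += weight
    return counts


# Hypothetical data: two citations of verse 16 alone, three of the range 1-21.
CITATIONS = [(3, 16, 16), (3, 16, 16), (3, 1, 21), (3, 1, 21), (3, 1, 21)]

whole = count_citations(CITATIONS)
split = count_citations(CITATIONS, fractional=True)
print(whole[(3, 1)], whole[(3, 16)])
print(round(split[(3, 1)], 3), round(split[(3, 16)], 3))
```

Under full counting, a verse that nobody ever cites by itself still picks up a sizeable score from every range that happens to include it; under fractional counting it barely registers. Which bias you want depends on your purpose.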

Other Measures of Importance

Moving from popularity to importance, i can imagine several different factors that might be combined to produce a more general importance score:

  • citation frequency (based on some corpus). In the TopVerses.com approach, these are web pages, which provides a very large set of observations. A number of other digital text collections would also suit this purpose, and even allow segmentation by genre: for example, you get a very different ranking from the Anchor Yale Bible Dictionary compared to Easton’s (and neither has John 3.16 at the top of the list). See below for more about this.
  • search frequency, the basis for the other three sources in the OpenBible.info post. This could be refined further given data on follow-up activities. For example, depending on your application, verse searches whose results are then expanded into a chapter view or followed to the next verse might get a boost compared to those with no further action (this seems like a variant of “click through” rates used in search engine advertising)
  • content analysis (context-independent): this could have several different flavors.
    • word count: though John 11:35 gets mentioned more than you’d expect precisely because it’s the shortest verse in the (English) Bible, in general longer verses are more likely to be important. This could be refined further given a metric for important words (but now we’ve introduced a new problem: where does that data come from?), which could be used for weighting the counts.
    • We could do even better if, instead of counting words, we count concepts (and weight them). Assuming we think the concept of HUMILITY is important, we’d want verses expressing that concept to rank more highly, regardless of whether they used a more common word like “humility”, or a less common one like “lowly”. Converting words to concepts is a difficult challenge, however.
    • Connections to other data also affect importance. In some sense, every verse that reports words of Jesus is probably more important to a Christian than one whose importance is otherwise comparable, which is why we have the convention of printing Bibles with the words of Christ in red (a binary system for visualizing importance).
    • We might even consider negative factors: a lower rank for unfamiliar, hard-to-pronounce names, or “taboo” words.

Unlike TopVerses.com, i don’t see a particular need to provide a unique rank for each verse. If each verse has a score (to simplify the math, a decimal between 0 and 1 is a common approach), you can simply pick the top n verses that fit your purpose, and then order any ties canonically.
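Here is a minimal sketch of that scheme in Python. The factor names, weights, scores, and the truncated book list are all made up for illustration; the only ideas taken from the text are a score between 0 and 1, picking the top n, and breaking ties in canonical order.

```python
# Hypothetical, truncated canonical book order (a real list has 66 entries).
CANON = ["Gen", "Isa", "Mark", "John"]


def combine(factors, weights):
    """Weighted average of factor scores, each assumed to lie between 0 and 1."""
    total = sum(weights.values())
    return sum(weights[name] * factors[name] for name in weights) / total


def top_verses(scores, n):
    """Pick the n highest-scoring verses, ordering ties canonically."""
    def sort_key(ref):
        book, chapter, verse = ref
        return (-scores[ref], CANON.index(book), chapter, verse)
    return sorted(scores, key=sort_key)[:n]


# Invented scores keyed by (book, chapter, verse).
SCORES = {
    ("John", 3, 16): 0.9,
    ("Gen", 1, 1): 0.9,
    ("Mark", 10, 45): 0.7,
    ("Isa", 7, 14): 0.7,
    ("John", 1, 14): 0.5,
}

print(combine({"citation": 1.0, "search": 0.5}, {"citation": 1.0, "search": 1.0}))
print(top_verses(SCORES, 3))
```

Since ties are resolved by position in the canon, no verse needs a unique rank: two verses with the same score simply come out in Bible order.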

Comparing Dictionary Reference Citations

I did a small experiment to compare the most frequent reference citations in seven Bible dictionaries that are incorporated in Logos’s software (so this is citation frequency, not search frequency). I extracted and counted all the references, and then aggregated the counts across all seven: the top 20 references are shown below, along with how many “votes” they received in the OpenBible.info list. In the case of whole chapter references (four of the top ten), i’ve indicated with yes/no whether any verse from that chapter occurs in the OpenBible list.

There’s relatively little overlap between the two lists: only seven of these are in the OpenBible list. Many of these make sense given the different purposes of reference works: for example, Is 61.1 is a key messianic text. The high rank for 2 Ki 15.29 is initially puzzling, but probably results from being commonly cited in discussions of the conquests of Tiglath-Pileser and the Babylonian exile. Overall, this is probably much too small a sample to show the correspondences: i presume we’d find much more overlap in the top few hundred.

Reference     Aggregate Count    Count in OpenBible List
Jn 1:14       169.5              1
2 Ki 15:29    165.2              0
Is 61:1       159.8              0
Ac 1:13       151.7              0
Ge 1          150.0              yes
Ac 15         143.0              no
Ge 2:7        142.3              no
Ge 46:21      139.3              no
Jn 3:16       137.8              4
Ge 1:26       135.2              3
Is 7:14       134.3              1
Mt 28:19      130.2              3
Da 7:13       130.0              0
Ps 2:7        129.8              0
1 Pe 2:9      126.3              0
Ac 20:4       124.3              0
Lk 3:1        123.8              0
Mk 10:45      123.7              0
1 Sa 1:1      121.5              0
Ac 1:8        120.8              3

Details:

Conclusions

None of this is meant as criticism of the particular sites mentioned above. I strongly believe that any user-oriented, empirically-based data set is better than nothing, and in most endeavors like this, “the best data is more data”. * But with more data comes more complexity, and i’ve only scratched the surface here in considering some of the different factors.

The key point is this: if we want to measure something, we need to be clear up front about exactly what it is, and also what purpose we hope it will serve. I never stop being amazed at how often “obvious” approaches to data problems produce surprising results.


* In my recollection, this quote is attributed to Bob Mercer, a leading researcher in statistical language processing who was part of the IBM research group in the 1990s. I haven’t been able to verify a real source, however.

March 5th, 2009

Using GATE for Simple Text Mining

The Punditry Propagation Principle (P3)

If you say enough, long enough and loud enough, people start to believe you know something.

In accordance with the P3, people occasionally ask me things, apparently because they’ve been fooled into thinking i know something. But these conversations sometimes produce useful blog posts, so it’s not a total loss (for me, i mean: i can’t say for them).

In yesterday’s example of P3, an acquaintance of a Logos colleague had read some press release about my hiring, thought i knew something, and wanted to talk with me about a task he was performing for a client. He was charged with using Word to search documents for keywords from a particular subject domain, looking for occurrences of some information of interest. Then he would mark those instances with some special code, and in a subsequent second pass, somebody else with more domain expertise would go through, find the special codes, and pull out some bits of information. These would then get pasted into an Excel spreadsheet.

While the technically sophisticated might scoff at this low-tech approach to the problem of information extraction, I have no doubt that there are people all over the business world doing similar things. There’s simply too much information locked up in prose inside documents, created with no thought about how you might later extract structured data from them, and only the crudest of tools are easily available to help with the task.

Well, in this case, i actually did know something about the subject: i was an active researcher in the field of information extraction for quite a few years at BBN Technologies. We chatted for a while, and i gave him a number of caveats as to why this approach might be too heavyweight or otherwise inappropriate for his task, and then recommended he check out the General Architecture for Text Engineering (GATE), developed over several years by natural language processing researchers at the University of Sheffield. While GATE is not the most sophisticated of information extraction tools, it has several features to recommend it:

  • it provides important basic capabilities right out of the box, including I/O in standard formats, integration of a number of other useful tools
  • it includes a visual development interface
  • best of all, it’s open-source and freely available for Linux, Windows, and Mac OS X

I’m not going to provide a tutorial for using GATE (they’ve got plenty of documentation available already), but here’s a high-level overview of the steps, to help you decide if GATE might be a good fit for your task.

  1. Download and install. There are a lot of different pieces, most of which you don’t need for the simplest tasks.
  2. Get the data you want to process in some structured form. GATE can process XML, HTML, SGML, email, and plain text.
  3. Define a new Language Resource for your document. At this point you can also view it in the GUI, and if there were existing XML annotations, you can view them.
  4. Select one or more existing Processing Resources: out of the box, there are sentence splitters, tokenizers, and even ANNIE, a basic information extraction system
  5. Create an Application consisting of one or more processing steps, select your document, and run it.
  6. Now when you go back to view your document, you’ll see additional annotations have been added.

You can also use GATE as a manual annotation tool, and save the results in various formats. GATE’s written in Java and includes a plug-in architecture: so Java programmers can add new capabilities.

This screen shot shows a portion of Romans 16 from the World English Bible (one of the few modern Bible texts that’s freely available in XML form), annotated (imperfectly) using names from the Bible Knowledgebase (salmon), and highlighting the original notes (green).

GATE processing example

If any of you are still with me, let me emphasize this caveat: GATE is not a tool for casual use, and you should bring some technical expertise along with the expectation that you’ll have to invest a fair amount of time in figuring out how things work. But if these are the kinds of capabilities you need, it may make a lot more sense to start from GATE than to reinvent them yourself.

Postlude

This whole experience was a good example for me of the Principle of Reciprocity that Jesus teaches in Mark 4:24-25.

“Consider carefully what you hear,” he continued. “With the measure you use, it will be measured to you—and even more. Whoever has will be given more; whoever does not have, even what he has will be taken from him.”

I didn’t expect this initial conversation to produce anything useful for me: i just did it to help a friend of a friend. But it wound up “giving me more”, not only this blog post, but renewing my interest in GATE as a possible tool for related future work.

January 14th, 2009

XML Schema with Optional, Unbounded, Unordered Elements

This is so obscure i hesitate to blog about it, except that it took me so long to figure out that i’d love to save somebody else the trouble. You won’t care unless:

  • You’re designing an XML Schema definition (.xsd) to validate an XML file
  • You’re defining an element to contain regular text, or multiple elements, in any order, from zero to many times

Here’s an example: suppose you have a plain text description of events that includes people, places, and Bible references.

Jesus heals Simon’s mother-in-law (Matt 8:14-17; Mark 1:29-34; Luke 4:38-41)

You want to link person references with a Link element, Bible references with a Reference element, and otherwise leave the plain text as is. This results in something like this (using square brackets since otherwise WordPress gets confused):

[Link]Jesus[/Link] heals [Link]Simon[/Link]’s mother-in-law ([Reference]Matt 8:14-17[/Reference]; [Reference]Mark 1:29-34[/Reference]; [Reference]Luke 4:38-41[/Reference])

Now imagine several of these in the same element, so potentially you can have any arbitrary sequence of Links, References, and plain text, in any order, any number of times. Describing this with a BNF grammar is trivial:

LinkRef ::= Link | Reference
TextItem ::=  ( text | LinkRef )+

A cursory reading of the XML Schema description (which i’d never actually done before, instead depending on XMLSpy which generally lets me avoid thinking that hard) might make you think grouping models like sequence, choice, and all in conjunction with attributes like minOccurs and maxOccurs would do what you need. But there’s a surprisingly complex set of interactions between these, that i still don’t really understand, and so what seemed so simple proved surprisingly hard. Here are a few examples of what i tried, where XMLSpy’s validation model for XSD files (which i’m assuming is correct) wouldn’t allow it:

  • while all is for an unordered group of elements, it’s restricted to maxOccurs=1. So it doesn’t handle unbounded occurrence (though it does allow minOccurs=0, i.e. optionality). Furthermore, it can’t be nested inside other model groups like sequence.
  • choice groupings can be neither optional nor unbounded.
  • trying to specify multiple occurrences of both Link and Reference, each both optional and unbounded, is flagged as an ambiguous model.

The solution i finally discovered (after embarrassingly many other permutations, more by trial and error than anything else):

  • define a LinkRef group that allows a sequence of either Link or Reference, both optional and unbounded (zero to many occurrences)
  • the TextItem (enclosing parent) element allows an optional and unbounded sequence of LinkRef groups.

For the more visually oriented, here’s how it looks in XMLSpy:

TextItem and LinkRef Grouping
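In text form, one reading of that solution looks like the schema embedded below. This is a sketch under assumptions, not a copy of the XMLSpy design: the mixed="true" attribute (which lets plain text appear between child elements) is not mentioned above, and Link and Reference are typed as xs:string only for brevity. The Python wrapper merely checks that the schema is well-formed XML and inspects the group reference, since the standard library has no XSD validator.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Assumed rendering of the LinkRef / TextItem design described above.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- LinkRef: either a Link or a Reference -->
  <xs:group name="LinkRef">
    <xs:choice>
      <xs:element name="Link" type="xs:string"/>
      <xs:element name="Reference" type="xs:string"/>
    </xs:choice>
  </xs:group>
  <!-- TextItem: text mixed with zero to many LinkRef groups, in any order -->
  <xs:element name="TextItem">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:group ref="LinkRef" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# Parsing only proves well-formedness; it does not validate the schema itself.
root = ET.fromstring(SCHEMA)
group_ref = root.find(".//{%s}complexType/{%s}sequence/{%s}group" % (XS, XS, XS))
print(group_ref.attrib)
```

The optional-and-unbounded occurrence constraints sit on the reference to the group, not on the Link and Reference elements inside it, so each repetition is a fresh choice between the two.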

January 8th, 2009

Addressability Matters

Ever since Adam named the beasts (Gen 2:19-20), labels have mattered to humanity: it’s pretty hard to hold a conversation if you have to start with “you know that really big gray beast with the cute little ears that sits in the river all day with just its eyes showing?”, instead of just “hippopotamus”.

Information on the web works the same way. Most (but not all!) web pages have the equivalent of a name, their Uniform Resource Locator (URL), which tells your browser how to bring up the page. But too many conversations about web pages are still like the hippopotamus conversation: “just go to www.frooble.com, then type ‘shebang’ in the search box, and look about half-way down the page on the left side …”. In other words, that little tidbit of information isn’t addressable: i can’t give you a name for it, i can only tell you to travel over the river, through the woods, and then turn left at the 3rd oak tree.

Though there’s usually no good technical reason, this is still so often true for our web-enabled world. For example, i admit to my chagrin that i only just now figured out the URL for my Facebook profile, even though i had looked for it (half-heartedly) several times previously. (I happened to stumble over somebody else’s, saw the pattern, and then plugged my own name and ID in the URL instead). Having a URL that’s both explicit and understandable enables this kind of URL hacking, which is a really powerful technique.

Here’s a small example (combined with a shameless plug). The HTML designers for the upcoming Bible Tech conference have added page targets for speakers to the Speakers page. So even though there’s one long list, you can get to just the right spot on the list by following the link to my talk. And if i show you the URL

http://www.bibletechconference.com/speakers.htm#SeanBoisen-2009

and explain the schema ([baseURL]#[Firstname][Lastname]-year), you can get to my talks from last year too. That’s a nice bit of design, and part of a much larger and important architectural practice called Representational State Transfer or REST. As another example, you can probably figure out how to change this URL

http://bible.logos.com/passage/NIV/Ge 2.19-20

to get you to Mark 4.1-12 in the ESV instead (though you might stumble if you use a colon instead of a period to separate chapter and verse).
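Once a URL scheme is explicit, building addresses becomes a one-liner. The Python sketch below constructs both kinds of URL from the patterns shown above. The function names are hypothetical, the “Mk” abbreviation is an assumption, and percent-encoding the space in the reference assumes the site treats %20 the same as the literal space in the example.

```python
from urllib.parse import quote


def speaker_url(first, last, year):
    """Follow the schema [baseURL]#[Firstname][Lastname]-year described above."""
    return "http://www.bibletechconference.com/speakers.htm#%s%s-%d" % (first, last, year)


def passage_url(version, reference):
    """Build a passage URL; chapter and verse are separated by a period."""
    # quote() turns the space in a reference like 'Ge 2.19-20' into %20.
    return "http://bible.logos.com/passage/%s/%s" % (version, quote(reference))


print(speaker_url("Sean", "Boisen", 2009))
print(passage_url("ESV", "Mk 4.1-12"))
```

Swapping 2009 for 2008, or NIV for ESV, is exactly the kind of URL hacking that only works when the address scheme is readable.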

A lot of important things only become possible once you start to provide names for your resources. That’s a big part of the justification for the complex tangle of ideas called the Semantic Web, or if that’s too high-falutin’ for you, just call it smarter web design for information integration.

PS: i realized later it wasn’t just that i couldn’t figure out how to construct a Facebook URL: you have to make a badge first to get an addressable URL, which seems pretty non-obvious!

December 11th, 2008

XSLT and Namespace Matching

I have to admit to something of a love-hate relationship with programming in Extensible Stylesheet Language Transformations (XSLT). On the love side, it’s a very powerful language, and its structure-transforming orientation makes certain tasks easy that would be really hard in a more procedural approach. So when i started getting involved with lots of XML data a number of years ago, it wasn’t long until a basic knowledge of XSLT became an essential part of my toolkit.

However, some of its power comes from the fact that there’s a fair amount of implicit processing going on behind the scenes. Of course, when that does just what i need, that’s great: but when it doesn’t, and i don’t understand why, i quickly slide into “this language drives me nuts”. That problem is aggravated by the fact that i don’t use it all the time: so i don’t get really good at it, my understanding of the processing model is just deep enough to accomplish my current task, and the little tricks that are an essential part of using any language well recede too quickly into the dim mists of my brain.

My latest love-hate experience comes from transforming some XAML code (the details of why we need to do this are too painful to recount, but are all too typical of commercial data-slinging environments like Logos). Here’s a simplified fragment of the input XML, where the basic task is recomputing the height and width:

XML fragment
You might naively think that a matching statement (the heart of a lot of XSLT processing) like xsl:template match="/Viewbox" is the way to get started with this. I thought so too, but then spent an embarrassingly long time getting no output whatsoever because it didn’t match. I could find the first element with tricks like match="/*", but couldn’t find it directly.

Those of you who are a little more XSLT-savvy than i are now shaking your heads and tsk-tsk’ing at my obvious mistake: the XML document (and hence the Viewbox element) has its own namespace (xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"). It’s a small comfort to me that i lingered longer in my error than i might have because, testing my XPath expressions first in XMLSpy (as i often do), the expression /Viewbox matches just like you’d expect. But XMLSpy’s XPath evaluation applies the document’s default namespace to the unprefixed name, while XSLT 1.0 treats an unprefixed name in a match pattern as belonging to no namespace at all (unless you bind a prefix to the namespace and use it). So first i had to realize that one tool wasn’t quite telling me the same truth as the other.

But even after guessing it was a namespace problem, it still took me far longer than it should have to figure out a solution. I tried a few stab-in-the-dark approaches like putting namespace declarations in the stylesheet, and looked (in vain!) in my XSLT book trying to find a clear explanation of matching and namespaces. I’m sure it’s there somewhere, but it’s a big book, and i often have a hard time finding the right information in it (maybe the second edition does better with this: i only have the first edition handy).

This post provided one solution: ignore the namespace altogether with a matching expression like match="/*[local-name()='Viewbox']". Though perhaps a little clunky, that does the job. I found a slightly better approach here, which is what i finally adopted: define a namespace prefix in the stylesheet (not the same as just copying the namespace declarations!), and then put that prefix in your matches. Specifically, i added xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml/presentation" to the namespaces for the stylesheet (the XML source declares this as its default namespace with no prefix at all, but that’s ok: only the URI has to match), and then used match="/x:Viewbox" in the xsl:template. After this epiphany, the rest worked as i knew it would all along :-/.
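
The same rule shows up outside XSLT, and it can be easier to see in a few lines of script. What follows is only a minimal sketch using Python's standard xml.etree.ElementTree, not the stylesheet from this post. The SAMPLE fragment, its Canvas child, and the attribute values are made up for illustration, since the real input survives only as an image.

```python
import xml.etree.ElementTree as ET

XAML_NS = "http://schemas.microsoft.com/winfx/2006/xaml/presentation"

# Hypothetical stand-in for the XAML input: the default namespace
# declaration on the root applies to every unprefixed element inside it.
SAMPLE = (
    '<Viewbox xmlns="%s" Width="100" Height="50">'
    '<Canvas Width="100" Height="50"/>'
    '</Viewbox>' % XAML_NS
)

root = ET.fromstring(SAMPLE)

# The element's real name is the pair (namespace URI, local name).
print(root.tag)  # {http://schemas.microsoft.com/winfx/2006/xaml/presentation}Viewbox

# An unprefixed name means "no namespace", so this finds nothing,
# just like match="/Viewbox" in the stylesheet.
unprefixed = root.findall("Canvas")

# Binding a prefix of our own choosing to the same URI works,
# like xmlns:x="..." plus match="/x:Viewbox".
prefixed = root.findall("x:Canvas", {"x": XAML_NS})

# Ignoring the namespace and comparing local names only,
# like the local-name() workaround.
by_local_name = [el for el in root.iter() if el.tag.split("}")[-1] == "Canvas"]

print(len(unprefixed), len(prefixed), len(by_local_name))  # 0 1 1
```

The prefix itself (x here) is arbitrary: what has to agree between the query and the document is the namespace URI.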

Notes to self:

  • when things don’t work the way you think they should, stop just trying different approaches you don’t understand, and figure out the underlying problem
  • if you can’t find what you need in one reference book, try another
  • repeat out loud as needed: namespaces are a pain, but they’re good for me …
  • go to Google sooner, since it knows all (if you can just find it!)

PS: does anybody else (besides me and this guy) have the problem that WordPress always eats your XML code? I haven’t yet figured out a way to get it past the editor and posting process, which is why it’s just an image (!) above.

December 9th, 2008

A RefTagger Hack

Like Logos’ RefTagger? Me too, so much so that i’m starting to feel ripped off in contexts where i don’t have easy access to the text behind a Scripture reference. If you’ve only got one reference, no big deal: you just look it up in Libronix or some other Bible software, or one of the many excellent sites on the web. But if you’ve got a whole string of them, and you’re just scanning rather than doing in-depth study, it’s a pain to have to look them up one by one.

Since i’m spending a lot of time right now reviewing Scripture references for entities in the Bible Knowledgebase, i wrote a quick CGI hack i call the RefTaggerizer to fill the gap. It couldn’t be much simpler: there’s an HTML form that accepts a block of text as input, and then redisplays itself on the resulting page (a ‘self-posting form’, in the jargon i learned today). The magic is the embedded RefTagger code: any Bible references in the text that you submit get turned into dynamic links. There’s no real novelty here: it’s just a convenient way to transform a string containing references into hyperlinks (without the bother of creating a full-up HTML page on a server).
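
For readers who want the shape of a self-posting form without downloading anything, here is a minimal sketch in Python 3. It is not the reftaggerize.py script itself: the function names, the page layout, and the "text" field name are all assumptions made for illustration, and REFTAGGER_SNIPPET is a placeholder where the embed code from the RefTagger site would be pasted.

```python
#!/usr/bin/env python3
import html
import os
import sys
from urllib.parse import parse_qs

# Placeholder (assumption): paste the embed code from the RefTagger site here.
REFTAGGER_SNIPPET = "<!-- RefTagger embed code goes here -->"

# An empty action attribute posts the form back to the same URL.
PAGE = """<html>
<head><title>RefTaggerizer sketch</title></head>
<body>
<form method="post" action="">
<textarea name="text" rows="10" cols="60">{raw}</textarea>
<br><input type="submit" value="Tag references">
</form>
<div id="output">{shown}</div>
{snippet}
</body>
</html>"""


def render_page(text=""):
    """Return the form page, echoing the submitted text below the form."""
    safe = html.escape(text)  # never echo raw user input into HTML
    shown = safe.replace("\n", "<br>\n")
    return PAGE.format(raw=safe, shown=shown, snippet=REFTAGGER_SNIPPET)


def read_posted_text(environ, stream):
    """Pull the 'text' field out of a URL-encoded POST body, if there is one."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    # Reading characters rather than bytes is fine for ASCII form data.
    body = stream.read(length) if length else ""
    return parse_qs(body).get("text", [""])[0]


if __name__ == "__main__":
    # CGI convention: headers first, then a blank line, then the page.
    print("Content-Type: text/html; charset=utf-8")
    print()
    print(render_page(read_posted_text(os.environ, sys.stdin)))
```

On the first visit there is no posted body, so the page shows an empty form. After a submit, the same script returns the form again with the text echoed underneath, which is where the embedded RefTagger code would find references to link.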

To use this, you need to:

  • have a web server running on your local system that supports CGI (i use Apache’s HTTP Server, which is a breeze to download and install: IIS ought to work just as easily).
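  • configure the server to allow execution of CGI files (the CGI Tutorial for Apache 2.2 is here). Plain CGI shouldn’t need mod_python: you just need a Python interpreter installed where the server can find it (on Windows, Apache uses the #! line at the top of the script by default).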
  • download the Python code reftaggerize.py and put it in your CGI directory (For Apache 2.2 on Windows, that’s probably C:\Program Files\Apache Software Foundation\Apache2.2\cgi-bin)

This is a really simple script, and if you’re a Perl-monger instead of a Pythonista it should be easy to rewrite. Alas, i tried installing this on my web site (at http://asamasa.net/cgi-bin/reftaggerize.py), but i haven’t done any CGI here before and it doesn’t work: it just displays the code in the browser rather than executing it. I don’t know if this is a configuration problem with Python, or what. So i can’t give you a snazzy demonstration (but if somebody out there sets it up and gets it working, let me know and i’ll post the URL).