God’s Word | our words
meaning, communication, & technology
following Jesus, the Word made flesh
May 31st, 2007

Hebrew Cryptography

You might infer (correctly) from the things i blog about that i spend a lot of time looking at Bible names, many of which are rather obscure. Today i looked at a name so obscure, it’s not quite a name at all: instead, it’s a codeword!

“Thus says the Lord: “Behold, I will stir up the spirit of a destroyer against Babylon, against the inhabitants of Leb-kamai,” (Jeremiah 51:1, ESV)

You won’t find “Leb-kamai” in your Bible atlas: it’s a codeword using a substitution cipher called Atbash (in English: most Bible dictionaries have “athbash”). Atbash is a one-for-one substitution of letters, where the first letter of the alphabet becomes the last, the second becomes the second-to-last, and so on. So the Hebrew letters lbqmy become kśdym (Chaldeans). The name is derived from the first two pairs of substitutes: aleph for tau, beth for shin.
Scholars don’t really know why Jeremiah decided to use code words in just a few cases (25:26, 51:1, and 51:41): he names names in plenty of other instances.
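
Since the substitution is purely mechanical, it’s easy to sketch in code. Here’s a minimal Python illustration over the 22 Hebrew consonants in transliterated form (the one-ASCII-symbol-per-letter transliteration here is my own convenience, not a scholarly standard):

    # Atbash: map the i-th letter of the alphabet to the (n-1-i)-th.
    # The 22 Hebrew consonants in order, one ASCII symbol per letter
    # (an illustrative convention: ' = aleph, X = het, T = tet,
    # ` = ayin, C = tsade, S = sin/shin).
    HEBREW = list("'bgdhwzXTyklmns`pCqrSt")
    ATBASH = dict(zip(HEBREW, reversed(HEBREW)))

    def atbash(word):
        """Apply the Atbash substitution letter by letter."""
        return "".join(ATBASH.get(c, c) for c in word)

    print(atbash("lbqmy"))  # -> kSdym, i.e. kasdim, "Chaldeans"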

May 27th, 2007

More on Bible Reference Microformats

OpenBible.info picked up on my post about a microformat for Scripture references, and very helpfully spelled out a number of important details, along with some clearer thinking on a few points.

  • Use the cite element (rather than abbr as i suggested). There’s more information here at microformats.org about citation formats. I agree this is a better match with the intended semantics of the element (what was i thinking?).
  • Treat the title attribute as optional (as it is for XHTML in general): once you’ve established (via the class attribute) that something’s a Bible reference, you only need a title if the associated reference isn’t clear.
  • For links, use the a element with class="bibleref".
  • Forget about biblerange for indicating spans of text: once identified as a Bible reference, any reasonable representation of a range can probably be parsed. (I didn’t feel very strongly about this one to start with). I also think the suggestion for treating compound references makes sense: essentially, provide a title to disambiguate sub-elements that aren’t clear.
  • The suggestion for putting an optional translation identifier first in the title attribute seems reasonable to me too. (The sketch below shows these conventions in action.)
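
To make that concrete, here’s a rough sketch of extracting these references from marked-up XHTML. This is my illustration of the conventions above, not a ratified spec; the URL is illustrative, and a real consumer would use a proper HTML parser rather than a regex:

    import re

    html = '''<p>Paul summarizes salvation by grace in
    <cite class="bibleref" title="ESV Eph 2:8-9">Ephesians 2:8-9</cite>;
    compare <a class="bibleref"
    href="https://www.biblegateway.com/passage/?search=John+3:16">John 3:16</a>.</p>'''

    # Find bibleref elements; prefer the normalized title attribute,
    # falling back to the element's text content.
    PATTERN = re.compile(
        r'<(cite|a)\b([^>]*class="bibleref"[^>]*)>(.*?)</\1>', re.S)
    for m in PATTERN.finditer(html):
        attrs, text = m.group(2), m.group(3)
        title = re.search(r'title="([^"]*)"', attrs)
        print(title.group(1) if title else text)

Run over a post, this yields the normalized references (“ESV Eph 2:8-9”, “John 3:16”) ready for indexing or link generation.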

I’m still a bit uneasy about internationalization issues, though. title = "John 3:16" adheres to existing standards in the English-speaking world, but French would have Jean 3:16, Spanish Juan 3:16, etc. Though any of these is relatively unambiguous as an identifier within its own language context, there’s always a political side. In principle, a properly-formatted web page would indicate the language of the content, which a parser could then use to parse references in a language-specific fashion.
Frankly, i’m still a little unclear about whether (in microformat terms) this is “semantic XHTML” or a microformat (though i think it’s the former). But this proposal seems clear enough to me to move forward with broader adoption: bibliobloggers, are you in? Some possible next steps:

  • convince others to adopt this, and try to gather momentum
  • design and promote a badge to indicate your blog uses the bibleref standard?
  • lobby authors of Bible reference plugins for blogging platforms to adopt this
  • a conversion service to take RSS feeds (in the several popular formats) that use bibleref markup and enhance them with links to an online Bible (similar to an RSS to GeoRSS converter): this would help demonstrate the utility of the additional markup effort (a rough sketch follows this list)
  • once the standard is more widely adopted, see if Technorati and other aggregators would agree to pick it up in their meta-data crawling
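
The linkifying step at the heart of that conversion service is nearly trivial once the markup is in place. A sketch only: the Bible Gateway query format is my assumption, and a real service would parse the feed and its HTML properly rather than regexing over it:

    import re
    from urllib.parse import quote_plus

    CITE = re.compile(r'<cite\b([^>]*class="bibleref"[^>]*)>(.*?)</cite>', re.S)

    def linkify(item_html):
        """Wrap each bibleref citation in a link to an online Bible."""
        def repl(m):
            attrs, text = m.group(1), m.group(2)
            title = re.search(r'title="([^"]*)"', attrs)
            ref = title.group(1) if title else text   # prefer the normalized form
            url = "https://www.biblegateway.com/passage/?search=" + quote_plus(ref)
            return '<a class="bibleref" href="%s">%s</a>' % (url, text)
        return CITE.sub(repl, item_html)

    print(linkify('<p>See <cite class="bibleref">Rom 12:2</cite>.</p>'))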

As an aside for those who use Technorati-style tags in their posts (i do, though i haven’t properly exposed them in my WordPress template yet): i’m going to use “bibleref” as a tag for additional posts in this area, and i encourage you to do the same.

May 24th, 2007

Semantic Technology Conference

I’m just wrapping up at the Semantic Technology conference in San Jose, having given my presentation this morning on the Bible Knowledgebase work i’m doing at Logos. I haven’t gotten the slides up yet, but they’ll go here once i do:

http://semanticbible.org/other/presentations/2007-semtech/BibleKnowledgebase.ppt

I was pleased to have two former BBN colleagues and real Semantic Web gurus, Mike Dean and Matt Fisher, in attendance, and there was quite a lot of interest as well as good questions.

Some other random posts on SemTech 2007:

  • InfoWorld article noting how many more suits and shoes are present this year (the VC $$$’s are coming)
  • Chris Halaschek from the MindSwap group at University of Maryland
  • Chiara Fox finally “gets” ontology

May 24th, 2007

Annotating Scripture References in Blog Posts: a Modest Proposal

Lots of people write blog posts that contain references to the Hebrew or Christian Bible (henceforth I’ll simply say “Bible”). I’d like to propose that blog authors adopt a few very simple conventions that, in the spirit of microformats, would add semantic richness and extended value to these posts without requiring fancy new languages or a lot of author overhead. This seems like an ideal opportunity to add value, since

  • there are already longstanding and widespread conventions for representing these references textually (the principle of “adapt to current behaviors”)
  • the format requirements are quite simple (maybe this is more like a “nanoformat” than a microformat) (the principle of “as simple as possible”)

Benefits:

  • You could more readily find other blog posts that are talking about passages you’re interested in
  • Software agents could automatically insert hyperlinks to any of several publicly-available web sites with Bible texts like Bible Gateway, the ESV bible, the NET Bible, etc.
  • We’d have a source of data to tell us which passages are talked about most (or least)
  • Unlike blog prose, such an approach could be language-independent (though this might require cross-language agreement on book names, or at least language indicators)

Approach:

  • Bible references have a simple structure, in the simplest case the three elements of book name, chapter, and (optionally) verse. Book names can either be spelled out, or use an abbreviated form like “Rev” for Revelation. There’s at least one existing standard for book names, embedded in the (300-page!) SBL Manual of Style.
  • Simple references specify a given single text (at some level of granularity). Range references identify a contiguous span like John 3:1-20. Compound references combine one or both of these, e.g. Eph 1, 2:8-9. These are in order of decreasing importance: just starting with simple references would be a big step forward.
  • An optional identifier for a specific version or translation (e.g. KJV for King James Version, ESV for English Standard Version, etc.) would be valuable, though it’s not essential.
  • Perhaps the authors of plug-ins for popular blogging platforms (like Scripturizer for Typepad, or the ESV plugin for WordPress) could be persuaded to include the microformat in their output: then users of those plug-ins wouldn’t even have to take any special steps.

Candidate Format

Use the abbr element from XHTML, with “bibleref” as the class attribute, and a normalized notation as the title attribute.

This follows the spirit of the abbr-design-pattern, with human-friendly text and a machine-readable title attribute (so it doesn’t matter how the textual content is formatted, as long as the title is machine-parseable, or reasonably so). This could be extended with something like class="biblerefrange" for range references like Eph 2:8-9. While we could try to forge agreement on how to format the title attribute, in practice it won’t matter except for very obscure cases. Simply indicating that it’s a Bible reference will be enough to render 99% of the cases fully parseable (assuming the book and chapter are indicated: a reference to “3:16”, in the context of a discussion of John’s Gospel, wouldn’t work).
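
To make the “reasonably parseable” claim concrete, here’s a rough cut at parsing the title notation (a sketch: real book-name handling needs a normalization table mapping “Rev”, “Revelation”, etc. to canonical names, which is stubbed out here):

    import re

    # Sketch: parse "Book Chapter[:Verse[-EndVerse]]" from a title attribute.
    REF = re.compile(r"""
        (?P<book>(?:[1-3]\s+)?[A-Za-z]+\.?)   # "John", "1 Cor", "Rev."
        \s+(?P<chapter>\d+)                   # chapter number
        (?::(?P<verse>\d+)                    # optional ":verse"
          (?:-(?P<end>\d+))?                  # optional "-endverse"
        )?""", re.X)

    for title in ["John 3:16", "Eph 2:8-9", "1 Cor 13", "Jude 1"]:
        print(REF.match(title).groupdict())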

May 23rd, 2007

Conrad’s Triangle of Knowledge Representation

Sitting in Chimezie Ogbuji’s talk at the Semantic Technology conference about “Tools for Next Generation of Content Management Systems: XML, RDF, and GRDDL”, i watched him put up this picture from Conrad Barski, a nice graphic comparison of how the Guy in the Garage, the Writer, and the Scientist view the general problem of Knowledge Representation.
[Image: Conrad’s Somewhat Accurate Triangle of Knowledge Representation]

You need to read the whole article to understand what’s being shown here, but it’s a nice clear exposition of the subject, and i wanted to capture the graphic.

May 23rd, 2007

Counting Scriptures in Blog Posts

Matt Dabbs has an interesting series of posts about using Google’s blog search to determine how frequently different chapters of the Bible are referred to by bloggers, starting with the most blogged scriptures (and then some followups on least blogged Scriptures).

This data would be really interesting to nail down, but (ever the data quibbler) i have some questions about how this works. While i don’t know the details of Matt’s methodology, i expect some typical keyword search problems take their toll here too:

  • There are multiple ways to specify a reference (John 3, Jn 3): this has the potential to reduce recall
  • Some matches for a given reference may not actually count: for example, the 12th match i found for “John 3” is actually a roster for a Civil War infantry regiment containing the phrase “Adams, John: 3/9/1864”. It’s only one case, but i wonder how many more are lurking.
  • A similar problem occurs with some Scripture references themselves: “John 3” also matches “1 John 3”! I wonder if that helps to account for the popularity of John 1, 2, and 3, all of which made the top 15? In these cases, you could subtract the count for “1 John 3” from the count for “John 3” (see the sketch after this list).
  • I wondered why only John and Matthew’s Gospels made the top 15: but queries for Mark (or Mk) produce results that are full of non-Biblical acronyms and other misses, and i’ll bet Luke does too.
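
Here’s a minimal sketch of that correction, applied at the match level rather than by subtracting counts afterward: a negative lookbehind excludes the “1 John 3” (and 2 and 3 John) hits directly:

    import re

    # Count "John 3" mentions without also counting "1/2/3 John 3".
    john3 = re.compile(r"(?<![123]\s)\bJohn\s+3\b")

    text = "Notes on John 3 and 1 John 3; also John 3:16."
    print(len(john3.findall(text)))   # -> 2: the "1 John 3" hit is excluded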

None of this is to put down Matt’s efforts: even noisy data is more instructive than silence. But this kind of counting can be very tricky business. The ESV blog discussed this topic based on Bible searches on their site a while back.
Given the full text of all these posts, i’ll bet the vocabulary is distinct enough that a statistical text classifier could be trained to determine with high reliability which ones actually referred to Biblical discussions, and which ones didn’t.
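
If anyone wanted to test that hunch, the basic recipe is short. Here’s a sketch using scikit-learn (the four example “posts” are stand-ins i made up: a real experiment would need a few hundred hand-labeled posts):

    # Bag-of-words classifier for "Biblical discussion or not".
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    posts = [
        "Jesus answered Nicodemus in John 3 about being born again",
        "exegesis of the Greek text and its Old Testament background",
        "Adams, John: 3/9/1864, mustered into the 12th infantry",
        "regiment rosters and muster rolls from Civil War archives",
    ]
    labels = [1, 1, 0, 0]   # 1 = Biblical discussion, 0 = not

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(posts, labels)
    print(model.predict(["a sermon on John 3:16 and the love of God"]))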

May 16th, 2007

Visualizing Text with LiveInk

Bob Pritchett recently blogged about visual text formatting: one example is LiveInk. I’ve wondered for some time how we might improve reading now that we don’t really need to have one-fontsize-fits-all, linear textual arrangements (for example, sizing text by prominence).

Apropos of this, i’m reading Robin Williams’ The Non-Designer’s Design Book (hat tip to Coding Horror), a good starting point for people who aren’t professional designers but still have to do some kind of design (pretty much all of us these days). Two of the basic principles are Alignment and Proximity: elements that are close or aligned will seem related (whether they really are or not!).

Back to LiveInk: here’s one of their demo examples.

[Image: LiveInk sample]

While breaking the sentence up definitely makes it more scannable, i have some trouble parsing the result, and i think Williams’ principle of Alignment helps explain it. For example, the alignment of “means” and “among adults” makes me think they’re somehow related. But they’re not: “among adults” modifies “physical activity”, and the linguist in me thinks it ought to therefore be moved farther to the right. Of course, you can only push right so far before running out of room, and maybe that’s the practical explanation for the alignment here (LiveInk’s site suggests they have solid research behind what they do).

May 16th, 2007

Bible Knowledgebase: What, Why, How

(Post 2 in a series on building the Bible Knowledgebase, unfortunately delayed by a plague of web hosting problems)

The What: BK (Bible Knowledgebase) is reference information about the world of the Bible, represented using Semantic Web standards and tools. The Semantic Web refers to the move from a world of networked pages displayed for humans (HTML, the vast majority of the current World Wide Web) to semantically-characterized information that is machine-readable (and therefore supports a variety of uses like search, browsing, visualization, etc.). Tim Berners-Lee likes to describe it as moving from a web of documents (meant to be read by humans) to a web of data (meant to be read by computers).

Initially, the scope is every named thing in the Bible (people and places are the bulk of the cases, but there are also languages, ethnic groups, holidays, and numerous others). Eventually i hope to extend this to unnamed but described entities: for example, the Samaritan woman of John 4 is never named, but we know her ethnicity, where she lived, some people she interacted with, and other facts.

The Why: the Bible Knowledgebase will support

  • Knowledge exploration and discovery: just as hyperlinked web pages lead you to new information, linked facts about individuals will lead to other individuals or resources about them.
  • Smart (semantic) indexing: for a given passage, you’ll know which John/Mary/James is referred to, not just the collection of individuals who share that name. Searching will provide more precisely targeted results, because reference material will be disambiguated.
  • Visualization: rich data sets support graphic displays that give an overview of information that would otherwise be scattered across numerous different passages

The How:

  • I’ve designed an OWL ontology that captures an initial set of entity types and relationships between them
  • Information from Logos’ Biblical People feature and New Testament Names has been merged into the initial dataset
  • Both the ontology and the instance data will be extended to incorporate additional information. There’s no principled stopping point, but i expect to grow BK from its current size of ~100k RDF triples to perhaps 100-1000 times that size. (A toy sketch of the instance data follows this list.)
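
Here’s that toy sketch, in Python with rdflib. The namespace URI and the class and property names are invented placeholders for illustration, not the actual BK ontology:

    from rdflib import Graph, Literal, Namespace, RDF

    BK = Namespace("http://example.org/bk#")   # placeholder namespace

    g = Graph()
    g.bind("bk", BK)

    g.add((BK.JohnTheBaptist, RDF.type, BK.Person))
    g.add((BK.JohnTheBaptist, BK.name, Literal("John")))
    g.add((BK.JohnTheApostle, RDF.type, BK.Person))
    g.add((BK.JohnTheApostle, BK.name, Literal("John")))
    # Disambiguation lives in the relationships, not the shared name:
    g.add((BK.JohnTheBaptist, BK.baptized, BK.Jesus))

    print(g.serialize(format="turtle"))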

In the (perhaps unlikely) event that you’re in the intersection of

  • Blogos readers, and
  • attendees at the Semantic Technology Conference in San Jose next week

i’ll be giving a presentation about this work Thursday morning.

May 14th, 2007

Can Your Editor Do This?

If my blog had aural feedback, no doubt i’d hear a few snickers for saying this out loud, but here goes:

i use Emacs.

There, i said it. With all the tightly integrated development environments available these days (Visual Studio, Eclipse, etc.), you may wonder why anybody would use such an old-school tool. In fact, i’ve been using Emacs in various flavors for more than 20 years now: this request for an Emacs mode for an exotic programming language called Icon might even be my oldest extant trace on the Internet (though we didn’t call it that back then, kids). I’m pretty stale now, but at one point i considered Emacs Lisp one of the programming languages i’d put on a resume (if that sentence doesn’t make sense to you, it shows that you don’t know enough about Emacs to snicker at it).

Sure, it’s got a steep learning curve, it’s really geeky, and it’s not the hammer for every nail. I don’t write UI tools in it anymore, though for a while it was a pretty good choice for that. But there are still things i can do easily in Emacs that i don’t know how to do elsewhere without a lot more work: that’s one definition of what makes a good tool.

One use case i encounter a lot when groveling through data is progressive refinement. Typically that means a large data set (thousands of lines or more), where i need several steps to filter out certain values (that i don’t know in advance: that’s one reason an editing environment is a good choice). For example, my current task is finding funky Unicode characters encoded as XML character entities, and replacing them with ASCII equivalents (i know that’s not good form, but for this particular string-matching task, it’s good enough). I’ve got a few tens of thousands of lines of data, and i want to find all the different &#-encoded values so i can create a mapping table.

A simple pattern match for &# returns some 800 hits, and i don’t really want to look through all of them (particularly when there’s a typical 80/20 distribution: i’ll get the major ones, but miss some long-tail cases once i start scanning quickly). So here’s the easy trick in Emacs i find myself using a lot:

  1. M-x occur creates another buffer (called *Occur*) with all the lines that match a given regexp (i just use &#)
  2. I scan the first page of results to see what looks like the most common value (in this case &#8217;, a right single quotation mark equivalent to an apostrophe), and add it to my map
  3. Here’s where it gets cool: the *Occur* buffer is a filtered view of my data, and i can work on it directly (once i toggle its read-only status, a minor annoyance). So i switch to the *Occur* buffer, and then do M-x flush-lines for the value i just captured. This removes all the lines matching that case (about 400 of them for this first example), without damaging my original data (i’m in a different buffer).
  4. I go back to step 2 for a new value and repeat. Each time i’m capturing some large percentage of my data and then excluding that value from further consideration, getting a narrower and narrower view.
  5. At some point the view is narrowed down to a dozen or two lines, at which point i capture any remaining cases (all now in plain view), and i’m done.

This is completely interactive, the possibilities are always in plain sight so i can make decisions about where to go next, and i don’t have to go hunting around. If i make a mistake, i can undo, or just back up and start over. And the values are right there for easy cut-and-paste into another buffer where i’m writing my code. (Caveats: this approach really only works with line-oriented data, multiple matches per line make it more complicated, and of course you need to figure out suitable regexps) Most of the time, i find about 10 or so cycles is enough for me to find all the values i care about, out of an original set of thousands.
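
(For comparison, the one-shot batch version of this particular inventory task is a few lines of Python; what it lacks is exactly the interactivity described above. A sketch, with “data.xml” standing in for the real file:)

    import re
    from collections import Counter

    # Inventory all &#NNNN; / &#xHHHH; character entities, most common first.
    with open("data.xml", encoding="utf-8") as f:
        entities = Counter(re.findall(r"&#(?:x[0-9A-Fa-f]+|[0-9]+);", f.read()))

    for entity, count in entities.most_common():
        print(count, entity)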

Can your editor do that?

May 9th, 2007

Learning by Doing

Jon Udell writes about discovering an unanticipated benefit of screencasting (in his case, recording an elaborate route on a map):

… the most valuable part of this process might not be the use of the final output, but rather the act of producing it.

This hit on something i’ve been mulling over about electronic Bible study and the learning process. Back in the day, i led inductive manuscript studies with InterVarsity. We’d go away for a while and spend a long time marking up a wide-margin paper copy of the Scriptures with lots of colored marking pens. There was no “right” way to mark them: but the process of highlighting themes, connecting thoughts, outlining bits of grammar, looking up Old Testament references, etc. (and doing it in color, on paper) got us engaged with the text in wonderful ways far beyond passively listening to lectures.
Fast-forward 25 years to the present, where our software already knows connections between hundreds of resources at the level of words, references, topics, etc. It provides an enormous pool of information, but also tends to short-circuit the “hands on” process of exploration that made manuscript study engaging. How do we provide the benefits of personally exploring and organizing the material, without losing the benefits of all the other organization that’s now available?