BibleTech:2010 Debrief

The BibleTech conference is an annual highlight for those of us who work at the intersection of Bible stuff and technology, and last week’s meeting in San Jose was no exception. This was the third BibleTech — i’ve been fortunate to have attended (and presented at) them all — and there’s always a great mix of new ideas, updates on ongoing projects, and lots of interesting people to talk to. (some other reviews: Rick Brannan, Mike Aubrey, Trey Gourley)

Some of the talks i liked best this year:

  • I was already interested in Pinax before hearing James Tauber’s talk on Using Django and Pinax for Collaborative Linguistics: now i’m itching to get started!
  • Stephen Smith had a nice analysis of the most frequently tweeted Bible passages (though the evidence of vast swaths of Scripture that get very little attention was perhaps a bit depressing).
  • Neil Rees showed Concordance Builder, a program that lets you use a Swahili concordance to bootstrap one for Welsh (or any other pair of languages) with no linguistic knowledge. Building on the Paratext tool, it leverages the verse indexes along with approximate string matching and statistical glossing (technical paper by J D Riding) to produce results that are about 90-95% correct out of the box. This can reduce concordance development to a matter of weeks rather than years.
  • There were several talks related to semantics in addition to mine: Randall Tan talked about more automated methods and fleshed them out relative to the higher-level structure of Galatians, and Andi Wu gave what looked like a really interesting presentation on semantic search based on syntax and cross-language correspondence (alas, i missed it).
  • Weston Ruter talked about APIs they’re developing at OpenScriptures.org (and brought in the Linked Data idea). Logos also unveiled their new API for Biblia.

I felt my talks went well and i got some good feedback. My slides are now posted (if you wrote down URLs at the conference, i didn’t get them quite right 🙁 but here they’re correct):

(As with some previous talks, i did my presentation with Slidy (previous post): i feel like it’s going a little more smoothly each time.)

Ruter: Open Scriptures: Picking up the Mantle of the RE:Greek-Open Source Initiative

The background of this talk: Zack Hubert’s talk from the last BibleTech. Zack developed a very useful web site which ultimately failed because he couldn’t maintain it, and couldn’t get other developers to pitch in and help.

The vision: an open web repository for integrated scriptural data and a platform for building applications of scripture (OpenScriptures.org). What kinds of data? Manuscripts, translations, versification systems, morphosyntactic parsings, user tags/annotations/cross-references. But it takes a lot of effort to get started with all this data: each dataset is typically in its own format, and not linked to any of the others.

Linked data principles (from timbl):

  • use URIs as names for things
  • use HTTP URIs so that people can look up those names
  • provide useful information behind the URIs
  • and links to other URIs so they can discover more things
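
To make the four principles concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `example.org` URIs, the `quotes` link type, and the in-memory `RESOURCES` table (standing in for real HTTP lookups) are hypothetical and are not OpenScriptures identifiers or APIs. The one fact it relies on is that Matthew 4:4 quotes Deuteronomy 8:3.

```python
from collections import deque

# Hypothetical data: each thing is named by an HTTP URI (principles 1 and 2),
# has useful information behind it (3), and links to other URIs (4).
RESOURCES = {
    "http://example.org/passage/Matt.4.4": {
        "label": "Matthew 4:4",
        "links": {"quotes": ["http://example.org/passage/Deut.8.3"]},
    },
    "http://example.org/passage/Deut.8.3": {
        "label": "Deuteronomy 8:3",
        "links": {},
    },
}


def look_up(uri):
    """Stand-in for an HTTP GET on the URI (no network access here)."""
    return RESOURCES[uri]


def discover(start_uri):
    """Follow links breadth-first and return the labels of everything reachable."""
    seen = {start_uri}
    queue = deque([start_uri])
    labels = []
    while queue:
        resource = look_up(queue.popleft())
        labels.append(resource["label"])
        for targets in resource["links"].values():
            for uri in targets:
                if uri not in seen:
                    seen.add(uri)
                    queue.append(uri)
    return labels


print(discover("http://example.org/passage/Matt.4.4"))
```

Starting from one URI, a client that knows nothing else about the data can still find the related passage, which is the "discover more things" part of the last principle.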

“… the more things you have to connect together, the more powerful it is.” Can we connect things together through a unified manuscript that links together semantic units (words, phrases, clauses)?

Manuscript unification: normalize a manuscript (lowercase and remove diacritics: no spelling normalization yet), insert and save links to the unified manuscript. Then for additional manuscripts, normalize, merge links, and save them. Now you’ve got all the attested readings linked together. This unified manuscript now has an automated critical apparatus. [demo here of the manuscript comparator]
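
The normalize-and-merge steps above can be sketched roughly as follows. This is not the Open Scriptures implementation: the alignment via Python's `difflib.SequenceMatcher`, the function names (`normalize`, `unify`, `apparatus`), the manuscript ids "A" and "B", and the two toy texts (the second omits the article on purpose) are all assumptions made for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(token):
    """Lowercase and strip diacritics; no spelling normalization."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def unify(manuscripts):
    """Merge {manuscript id: text} into a list of (reading, witnesses) slots."""
    unified = []
    for ms_id, text in manuscripts.items():
        tokens = [normalize(t) for t in text.split()]
        base = [reading for reading, _ in unified]
        merged = []
        matcher = SequenceMatcher(None, base, tokens, autojunk=False)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal":
                # Same reading: link this manuscript to the existing slot.
                for reading, witnesses in unified[i1:i2]:
                    merged.append((reading, witnesses | {ms_id}))
            else:
                # Keep readings this manuscript lacks, then add its own.
                merged.extend(unified[i1:i2])
                merged.extend((tok, {ms_id}) for tok in tokens[j1:j2])
        unified = merged
    return unified


def apparatus(unified, manuscripts):
    """List the readings that are not attested by every manuscript."""
    everyone = set(manuscripts)
    return [(reading, sorted(w)) for reading, w in unified if w != everyone]


# Toy input, invented for this sketch (not real manuscript readings).
sample = {"A": "Ἐν ἀρχῇ ἦν ὁ λόγος", "B": "ἐν ἀρχῃ ἦν λόγος"}
print(apparatus(unify(sample), sample))
```

In this toy case the differences in accents and capitalization disappear under normalization, so the only entry in the apparatus is the article that "A" has and "B" lacks. A real system would also need to handle punctuation, spelling variation, and transpositions, which this sketch ignores.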

Potential applications include:

  • translation comparator (can also help with the versification problem)
  • comprehensive concordance
  • translation-independent cross-references (e.g. NT quotations of the OT)
  • interlinear/bilingual editions

You can automatically link manuscripts in the same language, but not different languages. Use collective intelligence to capture semantic linking between languages. Use the “games with a purpose” (GWAP) approach to gather links.

Copyright is a major challenge: you can’t link texts together if you can’t access them, and you can’t share them if they’re not open. Recently MorphGNT texts have been taken down from several sites because they’re not freely sharable. If the key benefit is connections between data, then data (including texts) should be more valuable if they’re sharable and connected. One solution: an Open Scriptures Platform that connects content owners, developers, and end-users. Passionate developers could build applications based on content licensed to Open Scriptures (as a proxy), and Open Scriptures makes sure that end-users provide revenue to content owners.