Kurt Fuqua: High-Quality Machine Translation Using a Semantic Interlingua

Many languages still lack a Bible translation, and the traditional approach is so time-consuming that it cannot keep pace with demand. Machine translation through an interlingua offers another approach to meeting the need for new translations. Analysis of the source (exegesis) is the hardest part, but it only has to be done once and can then be reused many times in generating new translations.

Having a good interlingua is critical. InterLing, a predicate logic interlingua, was created in 1996 as an open, freely available standard. One intermediate goal short of full translation is generating an interlinear: not completely fluent, but a useful tool for leaders. Layout can be automated using XML with word correlations, and the same is true for general Bible publication; cross-references, for example, are language-independent. Languages that have only a New Testament translation can bootstrap Old Testament translations from the NT grammar and related resources.

Andi Wu: Treebanks of Biblical Texts

Andi, from the Asia Bible Society, is building Bible treebanks that give the syntactic structure of verses; it’s a joint project between the Asia Bible Society and the Groves Center at Westminster Theological Seminary. Treebanks support data-driven analysis, study tools, syntactic search, and tree alignment for evaluating translations, concordancing, and other applications.

Dynamic treebanking: use a parser to generate trees directly, but rather than editing the trees, correct the lexical attributes that guide the parser to the correct analysis. [So it’s a compilation process rather than static data annotation.] Editing trees directly is painful (and can be dangerous), and inter-annotator consistency can be a problem.
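A toy sketch of the idea (the lexicon entries, attribute names, and parser logic here are all hypothetical, not ABS's actual system):

```python
# Toy sketch of dynamic treebanking: the treebank is regenerated by a
# parser, and corrections are made to lexical attributes rather than to
# the trees themselves. Entries and attribute names are hypothetical.

LEXICON = {
    "dabar": {"pos": "noun", "construct": False},  # hypothetical entry
    "yhwh": {"pos": "noun", "construct": False},
}

def parse(tokens, lexicon):
    """Stub parser: its output depends on a lexical attribute."""
    head = tokens[0]
    if lexicon[head]["construct"]:
        return ("NP-construct", head, tuple(tokens[1:]))
    return ("NP-flat", head, tuple(tokens[1:]))

# A wrong parse is fixed by correcting the lexicon, then re-parsing:
LEXICON["dabar"]["construct"] = True
tree = parse(["dabar", "yhwh"], LEXICON)  # now parsed as a construct NP
```

The appeal is that one lexical correction propagates to every tree that word appears in, which is exactly what static tree editing can't give you.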

Current status: the Hebrew prosodic treebank is completely parsed; the Hebrew syntactic treebank is parsed (except for Daniel) and being manually checked; the Greek syntactic treebank is parsed and being checked. For the Hebrew prosodic treebank (which uses the Masoretic cantillation marks), every verse has been successfully parsed, and more than 99% of the verses received one and only one parse.

Syntactic treebanks use a combination of phrase-structure grammar (at the phrase level) and dependency grammar (at the clause level), since Hebrew and Greek are non-configurational languages at the clause level. They use multiple trees for ambiguous parses. He showed an interesting dynamic interlinear: selecting a word in one language highlights the corresponding words in the other.
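The dynamic interlinear behavior boils down to a word-alignment lookup; a minimal sketch (the alignment table below is illustrative, not ABS's data format):

```python
# Illustrative word-alignment table for a dynamic interlinear: each
# source-token index maps to the target-token indices it corresponds to.
# (Made-up alignment, not ABS's actual data format.)
ALIGNMENT = {
    0: [0],      # Greek "en"  -> English "in"
    1: [1, 2],   # "archē"     -> "the beginning"
    2: [3],      # "ēn"        -> "was"
    3: [4, 5],   # "ho logos"  -> "the Word"
}

def highlight(source_index, alignment=ALIGNMENT):
    """Target-token indices to highlight when a source word is selected."""
    return alignment.get(source_index, [])
```

The same table, read in reverse, drives highlighting in the other direction, which is what makes the interlinear feel bidirectional.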

Open issues: verse- versus sentence-based parsing, and discontinuous constituents. Ultimately he’d like to get to a logical-form (language-independent) representation. “This kind of work is much more fun than Microsoft”!

Karl Hofmann: Building Community or Building Babel?

His hope is that technology practitioners will think more carefully about how technology impacts our approach to the Bible. Strong influence: Ivan Illich, a Catholic priest who was involved in the early discussions that led to Vatican 2, and left that process. “The distortion of the best thing becomes the worst thing.” Two books:

Also Rivers North of the Future, conversations with Ivan Illich.

How is the Bible technology? (Audience participation produced some of the comments here.) Examples: many of the Biblical texts were meant to be heard, not read. Our manuscript traditions and text-critical process affect the “output”. We need to be involved with the Author, not just the text, and with the right hermeneutic. There are other “forms” of the Bible than print, and software especially is pushing on that definition. Web technology brings new opportunities for community.

The story of Babel in Gen 11 talks about technological development and people “making a name for themselves” through language. What is it that we want to build, and how will it enable us to communicate with each other? The “reification of a community of belief”. Technology can be a double-edged sword.

The technology that’s been added to the Bible (typography, chapter and verse divisions, etc.) allows us to get away from what’s actually being said. “The magic is at run time”, the impact of the Word on the hearer. Just adding vowels to Hebrew can distinguish communities of faith.

James Tauber: MorphGNT

The beginnings of MorphGNT were seeing the functional annotation of the Friberg AGNT, and the CCAT data from UPenn in Beta Code. An early realization in working with CCAT was that the data would be much more usable if regularized into one lemma per line: you can then use Unix command-line utilities. Further investigation revealed thousands of errors (though some were systematic and hence easily fixed), including some deeper analytic ones. Building systems to generate the data from scratch has been an important part of the process of identifying errors.

Much of the content of Mounce’s book (with reference to morphology) could be replicated with a single awk command on the MorphGNT data.
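A sketch of what the one-analysis-per-line regularization buys you: line-oriented tools can answer frequency and morphology questions directly. The six-field layout assumed below (book, chapter, verse, form, lemma, parse) is for illustration only, not MorphGNT's exact format:

```python
# With one whitespace-separated record per word, a shell pipeline like
#   cut -d' ' -f5 morphgnt.txt | sort | uniq -c | sort -rn
# counts lemma frequencies; the Python equivalent:
from collections import Counter

def lemma_frequencies(lines):
    """Count lemma occurrences in one-word-per-line records."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 5:
            counts[fields[4]] += 1  # assumed lemma column
    return counts

# Hypothetical records in the assumed field layout:
sample = [
    "Mt 1 1 biblos biblos N-NSF",
    "Mt 1 1 geneseōs genesis N-GSF",
    "Mt 1 1 Iēsou Iēsous N-GSM",
    "Mt 1 2 Iēsous Iēsous N-NSM",
]
freq = lemma_frequencies(sample)  # freq["Iēsous"] == 2
```

This is the sense in which an awk one-liner can replicate a morphology table: once the data is regular, the query is trivial.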

Helped start the Electronic New Testament Manuscripts Project in 1996, but it was too early, and people didn’t understand what it meant to put things on the web.

Much of the early challenge was simply putting Greek on the web. This led to GreekGIF, a series of images of Greek letters that enabled more readable representations. “I’m relieved to say this is no longer necessary”!

Early involvement with XML put PhD plans on hold. Around 2002, started working on automatically generating inflected forms (initially driven by Mounce’s classes). In 2004, released v5 of MorphGNT, now with Unicode. zhubert.com was the next major development in the use of MorphGNT, and a milestone event. Since then, been working on other corrections (which haven’t yet led to a new version), started PhD studies, and also started collaborating with Ulrik Sandborg-Petersen. A current interest is splitting the text from the analysis: you really need an additional field to identify the analysis for a particular form to eliminate all ambiguity. Also working on splitting off the lexicon: morphological analysis, semantic domain, and other attributes, as well as standardizing lexical representations.

The myth of vocabulary coverage: “the 100 most common words account for 66% of the text”. But these words typically carry little information content, and you really need about 95% of the words in a verse to understand it (according to learning theorists). What we really need is a new kind of graded reader that’s optimized for early comprehension: clause-based, form- (rather than lexeme-) based, and giving context in English. Progressive substitution of Greek phrases into an English text helps provide a gradual transition.
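The coverage claim is easy to check computationally; a back-of-envelope sketch (toy English corpus here, since the real figure requires the actual Greek token stream):

```python
# What fraction of running text do the n most frequent word types cover?
from collections import Counter

def coverage(tokens, n):
    """Fraction of tokens covered by the n most frequent word types."""
    counts = Counter(tokens)
    covered = sum(c for _, c in counts.most_common(n))
    return covered / len(tokens)

# Toy corpus: the single most common type ("the") covers 3 of 11 tokens.
tokens = "the word was with the god and the word was god".split()
top1 = coverage(tokens, 1)
```

Run on high-frequency function words this number climbs fast, which is exactly the myth: high token coverage without the content words that carry the meaning.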
Web site: http://morphgnt.org.

Blogging BibleTech:2008

It’s only 10AM day one, and i’ve already had several great conversations here at BibleTech, where some of the leading figures in the intersection of technology and Biblical study are gathered.

I’m planning to informally blog a few of the talks, just so those of you who didn’t come will know what you’re missing! You’ll find these (and hopefully other posts about the conference) here behind the Technorati tag “bibletech08”.

Countdown to BibleTech:2008

Things have been very quiet on Blogos for the last few weeks, as i’ve been cranking away on a prototype for my Zoomable Bible talk at BibleTech:2008. While i’ve always loved learning new things, over the last month i’ve been positively cramming on a multitude of totally new subjects to me:

  • programming in C# (i’ve been spoiled by Python)
  • using Visual Studio as an IDE, including integration with MySQL databases
  • basics of 2D graphics
  • layout algorithms for treemaps (major kudos to the University of Maryland’s Human-Computer Interaction Lab for not only pioneering this area, but even providing open source implementations for people like me to learn from)
  • using the excellent (but rich and hence challenging) Piccolo 2D toolkit for building zoomable user interfaces (also from the UMd HCIL group)
  • loading up a variety of Bible data (since visualization requires something to visualize!)

I also have a separate presentation about Bibleref: a Microformat for Bible References, and some related recent developments at Logos that will help make the world of on-line information about the Bible more searchable and usable than ever before. You’ll learn more at the conference about some of our plans in this area.

There will also be time Friday night for “birds of a feather” sessions to informally gather people around topics of common interest. I’m hoping to bring together people to talk about developing common naming conventions for people and places in the Bible. If you’ve been following my posts on the Bible Knowledgebase, you know an essential part of this work is simply identifying and disambiguating named people and places: which Judah, or Zechariah, or Gaius, or Jabneel, is which? I think some simple agreement on identifiers, and principles for constructing them, would make sharing such data much easier, and Logos is prepared to start by sharing our own sets of identifiers. So be sure to find me there if you’d like to talk more about how to make this happen. (By the way, i was tickled to see that my post on the most important person in the Bible was #7 in Logos’s list of the Top Ten Blog Posts for 2007 (most viewed)).
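Purely as illustration (this is not Logos's actual identifier scheme), such a convention might look like:

```python
# Illustrative identifier scheme for disambiguating Biblical names (a
# made-up convention, not Logos's): shared names get distinct identifiers,
# with enough context recorded to tell the referents apart.
PEOPLE = {
    "Judah.1": {"name": "Judah", "note": "son of Jacob"},
    "Judah.2": {"name": "Judah", "note": "ancestor in Luke 3:30"},
}

def resolve(surface_name, registry=PEOPLE):
    """All identifiers whose display name matches a surface form."""
    return [key for key, entry in registry.items()
            if entry["name"] == surface_name]
```

The value of agreeing on identifiers like these is that two datasets annotated independently can then be merged without re-doing the disambiguation work.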

As i told one of the speakers in an email earlier today, i’m feeling a little giddy about what a great conference this promises to be. BibleTech, and the interesting and diverse group of people who are coming, really encompasses all the things that brought me to Logos in the first place, and that define my current professional endeavors as well as my personal interests.

It’s not too late! Come join us this Friday and Saturday at the SeaTac Hilton in Seattle (registration details).