Using Word Tree Visualization for Checking Title Consistency

I’ve gotten a lot of positive comments on my Zoomable Bible talk from BibleTech:08. While the prototype i showed was little more than a conceptual toy, i think people liked it because

  1. animated visualizations are just plain cool, but even more importantly,
  2. visualizations (like zoomable user interfaces) provide a different view of the text than our linear print legacy has previously encouraged.

However, the real test of a visualization isn’t its coolness, but rather whether it helps you understand things that are otherwise difficult to grasp. I had a good example of that this morning, and walking through it might help others see the value of this tool.

I wrote a year ago about IBM’s Many Eyes site, which provides a host of easy-to-use visualization tools: you upload your data set, choose a visualization technique, and voila, you’ve got a sharable visualization! I’ve posted a few data sets and visualizations previously, like:

(the entire collection of my data and visualizations is here), and lots of others have posted interesting visualizations of Bible data as well. Of course, if you want fine control over the visualization, you’re probably not going to get it from these pre-packaged techniques. But it’s pretty impressive how much you can do with what’s there, and this is an easy way to learn about and sample different visualization techniques: if you’re a data-oriented person, i’d strongly encourage you to check it out.

One of their text-oriented visualization techniques is the word tree, which provides a kind of visual concordance for free text. This example of the KJV text of Genesis is a good illustration: type a word in the search box at the top and hit return, and you can see all the phrases that start with that word. You can also turn it around and find phrases ending with a word, and sort by frequency. James Tauber has also used the word tree technique for visualizing NT Greek nominal suffixes.

I found a new use for word trees today, in reviewing titles for the Composite Gospel Index (CGI). One motivation for creating the CGI a few years back was to make it easier to get an overview of the combined content of the four Gospels. Pericope titles are meant to help with this by effectively summarizing the content of a single story, and i deliberately tried to regularize their content. In particular, i wanted as many as made sense to start like “Jesus …”, to try to show the commonality: “Jesus teaches about …”, “Jesus heals …”, “Jesus tells the parable of …”, etc.

Word trees are a perfect tool for data like this, because they make it easy to find phrases that start the same. Conversely, they tend to visually isolate phrases that start the same but then end differently. I’ve created a word tree for titles from CGI pericopes (unfortunately, i haven’t figured out how to embed the visualization live here in my blog: WordPress keeps eating the script element). The input data to word trees are normally free text, but in my case each title is a complete unit: so i just wrapped each one in the special tokens +start+ and +end+, making the input data look like this (except that, as viewed raw on the site, it’s all wrapped and hence not so readable).

+START+ Jesus is the Word +END+
+START+ God became a human being +END+
+START+ Jesus’ ancestry back to Adam +END+
+START+ Jesus’ ancestry from Abraham +END+
+START+ Luke’s purpose in writing +END+
+START+ The angel Gabriel promises the birth of John to Zechariah +END+

etc., for all 355 pericopes.
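
Preparing input like this is easy to script. The following is a minimal Python sketch, not the code actually used for the CGI data: the function name `wrap_titles` and the three-title sample list are assumptions made for illustration.

```python
# Boundary tokens as they appear in the sample data above.
START, END = "+START+", "+END+"

def wrap_titles(titles):
    """Return one line per title, wrapped in the boundary tokens."""
    return [f"{START} {title.strip()} {END}" for title in titles]

# A few titles copied from the sample above (the full set has 355).
titles = [
    "Jesus is the Word",
    "God became a human being",
    "Luke’s purpose in writing",
]

wrapped = wrap_titles(titles)
print("\n".join(wrapped))
```

Keeping one wrapped title per line also keeps the raw data readable before it is uploaded.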

So if you enter “+start+ jesus” in the search box (or just click on Jesus in the default view), you’ll see the various titles that start with the word Jesus (255 of 355, or 72%: punctuation becomes a separate token, so a few starting with “Jesus’ …” aren’t included). This works even better sorted by frequency: here you can clearly see the most frequent pericope title is “Jesus teaches …”, and clicking on “teaches” narrows the view further (which you pretty much have to do to see the details: results over 30 or 40 aren’t really visible). One advantage of this representation is that it gives you some help in knowing what to explore (in user interface terminology, an affordance). Though i can’t see all the details without zooming in, i can see a significant cluster of titles starting with “Jesus warns”, and if that’s interesting, i can click on “warns” to zoom in and see those 18 titles.
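
The same kind of prefix search can be approximated offline. This is a rough Python sketch that assumes simple whitespace tokenization, which is not how Many Eyes tokenizes (it splits punctuation into separate tokens); the helper name `starts_with` is hypothetical. With whitespace tokens, “Jesus’” is simply a different word from “Jesus”, so the possessive titles drop out here too.

```python
def starts_with(title, prefix):
    """True if the title's leading words match the prefix, ignoring case."""
    title_tokens = title.lower().split()
    prefix_tokens = prefix.lower().split()
    return title_tokens[:len(prefix_tokens)] == prefix_tokens

# The six sample titles shown earlier in this post.
titles = [
    "Jesus is the Word",
    "God became a human being",
    "Jesus’ ancestry back to Adam",
    "Jesus’ ancestry from Abraham",
    "Luke’s purpose in writing",
    "The angel Gabriel promises the birth of John to Zechariah",
]

matches = [t for t in titles if starts_with(t, "Jesus")]
print(matches)

# The figure quoted above: 255 of 355 titles, as a rounded percentage.
share_of_all_titles = round(100 * 255 / 355)
print(share_of_all_titles)
```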

This last case also points out a benefit i hadn’t previously considered, which is consistency checking (finally getting to the main topic of this post). Looking at the frequency-sorted suffixes for “+start+ Jesus warns”, i see a large group under “against”, and a number under “about”, but also a single instance, “Jesus warns of coming judgment”. Because the third word is “of” rather than “about”, it stands apart from the other instances which really share the same concept. This could just as easily be re-worded “Jesus warns about coming judgment”, and made more consistent with other similar pericopes. Given my goal of consistency (in order to enable just these kinds of visualizations!), it’s really useful to identify cases like this, where a minor revision retains the meaning but also makes the data more consistent. The word tree visualization also made it easy to enter “+start+ John” and find the one case where, instead of “John the Baptist”, i just put “John baptizes Jesus.”
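
This kind of check can also be sketched in code: count the words that follow a shared prefix and flag any that occur only once. The Python below is an illustration under stated assumptions, not part of Many Eyes. The function names are hypothetical, and every sample title except “Jesus warns of coming judgment” is invented for the example rather than taken from the CGI.

```python
from collections import Counter

def next_word_counts(titles, prefix):
    """Count the word that immediately follows the prefix in each matching title."""
    prefix_tokens = prefix.lower().split()
    n = len(prefix_tokens)
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()
        if tokens[:n] == prefix_tokens and len(tokens) > n:
            counts[tokens[n]] += 1
    return counts

def singletons(counts):
    """Words seen only once after the prefix: candidates for re-wording."""
    return sorted(word for word, count in counts.items() if count == 1)

# Only the "of coming judgment" title is real; the rest are made up for this sketch.
titles = [
    "Jesus warns against hypocrisy",
    "Jesus warns against greed",
    "Jesus warns against false prophets",
    "Jesus warns about persecution",
    "Jesus warns about divisions",
    "Jesus warns of coming judgment",
    "Jesus heals a leper",
]

counts = next_word_counts(titles, "Jesus warns")
print(counts.most_common())
print(singletons(counts))
```

A singleton is only a hint, since some one-off titles are legitimately unique, so each flagged title still needs a human look.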

What would be really great would be to turn this from a visualization into a navigation system, so once i’ve drilled down to “Jesus warns against …”, then i could select a title and actually view the pericope text. That’s beyond the scope of Many Eyes’ toolkit, but something i expect to be working on in the future.

More BibleTech:08 Followup

Additional posts of presentations and blog reviews about BibleTech:08 have continued to trickle in: there are even some photos, like this one taken during my Zoomable Bible talk.

I’ve finally got the slides up from my talks.

  • The Zoomable Bible. Abstract: Information visualization is an established computer technique for providing rich, typically interactive, visual presentations of complex multivariate data. While increased computing power has made information visualization more common, our interfaces for navigating and browsing the Bible are still largely linear adaptations of traditional print forms. New interface paradigms (like Apple’s iPhone and Microsoft’s SeaDragon technology) can present large amounts of information on a traditionally-sized computer display through the use of Zoomable User Interfaces (ZUIs). This presentation will overview existing tools, applications, and principles for ZUIs and other visualizations, and explore some novel interfaces that give higher-level views of Biblical content.
  • Bibleref: a Microformat for Bible References. Abstract: Microformats are “a set of simple, open data formats built upon existing and widely adopted standards” (see http://microformats.org) that capture small but important bits of information on web pages. Bibleref is a proposed microformat for identifying Bible references that are embedded in blog posts and other web content. Broad use of bibleref would enable search engines, content aggregators, and other automated tools to correctly label the references so they’re more easily searchable. This presentation will explain why bibleref is needed, explore the technical specifics, and discuss how to promote broader adoption.

They’re not fully linked into the navigation structure of SemanticBible yet, but the direct URLs linked above (which i gave in the talk) work fine. I’ll probably also tweak the content a bit (i really need some screenshots for the Zoomable Bible talk), but i wanted to get the official version out without more delay. There are lots of links embedded in the presentations, especially the resources at the end of the Zoomable Bible talk, so look for blue text.

If you’re curious, i’ve created these with Dave Raggett’s Slidy program (see this previous post). Editing (X)HTML content for these by hand is still a little clunky (though i’ve gotten better at it), and it would be nice to have a WYSIWYG interface (i did lots of edit -> save -> switch to browser -> reload -> view cycles: it’s quick, but still painful). But the big payoff for me is that the result (unlike PowerPoint) is really a first-class citizen of the web. For example, all the content gets indexed by the search engines, you can link into the presentations (each page has an ID), and not only can i talk about web markup, i can illustrate the point in the body of the presentation itself (view the source of the Bibleref talk for examples). Yes, you can publish PowerPoint on the web, but that’s its own special challenge, which is why nobody does it: they just post .ppt files, which are largely opaque to web tools. The newer version of Slidy also improves browser compatibility: these presentations mostly work fine under IE (though you don’t get the footer).

BibleTech Postscript

I’m still trying to catch up from all the excitement and activity of BibleTech:08, but the one-word summary is awesome. Several others have posted summaries of their experiences (Mark, Rick, Phil), and hopefully more will be emerging in the coming days. The Technorati tag bibletech08 should collect a bunch of them. I was amazed how quickly they seemed to be picking up my on-the-spot twitter-style posts from the conference (though now i don’t find them there at all?? Go figure).

I should have a number of posts in the next few days as i dig out, including:

  • slides from my talks (both on semanticbible.org and probably also on the BibleTech:08 website)
  • some exciting news about sharing data on Biblical names

Neil Mayhew and Larry Waswick: Electronic Tools and Bible Translation

(Neil) Language analysis, ethnography, and translation work all rely on reference materials, not only commentaries and Bible dictionaries, but also language study, grammars, and other reference works. Wycliffe works in a “massively multilingual” context of some 6900 languages that often require new characters and complex scripts. Training translators now uses a “just in time” approach, since start-up time for new translators can be extensive. So they’re moving their own materials into Libronix so that they have an integrated system. (Larry) They use FieldWorks as a suite of tools for collecting ethnographic materials, doing translation, and a special word processor that handles rendering new Unicode characters. Translators Editor uses the Libronix DLS Object Model for interacting with Libronix and synchronizing the display to a selected passage. So separate tools from separate organizations can work together well.

(Neil) Vision 2025 is an attempt to get language work started in all the remaining languages that need a translation by the year 2025. This means a greater focus on equipping mother-tongue translators to get work started: less costly hardware and software, and more time off-grid (without AC power). So they’re evaluating One Laptop Per Child’s XO machines as an alternative. (XO demo)

Bob MacDonald: Visualizing Micro and Macro Structures in Scripture

Started from studying the Letter to the Hebrews, and understanding the role of Psalms in that book. A year and a half into a 6-year project to learn Hebrew and translate the Psalms, using color to represent parallels, chiasm, repetition, and other aspects. Other visualizations show types (songs, maskils, etc.), attributed authorship, and the distribution of different names of the Lord.

Patrick Durusau: Topic Maps and the Bible

An early experience in connecting manuals to software failed because subjects weren’t used and described consistently. Two outcomes of this failure were DocBook and Topic Maps, initially implemented in HyTime, but quickly afterwards in XML, producing XTM (see topicmaps.org). XTM was adopted into ISO 13250. The revisions to the current version should be mostly completed this year or next.

A diversity of identifications is a given: there won’t be a single identifier that everyone will adopt. This is an old problem in computer science that goes by lots of different names, like record linkage, entity resolution, etc. The “my identifier” method assigns a unique identifier to each thing: but you have to trust their judgment about what’s split and joined, and the original distinctions have gone away.

In Topic Maps, a subject is represented by a “topic”. A separate mechanism deals with relationships (“associations”) and “occurrences”, an instance of a subject. Subjects have identifiers and locators (like a URL). The Topic Maps Reference Model is different: an abstract model where subjects are represented by proxies. Identification matters because it defines what we can talk about. Topic maps give us a way to integrate information across separate databases. Subject-centric computing is another old concept. We need some basis for disclosing our rules for merging: that way it can proceed bottom-up.

Reinier de Blois: the Semantic Dictionary of Biblical Hebrew

A simple digital re-implementation of a book like a Bible dictionary may not be very user-friendly. You can use hyperlinks to hide details, and then make them available on demand. But a few dictionaries have been designed from the beginning for digital publication, including the UBS Semantic Dictionary of Biblical Hebrew (SDBH), which is based on structural semantic distinctions in the language. It’s based on cognitive linguistics, where meanings have words (instead of vice versa). For Hebrew, you need both a paradigmatic and a syntagmatic (contextual) perspective. An apple is a kind of fruit (paradigmatic), but also a member of the horticulture domain that includes tree, ripe/unripe, picking, etc. So “sheep” in Hebrew is related to domestic animals as a category (which doesn’t include “pig”!), but also to the contextual domains of animal husbandry and sacrifice.

Their dictionary tool starts with a template, and uses drag-and-drop to attach references (they want to attach them all, not just selected examples). After an entry is completed, you can find both the categorial and contextual relationships. About a third of the data has been completed, and the results are all available at www.sdbh.org. Since it’s not complete yet, you can’t find every word, but you can evaluate what’s there, and even contribute additional material. This will be integrated into Libronix when it’s complete.

Mark Hoffman: Digital Resources for Biblical Maps and Mapping

Good survey of many existing resources: map types and sources. Road maps help explain historical developments. Accordance has a nice animated map of events like Paul’s missionary journeys. Copyright issues throughout are complicated and variable, however, whether for ministry or commercial use, though non-profit usage is usually less restricted. Orientation is a special issue for Palestine, since the aspect works better with East at the top, though simply rotating the map doesn’t address the problem. Some interesting issues in reconciling the Biblical record with archaeology: if Ai was completely destroyed, then what do we make of its map placement? Traditional locations may not be authoritative.

The ability to edit maps is important, since most people don’t have time or skill to revise their own maps. Some of the software packages have maps that can be edited, layered, etc. But ultimately, what makes a good map depends on the intended usage. We want our text, maps, and reference works to be interactive, so we can easily go from one to the other. (other interesting copyright discussion about whether use in church is considered a “teaching purpose”)

Gerasa project uses Google Earth to show ancient Roman cities. Megiddo is a great example of how maps enrich our understanding: you need to know the geography to make sense out of Megiddo’s importance. Walked through some examples using BibleWorks, Accordance, Holy Land 3-D, and Google Earth to look at the journey to Emmaus in Luke 24:13. Google Earth makes it easy to incorporate your own photos (when they’re geotagged) through Picasa and Panoramio: layering on Google Earth seems likely to grow in popularity.

Stephen Smith: The ESV and Bible Usability

Electronic books (like Amazon’s Kindle) are the next big thing, but they don’t yet give publishers enough flexibility to address usability issues for the Bible. How do we model what happens to people when they encounter the Bible text? Questions like their profession, education, church involvement, where they are when reading, and many others all affect the process of designing usability for Bible readers. (lots of questions and dimensions to this discussion)

After you’ve identified a persona (with particular characteristics), then you can answer some of the questions about how to design the right Bible for this kind of person. Don’t Make Me Think (Steve Krug) is a good introduction to web usability. Bible publishers and developers should share their usability data so we can all learn.

Mark Miller: New Culture, New Media

What makes great communication? Story. (lots of pointers to websites doing New Media) Vox Pop Network (Mars Hill) is a good example of a church using new media. Media Convergence will force current software companies to re-think their models. Mashups are a new mix between professionally-created content and user-generated content. RSS is becoming a new broadcast model. wefeelfine.org is one interesting flash visualization of a variety of feelings taken off the web. (Lots of discussion around how communication is changing, and new modes of communication. Search “continuous partial attention” for more about how multiple simultaneous communication modes affect us.)