Using Word Tree Visualization for Checking Title Consistency

I’ve gotten a lot of positive comments on my Zoomable Bible talk from BibleTech:08. While the prototype i showed was little more than a conceptual toy, i think people liked it because

animated visualizations are just plain cool, but even more importantly,
visualizations (like zoomable user interfaces) provide a different view of the text than our linear print legacy has previously encouraged.

However, the real test of a visualization isn’t its coolness, but rather whether it helps you understand things that are otherwise difficult to grasp. I had a good example of that this morning, and walking through it might help others see the value of this tool.

I wrote a year ago about IBM’s Many Eyes site, which provides a host of easy-to-use visualization tools: you upload your data set, choose a visualization technique, and voila, you’ve got a sharable visualization! I’ve posted a few data sets and visualizations there previously (the entire collection of my data and visualizations is here), and lots of others have posted interesting visualizations of Bible data as well. Of course, if you want fine control over the visualization, you’re probably not going to get it from these pre-packaged techniques. But it’s pretty impressive how much you can do with what’s there, and this is an easy way to learn about and sample different visualization techniques: if you’re a data-oriented person, i’d strongly encourage you to check it out.

One of their text-oriented visualization techniques is the word tree, which provides a kind of visual concordance for free text. This example of the KJV text of Genesis is a good illustration: type a word in the search box at the top and hit return, and you can see all the phrases that start with that word. You can also turn it around and find phrases ending with a word, and sort by frequency. James Tauber has also used the word tree technique for visualizing NT Greek nominal suffixes.

I found a new use for word trees today, in reviewing titles for the Composite Gospel Index (CGI). One motivation for creating the CGI a few years back was to make it easier to get an overview of the combined content of the four Gospels. Pericope titles are meant to help with this by effectively summarizing the content of a single story, and i deliberately tried to regularize their content. In particular, i wanted as many as made sense to start like “Jesus …”, to try to show the commonality: “Jesus teaches about …”, “Jesus heals …”, “Jesus tells the parable of …”, etc.

Word trees are a perfect tool for data like this, because they make it easy to find phrases that start the same way. They also visually separate phrases that share an opening but then diverge. I’ve created a word tree for titles from CGI pericopes (unfortunately, i haven’t figured out how to embed the visualization live here in my blog: WordPress keeps eating the script element). The input data to word trees are normally free text, but in my case each title is a complete unit, so i just wrapped each one in special tokens, +START+ at the beginning and +END+ at the end, making the input data look like this (except that, as viewed raw on the site, it’s all wrapped and hence not so readable).

+START+ Jesus is the Word +END+
+START+ God became a human being +END+
+START+ Jesus’ ancestry back to Adam +END+
+START+ Jesus’ ancestry from Abraham +END+
+START+ Luke’s purpose in writing +END+
+START+ The angel Gabriel promises the birth of John to Zechariah +END+

etc., for all 355 pericopes.
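For anyone wanting to prepare their own titles the same way, here is a minimal Python sketch of the wrapping step. The function name `wrap_titles` and the three-title sample list are just for illustration (they are not part of the CGI tooling); the marker tokens are the ones shown above.

```python
def wrap_titles(titles, start="+START+", end="+END+"):
    """Return one line per title, bracketed by the start and end marker tokens."""
    return "\n".join(f"{start} {title.strip()} {end}" for title in titles)


# A few titles copied from the sample above; the real input has 355.
sample_titles = [
    "Jesus is the Word",
    "God became a human being",
    "Luke’s purpose in writing",
]

wrapped = wrap_titles(sample_titles)
print(wrapped)
```

The output is plain text with one wrapped title per line, which is the kind of thing you can paste into a text-upload form.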

So if you enter “+start+ jesus” in the search box (or just click on Jesus in the default view), you’ll see the various titles that start with the word Jesus (255 of 355, or 72%: punctuation becomes a separate token, so a few starting with “Jesus’ …” aren’t included). This works even better sorted by frequency: here you can clearly see the most frequent title opening is “Jesus teaches …”, and clicking on “teaches” narrows the view further (which you pretty much have to do to see the details: a branch with more than 30 or 40 results isn’t really legible). One advantage of this representation is that it gives you some help in knowing what to explore (in user interface terminology, an affordance). Though i can’t see all the details without zooming in, i can see a significant cluster of titles starting with “Jesus warns”, and if that’s interesting, i can click on “warns” to zoom in and see those 18 titles.
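The branching described above can be roughly approximated outside Many Eyes. The Python sketch below is my own guess at the behavior, not how Many Eyes actually implements it: the tokenizer (words, plus each punctuation mark as its own token) and the helper names `tokenize` and `next_words` are assumptions. It works on the bare titles, without the marker tokens, and uses only titles quoted in this post.

```python
import re
from collections import Counter


def tokenize(title):
    """Split into word tokens; each punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", title)


def next_words(titles, prefix):
    """Count the token that follows `prefix` in every title that starts with it."""
    prefix_tokens = [t.lower() for t in tokenize(prefix)]
    n = len(prefix_tokens)
    counts = Counter()
    for title in titles:
        tokens = tokenize(title)
        if [t.lower() for t in tokens[:n]] == prefix_tokens:
            # A title that ends right after the prefix gets a placeholder branch.
            counts[tokens[n] if len(tokens) > n else "(end)"] += 1
    return counts


titles = [
    "Jesus is the Word",
    "God became a human being",
    "Jesus’ ancestry back to Adam",
    "Jesus’ ancestry from Abraham",
    "Jesus warns of coming judgment",
    "John baptizes Jesus",
]

branches = next_words(titles, "Jesus")
# most_common() gives the frequency-sorted view of the branches.
for token, count in branches.most_common():
    print(count, token)
```

Note how the two “Jesus’ ancestry …” titles land under a branch for the apostrophe token rather than under a word, which is the same effect that sets the “Jesus’ …” titles apart in the word tree.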

This last case also points out a benefit i hadn’t previously considered, which is consistency checking (finally getting to the main topic of this post). Looking at the frequency-sorted suffixes for “+start+ Jesus warns”, i see a large group under “against”, and a number under “about”, but also a single instance, “Jesus warns of coming judgment”. Because the third word is “of” rather than “about”, it stands apart from the other instances, which really share the same concept. This could just as easily be re-worded “Jesus warns about coming judgment”, and made more consistent with other similar pericopes. Given my goal of consistency (in order to enable just these kinds of visualizations!), it’s really useful to identify cases like this, where a minor revision retains the meaning but also makes the data more consistent. Similarly, the word tree visualization made it easy to enter “+start+ John” and find the one case where, instead of “John the Baptist”, i just put “John” (“John baptizes Jesus”).
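The same check can be sketched in code: under a given opening, flag any title whose next word is a one-off. This is only an illustration of the idea, assuming simple whitespace splitting; the function `rare_branches` is hypothetical, and the “topic A” through “topic E” titles are made-up placeholders standing in for the real “against” and “about” titles. Only the last title is an actual one.

```python
from collections import Counter


def rare_branches(titles, prefix, max_count=1):
    """Return titles whose word right after `prefix` occurs at most `max_count` times."""
    prefix_words = prefix.lower().split()
    n = len(prefix_words)
    # Keep titles that start with the prefix and have at least one more word.
    matching = [
        t for t in titles
        if t.lower().split()[:n] == prefix_words and len(t.split()) > n
    ]
    counts = Counter(t.split()[n].lower() for t in matching)
    return [t for t in matching if counts[t.split()[n].lower()] <= max_count]


titles = [
    "Jesus warns against topic A",  # placeholder, not a real CGI title
    "Jesus warns against topic B",  # placeholder
    "Jesus warns against topic C",  # placeholder
    "Jesus warns about topic D",    # placeholder
    "Jesus warns about topic E",    # placeholder
    "Jesus warns of coming judgment",
]

outliers = rare_branches(titles, "Jesus warns")
print(outliers)
```

A one-off isn’t necessarily a mistake, of course: this only produces a list of candidates to eyeball, much as the word tree does visually.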

What would be really great would be to turn this from a visualization into a navigation system, so once i’ve drilled down to “Jesus warns against …”, i could select a title and actually view the pericope text. That’s beyond the scope of the Many Eyes toolkit, but it’s something i expect to be working on in the future.