Logos 5, Behind the Curtain: Curation

This post is part of Logos 5, Behind the Curtain, a series of blog posts looking at new data sets that are part of the latest Bible study application from Logos Bible Software.


At the heart of Logos’ approach to data is the practice of curation. If you’ve heard this term at all, it’s probably in the context of museums: but we mean something rather different. As usual, Scott Adams’ Dilbert has the pulse of corporate America:

[Image: Dilbert comic strip by Scott Adams, from the official Dilbert website]

For Logos, though, curation is not just trendy new jargon, but an essential practice that we’ve been pursuing for quite a few years now (five under my direction, and more before my tenure). It’s a critical part of what makes Logos unique in its market. For example, we have not one but three Lexical Curators working on the Bible Sense Lexicon (one for Greek, two for Hebrew). To my knowledge (and Google’s), these appear to be the only people in the world with this job title. In fact, most people in the Content Innovation department at Logos are doing one kind of curation task or another.

So what is curation in the context of Bible study software, anyway? In the simplest terms, curation means organizing and maintaining a collection of things. In the museum context, that’s artifacts they display. Our kind of curation involves computer-readable data that captures knowledge relevant to biblical studies. You can summarize it briefly with three key practices:

  1. Collect
  2. Correct
  3. Connect

(“Describe” is probably a more apt term than “Correct”, but I just couldn’t resist the awesome alliteration.)

To “Collect” means imposing organization on a complicated and messy world of information, with an eye toward structuring it and making it useful in some way in the Logos system. Often this information is expressed in prose sentences in a reference book: for example, a Bible dictionary might describe a city, and include language about where the city was located, what larger region it was part of, where it’s mentioned in the Bible, etc. Fundamentally, a plan for collection requires deciding what information ought to be collected, and which things belong in the collection and which ones are excluded (in more technical terms, an ontology).

Capturing and formalizing knowledge typically involves some tradeoffs. First of all, we have to find categories and labels that balance utility against (formal) correctness. When it comes to categorization in particular, ultimately “everything is miscellaneous” (which is also the title of a really important book by David Weinberger), and if you push hard enough, each thing is its own unique category. But data sets are typically more useful if things are grouped in some way. So we categorize the following as “people” in our data set:

  • individuals (whether named or not)
  • groups, whether defined by residence (Greeks), common ancestry (Levites), belief systems (Pharisees and Sadducees), etc.
  • supernatural entities (including those that most Bible readers would accept like the God of Israel, and those that biblical authors argue are not real at all, like Baal, Asherah, or Zeus)

To “Correct” or “Describe” means that we choose and populate particular attributes for the items we’re curating. In the case of people, that includes things like names, gender, and roles. In addition, we create three special attributes for nearly every data set:

  • a unique identifier: though you’ll probably know who I mean when I use a biblical name like “David”, you won’t for “John” because there are five different individuals known by that name. This kind of ambiguity, and other variations (“John”, “John the Baptist”, “John Baptizer”, etc.), mean that names are usually poor identifiers. Instead, we create a symbol like “John.4” that uniquely identifies one particular item in our collection (in this case, a member of the Sanhedrin mentioned in Acts 4:6). Since we don’t show these to users, we don’t have to worry about people understanding them.
  • a label: since our data is for people to look at and understand, we need user-friendly ways to display an entity. Labels are typically brief (fewer than 20-25 characters), and also unique, so that in a drop-down list showing names that match “John”, I can distinguish “John (the Baptist)” from “John (Ac 4:6)”. Because the label is also unique, we could use it as the identifier; but data stability is a primary goal, so we separate the two and can change the label if necessary. Identifiers (not labels) are the means by which we connect data, so we (almost) never change them: doing so risks breaking the integrity of the data set.
  • a description: labels are brief so they take up minimum space, but consequently, they can’t carry much information. So we often provide a longer prose description, perhaps a sentence or two, that helps identify the entity and its most basic information. You could compare this to the leading sentence in a Wikipedia article. In the case of John.4, that’s “a member of the family of high priests in Jerusalem following Jesus’ ascension”, which is probably enough to help you decide whether this is a John you want to know more about or not.
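For the programmers in the audience, here is a minimal sketch of how those three attributes might fit together. It is not Logos’ actual implementation: the class and method names are invented for illustration, and “John.1” is a made-up identifier for John the Baptist (only “John.4” comes from the text above). The point it tries to show is that everything is keyed by the stable identifier, so a label can change without breaking anything that refers to the entity.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    identifier: str   # stable symbol such as "John.4"; never shown to users
    label: str        # short, unique, user-facing display string
    description: str  # a sentence or two of basic information


class Collection:
    """Hypothetical in-memory collection keyed by identifier."""

    def __init__(self):
        self._by_id = {}

    def _label_in_use(self, label, ignore_id=None):
        return any(
            e.label == label and e.identifier != ignore_id
            for e in self._by_id.values()
        )

    def add(self, entity):
        if entity.identifier in self._by_id:
            raise ValueError(f"duplicate identifier: {entity.identifier}")
        if self._label_in_use(entity.label):
            raise ValueError(f"duplicate label: {entity.label}")
        self._by_id[entity.identifier] = entity

    def get(self, identifier):
        return self._by_id[identifier]

    def relabel(self, identifier, new_label):
        # Labels may change; identifiers do not.
        if self._label_in_use(new_label, ignore_id=identifier):
            raise ValueError(f"duplicate label: {new_label}")
        self._by_id[identifier].label = new_label

    def labels_matching(self, text):
        # What a drop-down list might show for a typed query.
        needle = text.lower()
        return sorted(
            e.label for e in self._by_id.values() if needle in e.label.lower()
        )


people = Collection()
people.add(Entity(
    "John.4",
    "John (Ac 4:6)",
    "a member of the family of high priests in Jerusalem "
    "following Jesus' ascension.",
))
# "John.1" is an assumed identifier, used here only as an example.
people.add(Entity("John.1", "John (the Baptist)", "(description omitted)"))

print(people.labels_matching("John"))
```

Calling `people.relabel("John.4", ...)` changes what users see, while any other data that points at “John.4” keeps working.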

To “Connect” means linking entities to other entities (or other data sets). For people, family relationships are an important way that people connect to each other. We also label those relationships (father, mother, sister), and, for biblical information, capture the textual sources that support each relationship (more technically, the provenance of the data).
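A labeled connection with provenance could be sketched roughly like this. Again, this is an illustration under my own assumptions rather than our real schema: the `Relationship` class, its field names, and the identifiers “Jesse.1”, “David.1”, and “Solomon.1” are all hypothetical. Each record says that one entity stands in a named relationship to another, and lists the passages that support the claim.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Relationship:
    subject: str                 # identifier of the entity holding the role
    kind: str                    # relationship label, e.g. "father"
    object: str                  # identifier of the related entity
    provenance: Tuple[str, ...]  # textual sources supporting the link


# Identifiers below are invented for this example.
RELATIONSHIPS = [
    Relationship("Jesse.1", "father", "David.1", ("Ruth 4:22",)),
    Relationship("David.1", "father", "Solomon.1", ("1 Kings 2:12",)),
]


def find(relationships, kind, of):
    """Return every relationship where someone is the `kind` of entity `of`."""
    return [r for r in relationships if r.kind == kind and r.object == of]


for rel in find(RELATIONSHIPS, "father", of="David.1"):
    print(rel.subject, "is the", rel.kind, "of", rel.object, rel.provenance)
```

Because the records refer to identifiers rather than names or labels, the links survive any later change to how an entity is displayed.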

Connecting information is one of the most important aspects of curation for Logos. While it may be interesting to learn that King David was also a shepherd, that’s an isolated fact. But if you can get a list of other individuals in the Bible who were also shepherds (or kings, or musicians), now you’re discovering new information. You might not have started out looking for this, or known how to find it for yourself.
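Here is a small sketch of that kind of discovery, assuming roles are stored as a simple set per person. The sample data and function names are mine, chosen only to illustrate the idea, and plain names stand in for real identifiers. Inverting the data into a role-to-people index is what turns the isolated fact “David was a shepherd” into a list of everyone else who shares that role.

```python
from collections import defaultdict

# Illustrative sample data only; plain names stand in for identifiers.
ROLES = {
    "David": {"king", "shepherd", "musician"},
    "Saul": {"king"},
    "Moses": {"shepherd"},
    "Asaph": {"musician"},
}


def build_role_index(roles_by_person):
    """Invert person -> roles into role -> set of people."""
    index = defaultdict(set)
    for person, roles in roles_by_person.items():
        for role in roles:
            index[role].add(person)
    return index


def others_sharing_roles(person, roles_by_person):
    """For each of a person's roles, list the other people who share it."""
    index = build_role_index(roles_by_person)
    return {
        role: sorted(index[role] - {person})
        for role in sorted(roles_by_person[person])
    }


print(others_sharing_roles("David", ROLES))
```

Starting from one person, you end up with names you never searched for, which is the whole point of connecting the data.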

Question: which part of Logos’ curation process (Collect, Correct, Connect) do you find the most interesting or appealing? Please leave me a comment.

(Edit: saw a good piece today by John Chambers of Cisco about the power of connection. http://www.wired.com/insights/2012/12/the-internet-of-everything/)