Logos 5, Behind the Curtain: Curation

This post is part of Logos 5, Behind the Curtain, a series of blog posts looking at new data sets that are part of the latest Bible study application from Logos Bible Software.

At the heart of Logos’ approach to data is the practice of curation. If you’ve heard this term at all, it’s probably in the context of museums: but we mean something rather different. As usual, Scott Adams’ Dilbert has the pulse of corporate America:

[Dilbert comic strip by Scott Adams]

For Logos, though, curation is not just trendy new jargon, but an essential practice that we’ve been pursuing for quite a few years now (five under my direction, and more before my tenure). It’s a critical part of what makes Logos unique in its market. For example, we have not one but three Lexical Curators working on the Bible Sense Lexicon (one for Greek, two for Hebrew). To my knowledge (and Google’s), these appear to be the only people in the world with this job title. In fact, most people in the Content Innovation department at Logos are doing one kind of curation task or another.

So what is curation in the context of Bible study software, anyway? In the simplest terms, curation means organizing and maintaining a collection of things. In the museum context, that means the artifacts on display. Our kind of curation involves computer-readable data that captures knowledge relevant to biblical studies. You can summarize it briefly with three key practices:

  1. Collect
  2. Correct
  3. Connect

(“Describe” is probably a more apt term than “Correct”, but i just couldn’t resist the awesome alliteration.)

To “Collect” means imposing organization on a complicated and messy world of information, with an eye toward structuring it and making it useful in some way in the Logos system. Often this information is expressed in prose sentences in a reference book: for example, a Bible dictionary might describe a city, and include language about where the city was located, what larger region it was part of, where it’s mentioned in the Bible, etc. Fundamentally, a plan for collection requires deciding what information ought to be collected, and which things belong in the collection and which ones are excluded (in more technical terms, an ontology).

Capturing and formalizing knowledge typically involves some tradeoffs. First of all, we have to find categories and labels that balance utility and (formal) correctness. When it comes to categorization in particular, ultimately “everything is miscellaneous” (which is also the title of a really important book by David Weinberger), and if you push hard enough, each thing is its own unique category. But data sets are typically more useful if things are grouped in some way. So we categorize the following as “people” in our data set:

  • individuals (whether named or not)
  • groups, whether defined by residence (Greeks), common ancestry (Levites), belief systems (Pharisees and Sadducees), etc.
  • supernatural entities (including those that most Bible readers would accept like the God of Israel, and those that biblical authors argue are not real at all, like Baal, Asherah, or Zeus)

To “Correct” or “Describe” means that we choose and populate particular attributes for the items we’re curating. In the case of people, that includes things like names, gender, and roles. In addition, we create three special attributes for nearly every data set:

  • a unique identifier: though you’ll probably know who i mean when i use a biblical name like “David”, you won’t for “John” because there are five different individuals known by that name. This kind of ambiguity, and other variations (“John”, “John the Baptist”, “John Baptizer”, etc.) means names are usually poor identifiers. Instead, we create a symbol like “John.4” that uniquely identifies one particular item in our collection (in this case, a member of the Sanhedrin mentioned in Acts 4:6). Since we don’t show these to users, we don’t have to worry about people understanding them.
  • a label: since our data is for people to look at and understand, we need user-friendly ways to display an entity. Labels are typically brief (fewer than 20-25 characters), and also unique, so that in a drop-down list showing names that match “John”, i can distinguish “John (the Baptist)” from “John (Ac 4:6)”. Because the label is also unique, we could use it as the identifier: but data stability is a primary goal, so we separate the two and can change the label if necessary. Identifiers (not labels) are the means by which we connect data, so we (almost) never change them: doing so risks breaking the integrity of the data set.
  • a description: labels are brief so they take up minimum space, but consequently, they can’t carry much information. So we often provide a longer prose description, perhaps a sentence or two, that helps identify the entity and its most basic information. You could compare this to the leading sentence in a Wikipedia article. In the case of John.4, that’s “a member of the family of high priests in Jerusalem following Jesus’ ascension.”, which is probably enough to help you decide whether this is a John you want to know more about or not.
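
As a rough illustration of these three attributes, here is a minimal Python sketch. The class and field names are assumptions made for this example, not Logos' actual schema; the second entry uses a hypothetical identifier and a placeholder description, since only John.4 is spelled out above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entity:
    """One curated item. Field names are illustrative, not Logos' schema."""
    identifier: str   # stable key used to connect data; not shown to users
    label: str        # brief, unique, user-facing; allowed to change
    description: str  # a sentence or two of basic information


def labels_matching(entities, name):
    """Return the labels a drop-down might show for a typed name."""
    return sorted(e.label for e in entities if e.label.startswith(name))


people = [
    Entity("John.4", "John (Ac 4:6)",
           "a member of the family of high priests in Jerusalem "
           "following Jesus' ascension."),
    # Hypothetical identifier and placeholder description, illustration only.
    Entity("John.x", "John (the Baptist)", "(placeholder description)"),
]

# Changing a label leaves the identifier (and anything linked to it) untouched.
relabeled = replace(people[0], label="John (Sanhedrin)")
```

Keeping the identifier separate from the label is what makes the relabeling step safe: other data points at "John.4", never at the display text.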

To “Connect” means linking entities to other entities (or other data sets). For people, family relationships are an important way that people connect to each other. We also label those relationships (father, mother, sister), and, for biblical information, capture the textual sources that support each relationship (more technically, the provenance of the data).

Connecting information is one of the most important aspects of curation for Logos. While it may be interesting to learn that King David was also a shepherd, that’s an isolated fact. But if you can get a list of other individuals in the Bible who were also shepherds (or kings, or musicians), now you’re discovering new information. You might not have started out looking for this, or known how to find it for yourself.
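
To make the idea of connection concrete, here is a small Python sketch of labeled relationships with provenance, plus a lookup by role. The structure and names are assumptions for illustration; plain names stand in for real identifiers, and the data is a tiny hand-picked sample rather than an excerpt from the Logos data set.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    subject: str  # the first entity
    kind: str     # relationship label, e.g. "father"
    obj: str      # the second entity
    source: str   # provenance: the passage supporting the relation


# Illustrative sample data only.
roles = {
    "David": {"king", "shepherd", "musician"},
    "Solomon": {"king"},
    "Amos": {"shepherd"},
}
relations = [
    Relation("Jesse", "father", "David", "Ruth 4:22"),
    Relation("David", "father", "Solomon", "2 Samuel 12:24"),
]


def with_role(roles, role):
    """List everyone who holds the given role."""
    return sorted(name for name, held in roles.items() if role in held)


def related_to(relations, name):
    """List every relation in which the named entity takes part."""
    return [r for r in relations if name in (r.subject, r.obj)]
```

Starting from the isolated fact that David was a shepherd, `with_role(roles, "shepherd")` turns up the other shepherds in the sample, which is the kind of discovery described above.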

Question: which part of Logos’ curation process (Collect, Correct, Connect) do you find the most interesting or appealing? Please leave me a comment.

(Edit: saw a good piece today by John Chambers of Cisco about the power of connection. http://www.wired.com/insights/2012/12/the-internet-of-everything/)

Logos 5, Behind the Curtain: Introduction

I’ve fallen off in my blogging quite a bit over the last few years: my last post (a book review) was December 2011, and, other than conference reports and book reviews, it’s rather sparse for the year or two prior to that.

But i’m excited to be reviving my blog and starting a new series today to celebrate the release of Logos 5. In Bob Pritchett’s overview post on the Logos blog, he talks about the importance of connection in Bible study, and says [italics mine]:

Logos Bible Software 5 is a significant update that is all about connection.

This focus on connection is not just marketing talk or a conceptual metaphor. It describes in a very concrete way the important new datasets that make Logos 5 a major contribution to the world of biblical studies. I recognize that’s a mighty big claim, but i plan to back it up by describing the vision, architectures, effort, and technical approaches that have made the new features of Logos 5 a reality. In my role as Director of Content Innovation at Logos, i lead a talented and hard-working team of people who have spent the last several years making all these connections. (it wasn’t my idea, but that’s me in the “What’s new in Logos 5” video) Over this series of posts, i hope you’ll gain a better appreciation for what went into these new features, and what makes them so important.

I’d love to hear your comments and questions. For example: which of the new Logos 5 features do you find most useful?

Hyperlinks to Logos Resources

A few years back i blogged about links into Logos software as a kind of knowledge resource. This style of richly-hyperlinked information is increasingly becoming the standard way i try to communicate: it couples the basic textual content with doors that open into related areas.

With the release of Logos 4 (now a year ago!), there have been some significant changes both to how those links get expressed, and what kind of information can be linked to. So i recently wrote a post for the Logos Blog explaining how this works and why Logos users might care: Logos 4 Information Has an Address. If you’re a Logos user, i encourage you to check it out!

BibleTech:2010 Debrief

The BibleTech conference is an annual highlight for those of us who work at the intersection of Bible stuff and technology, and last week’s meeting in San Jose was no exception. This was the third BibleTech — i’ve been fortunate to have attended (and presented at) them all — and there’s always a great mix of new ideas, updates on ongoing projects, and lots of interesting people to talk to. (some other reviews: Rick Brannan, Mike Aubrey, Trey Gourley)

Some of the talks i liked best this year:

  • I was already interested in Pinax before hearing James Tauber’s talk on Using Django and Pinax for Collaborative Linguistics: now i’m itching to get started!
  • Stephen Smith had a nice analysis of the most frequently tweeted Bible passages (though the evidence of vast swaths of Scripture that get very little attention was perhaps a bit depressing).
  • Neil Rees showed Concordance Builder, a program that lets you use a Swahili concordance to bootstrap one for Welsh (or any other pair of languages) with no linguistic knowledge. Building on the Paratext tool, it leverages the verse indexes along with approximate string matching and statistical glossing (technical paper by J D Riding) to produce results that are about 90-95% correct out of the box. This can reduce concordance development to a matter of weeks rather than years.
  • There were several talks related to semantics in addition to mine: Randall Tan talked about more automated methods and fleshed them out relative to the higher-level structure of Galatians, and Andi Wu gave what looked like a really interesting presentation on semantic search based on syntax and cross-language correspondence (alas, i missed it).
  • Weston Ruter talked about APIs they’re developing at OpenScriptures.org (and brought in the Linked Data idea). Logos also unveiled their new API for Biblia.

I felt my talks went well and i got some good feedback. My slides are now posted (if you wrote down URLs at the conference, i didn’t get them quite right 🙁 but here they’re correct):

(As with some previous talks, i did my presentation with Slidy (previous post): i feel like it’s going a little more smoothly each time.)

Bob’s Talk at TOC

I blogged a funny story last week about Logos CEO Bob Pritchett’s attendance at the O’Reilly Tools of Change for Publishing (TOC) conference. But here’s a serious comment from Mark Coker of the Huffington Post that warrants quoting (italics are mine):

The Best Presentation at TOC

My favorite presentation of the conference was from Bob Pritchett of Logos Bible Software, in a session titled, Network Effects Support Premium Pricing. I remember attending his presentation four years ago at the first TOC in San Jose, so I knew I didn’t want to miss his presentation this time. They’re doing amazing stuff at Logos. They face an interesting challenge, one that every author and publisher faces: How do you compete against free? In their case, they sell about 10,000 bible study ebooks. How much has the bible changed over the last two hundred years? Not much. But what Logos excels at is making this information more accessible than ever before. They take a database-centric view of their vast and ever-growing library of content.

When you purchase a book from them, you’re not just getting a static ebook, you’re buying into a dynamic, integrated online application environment that becomes richer with each new publication, and with each new member to their community. Even if Bible study isn’t your thing, check them out for future-of-publishing inspiration. I can’t do them justice here.

High praise indeed from somebody who isn’t necessarily into Bible study, but recognizes that what Logos is doing is really quite unique in the entire publishing industry. Our “database-centric views” are only getting stronger, so you can expect to hear more about this in the months to come.

Building Data Applications – One Piece at a Time

My colleague Steve Runge (Logos bio, blog) made a new connection for me today, between the kind of data work we do at Logos and an old Johnny Cash song. I won’t spoil the surprise if you haven’t heard the song (and we don’t do it by stealing!), but there’s a commonality to the methodology: fact by fact, relation by relation, that’s the way to build a database. And with enough time and perseverance, when you’re done you too can say “… ’cause I have the only one there is around.”

Johnny Cash – One Piece At a Time – on YouTube.

Bookmarklets Redux

Time spent on the web can be oh-so tedious if you’re constantly cutting things from one page and pasting them elsewhere just to get to another, related page. Someday Linked Data may make this all better, but until then, we all get by with helpful tricks.

Bookmarklets are one essential weapon in the arsenal of the web-info-warrior. Usually they’re little JavaScript programs stored as a bookmark in Firefox, providing one-click access to some simple functionality like looking things up elsewhere, resizing your window, etc. I’ve blogged previously about bookmarklets to find local library sources for a book on an Amazon page (or PaperBackSwap).

I dusted off my bookmarklet skills this past week and came up with some nifty tools that i wanted to share.

First off, imagine you’re looking at a website with Bible references whose benighted author somehow failed to include RefTagger. So rather than a nice pop-up with the text of the reference, or even a helpful link to that text on some Bible site, you’re just looking at an inanimate, unlinked string: boo. The Bible Reference Bookmarklet to the rescue! Simply select the text of the reference, click the bookmarklet, and you’ll be whisked off to that reference at Bible.Logos.com. If you haven’t selected any text first, you get a dialogue box asking for it.

To get this goodie in Firefox, first make sure the Bookmarks Toolbar is showing (View > Toolbars > Bookmarks Toolbar must be checked). I’d love to give you a link to just drag onto the toolbar, but i don’t seem to be able to get the code past WordPress. So go to Bookmarks > Organize Bookmarks, and select Organize > New Bookmark. Give it a useful name like “Bible Reference Lookup”, and paste the code below in the Location field.


After you’ve clicked ok, you should see it on your toolbar.

You can do similar tricks for a wide variety of strings that you just want to look up elsewhere (i discovered one here while writing this post that lets you look up articles on Wikipedia). This isn’t fundamentally different from copying the string into a search box: but sometimes it’s more convenient.

Descending into more esoteric purposes (to give you ideas for your own bookmarklets): as part of an earlier post on Tools for Personal Knowledge Management, i mentioned my use of TiddlyWiki for quick organization of hyperlinked notes. Like other wiki software, TiddlyWiki has its own link syntax, that looks like

[[Link text | URL]]

When linking to lots of other web pages, i was getting tired of copying the URL, pasting that in, then typing the square brackets, link text, vertical bar, and more square brackets, all in the right format. Wouldn’t it be more convenient to just construct this expression from the title of the page and its URL, rather than having to type it myself? YES! and the TiddlyWiki Page Link bookmarklet does just that, putting the result in a little pop-up window where a triple-click selects the whole thing, ready to copy and paste into your tiddlywiki (and tailor as desired: the title isn’t always what you want, but it’s often easier to edit and throw things out rather than type afresh). This one you can just drag to your bookmarks toolbar and use right away.

TiddlyWiki Page Link
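
The bookmarklet itself is JavaScript, but the string construction it performs is simple enough to sketch in a few lines of Python. This is an illustration of the idea only, not the bookmarklet's code: the function name is made up, and omitting the spaces around the vertical bar is an assumption about what the wiki expects.

```python
def tiddly_link(title: str, url: str) -> str:
    """Build a TiddlyWiki-style link from a page title and its URL."""
    # Collapse runs of whitespace, and replace any vertical bar in the
    # title, since the bar separates link text from URL in this syntax.
    text = " ".join(title.split()).replace("|", "-")
    return f"[[{text}|{url}]]"


# Example with a made-up page title and URL:
print(tiddly_link("  Example   Page ", "http://example.com/"))
```

The result still needs a human eye, as noted above: page titles often carry site names or other noise you will want to trim.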

Also, i’ve switched to a much better library lookup bookmarklet (and a service to help you create one for your local library) from WorldCat. Among other things, it generates the list of all the different ISBNs that might exist for a title (which can be very long indeed), and when there are many, it provides links for alternate searches in case the first group comes up empty-handed.

Some other cool bookmarklets in my collection include:

  • CiteULike Popup Post and kin to make it easy to add (certain kinds of) articles to your reading list management. Adds more value for sources whose structure it understands.
  • Show del.icio.us citations of the current URL (you can find it there)
  • Resize your browser window to 1024 x 768 (if you want to see how a page will look on a smaller monitor or projector): the bookmarklet follows, just drag to your toolbar. 1024 x 768
  • A CSS validator for the current page: see Pete Freitag’s page.

Hat tips:

Ruter: Open Scriptures: Picking up the Mantle of the RE:Greek-Open Source Initiative

The background of this talk: Zack Hubert’s talk from the last BibleTech. Zack developed a very useful web site which ultimately failed because he couldn’t maintain it, and couldn’t get other developers to pitch in and help.

The vision: an open web repository for integrated scriptural data and a platform for building applications of scripture (OpenScriptures.org). What kinds of data? Manuscripts, translations, versification systems, morphosyntactic parsings, user tags/annotations/cross-references. But it takes a lot of effort to get started with all this data, each of which is typically in its own format, and unlinked to other data.

Linked data principles (from timbl):

  • use URIs as names for things
  • use HTTP URIs so that people can look up those names
  • provide useful information behind the URIs
  • and links to other URIs so they can discover more things

“… the more things you have to connect together, the more powerful it is.” Can we connect things together through a unified manuscript that links together semantic units (words, phrases, clauses)?

Manuscript unification: normalize a manuscript (lowercase and remove diacritics: no spelling normalization yet), insert and save links to the unified manuscript. Then for additional manuscripts, normalize, merge links, and save them. Now you’ve got all the attested readings linked together. This unified manuscript now has an automated critical apparatus. [demo here of the manuscript comparator]
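
The normalize-and-link steps can be sketched roughly in Python as follows. This is a toy illustration, not the Open Scriptures implementation: aligning words by position is a simplifying assumption, and the manuscript names and texts are made up (two differently accented copies of the same phrase).

```python
import unicodedata
from collections import defaultdict


def normalize(word: str) -> str:
    """Lowercase and strip diacritics; no spelling normalization."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def unify(manuscripts):
    """Link each manuscript's words into one unified text.

    Keys are (position, normalized word); values map a manuscript name
    to the spelling it actually attests at that position.
    """
    unified = defaultdict(dict)
    for name, text in manuscripts.items():
        for position, word in enumerate(text.split()):
            unified[(position, normalize(word))][name] = word
    return dict(unified)


# Toy data: hypothetical manuscript names, same words, different accenting.
manuscripts = {
    "ms_a": "Ἐν ἀρχῇ ἦν ὁ λόγος",
    "ms_b": "εν αρχη ην ο λογος",
}
unified = unify(manuscripts)
```

In this sketch a position with more than one normalized form would mark a variant reading, which is the raw material for the automated apparatus mentioned above. A real implementation would need proper sequence alignment to cope with added or omitted words.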

Potential applications include:

  • translation comparator (can also help with the versification problem)
  • comprehensive concordance
  • translation-independent cross-references (e.g. NT quotations of the OT)
  • interlinear/bilingual editions

You can automatically link manuscripts in the same language, but not different languages. Use collective intelligence to capture semantic linking between languages. Use the “games with a purpose” (GWAP) approach to gather links.

Copyright is a major challenge: you can’t link texts together if you can’t access them, and you can’t share them if they’re not open. Recently MorphGNT texts have been taken down from several sites because they’re not freely sharable. If the key benefit is connections between data, then data (including texts) should be more valuable if they’re sharable and connected. One solution: an Open Scriptures Platform that connects content owners, developers, and end-users. Passionate developers could build applications based on content licensed to Open Scriptures (as a proxy), and Open Scriptures makes sure that end-users provide revenue to content owners.