God’s Word | our words
meaning, communication, & technology
following Jesus, the Word made flesh
March 29th, 2007

Name Weights for Biblical Characters, Take 2

I had a few goofs in the code that computed the name weights in this post, including a parenthesis typo that caused an operator precedence problem and (egad!) having the weight factors for frequency and dispersion reversed.

Here’s a corrected graph, with a larger version linked behind it.

Top 50 Biblical Characters by Frequency and Dispersion (medium size), Take 2

Jeremiah and Joab have each climbed up 5 ranks, their (relative) frequency and dispersion being somewhat different from their neighbors. Otherwise, most of the top 50 are within one or two places of their previous (incorrect) rank.

I’ve left the original there to maintain my humility :-/

March 27th, 2007

Name Weights for Biblical Characters

[Update (3/29/2007): there’s an important correction here.]

Too many projects chasing too little time means you have to prioritize (in fact, if you don’t have to prioritize, you ought to question whether you’re doing anything worthwhile at all!). I’ve been spending a lot of time lately looking at the Biblical People data in Logos Bible Software, getting ready to incorporate the work i’ve done on New Testament Names, and then to go beyond that to the Bible Knowledgebase.
While we’ve got a lot of interesting data about characters in the Bible, there’s still not as much as i’d like: so how to prioritize additional development? Well, one practical answer is to assign a weight to each one, then start at the top and work your way down, and stop when it’s time to stop (typically because money, enthusiasm, or both are exhausted). While this isn’t a perfect approach to resource development, it’s a pretty good one, and pretty good is often good enough when you’re in new territory anyway.

Since we’ve got the data that maps people to the passages that refer to them, it’s pretty easy to go through and count. Note an important detail: i really do mean references to people, not just strings. It’s not enough to find the string “Judah” in a verse: you want to know when it’s Judah the person, as opposed to a cover term for Israel or the Southern Kingdom. For hard cases like Judah, the only way to know that is to go through verse by verse and decide. For many other cases, while the string is only used to refer to people, there are numerous people with the same name. Zechariah is the Big Daddy here: there are 30 distinct ones in our database (this author found 31, but i haven’t gone through them to determine where the discrepancy lies). So just counting occurrences of “Zechariah” doesn’t get it right either: you have to count all 30 cases separately (for those who like skipping to the end of murder mysteries, the prophet Zechariah, whose prophecies are recorded in the book of the same name, gets mentioned the most). You also have to know which names refer to the same person: Simon is Peter is Cephas, and any instance of those names (that refers to Jesus’ disciple, as opposed to one of the other Simons) counts. So you need some real data to be able to do a reasonable job with this computation.

There are a lot of different ways you could count: here’s one. Let frequency be a count of the number of verses that mention a given individual (only counting one for verses like Luke 22:31, “Simon, Simon, Satan has desired to sift you like wheat”, which shouldn’t really count as two observations of Simon’s significance as a Biblical character). Let dispersion be the number of books of the Bible that mention the individual. The intuition here is that, for two individuals with the same frequency, the one that’s mentioned in more books is probably more important, broadly speaking. Normalize each of these by their maximum values (max frequency is 1370, max # of books is 31) just to scale things a little more nicely. Then assign a weight to each of these factors (i used 0.6 for frequency and 0.4 for dispersion) and combine them to get a number between 1 and 0.
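The computation described above can be sketched roughly as follows. This is only an illustration, not the code behind the graph: the function name `name_weights`, the tuple layout, and the tiny sample are all assumptions, and the person ids are assumed to be already disambiguated (so Simon, Peter, and Cephas share one id).

```python
from collections import defaultdict


def name_weights(mentions, w_freq=0.6, w_disp=0.4):
    """Combine normalized frequency and dispersion into one weight per person.

    `mentions` is an iterable of (person_id, book, chapter, verse) tuples.
    The layout is hypothetical; the real data format is not shown in the post.
    """
    verses = defaultdict(set)  # person -> distinct verses mentioning them
    books = defaultdict(set)   # person -> distinct books mentioning them
    for person, book, chapter, verse in mentions:
        # A set counts a verse once, even if the name appears twice in it.
        verses[person].add((book, chapter, verse))
        books[person].add(book)
    if not verses:
        return {}
    max_freq = max(len(v) for v in verses.values())
    max_disp = max(len(b) for b in books.values())
    return {
        person: w_freq * len(verses[person]) / max_freq
        + w_disp * len(books[person]) / max_disp
        for person in verses
    }


# Made-up sample: "Simon, Simon" in Luke 22:31 appears twice but counts once.
sample = [
    ("Peter", "Luke", 22, 31),
    ("Peter", "Luke", 22, 31),
    ("Peter", "John", 1, 42),
    ("Levi", "Genesis", 29, 34),
]
weights = name_weights(sample)
ranked = sorted(weights, key=weights.get, reverse=True)
```

Because both factors are divided by their maximum before weighting, the top-scoring person can reach 1.0 only by leading on both frequency and dispersion.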

Here’s what the graph looks like for the top 50 (there are a total of 2987 men and women), ranked by the composite metric (the green line) that combines frequency (the blue line) and dispersion (the red line). The image is linked to a larger version where you can actually read the names.

Top 50 Biblical Characters by Frequency and Dispersion

While the top names (Jesus, David, Moses, Jacob, Abraham) are no surprise, there are some interesting observations to make from this. First, the composite metric really does change the rankings: Levi is #14 by this method, but #53 if you only ranked by frequency. Likewise, Shaul (King Saul) would be #52 if you only ranked by dispersion, because he’s mentioned in just a few books: but he’s clearly one of the most important characters in those books, and so it seems fitting that incorporating frequency boosts him up to #15 in the composite metric rank. You can see graphically where the red and blue lines approach each other (the cases where frequency and dispersion are more equal), and where they’re farthest apart (Judah’s a good example) and the two are most skewed. Back to the previous point about counting genuine person name instances versus strings: only 99 of the approximately 780 occurrences of “Judah” actually refer to Jacob and Leah’s son, so counting strings would be pretty misleading.

Since names, like many linguistic phenomena, typically follow a Zipfian Distribution (sometimes called a “long tail” or power law distribution), it’s no surprise that the majority (1634 of the 2987) of these names occur exactly once in the Bible, and the 59 most frequent names account for about half of all the name mentions in the Bible. So clearly these top names deserve much more attention than the long tail.
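For anyone wanting to check figures like these against their own counts, here is a minimal sketch. The function name `long_tail_summary` and the toy numbers are assumptions for illustration, not the actual data.

```python
def long_tail_summary(counts):
    """Return (singletons, top_k) for a list of per-name mention counts.

    singletons: how many names are mentioned exactly once.
    top_k: the smallest number of most-frequent names whose mentions
           add up to at least half of all mentions.
    """
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    singletons = sum(1 for c in ordered if c == 1)
    running = 0
    for k, c in enumerate(ordered, start=1):
        running += c
        if 2 * running >= total:
            return singletons, k
    return singletons, 0  # only reached for an empty list


# Toy counts, not the real data: one dominant name and a tail of singletons.
toy_counts = [10, 5, 2, 1, 1, 1]
singletons, top_k = long_tail_summary(toy_counts)
```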

Important disclaimer: i’m not making any claims here about theological or historical importance. That’s a subjective matter, and you’d get different answers depending on your perspective. For example, John the Baptist doesn’t even make the top 50, but by Jesus’ own words in Luke 7:28 “among those born of women none is greater than John.” (ESV) So clearly his importance isn’t measured particularly well by this approach. But as a general approximation to how important different names are across the Bible, this isn’t a bad start. To be completely thorough, you’d also want to count pronoun (“she”) and descriptive (“the woman”) references: but we don’t have data for those yet.

    March 26th, 2007

    OpenBible.info and Bible Place Names

    OpenBible.info has posted a set of 1275 Bible place names (based on a list provided by the ESV folks) along with geocoding data and passages that refer to them. There’s also a nice thumbnail atlas page, and easy instructions for how to load the data into the free and visually gorgeous Google Earth application.

    This is a fantastic contribution to the world of Bible reference data. The opening post, Why This Site?, has a telling comment:

    It’s weird that no one’s ever collected basic biblical data—such as the locations of all the places in the Bible—into an accessible format.

    It is a bit weird: but it’s only fairly recently that the vision for self-published, re-usable data has started to catch fire, and people naturally tend to think first about applications for humans rather than data for machines (i grumbled a little about Bible mapping applications and information stovepipes in this previous post: this data answers my grumbling). It’s like the difference between an artifact and a tool: if i build a birdhouse, people can immediately understand what it’s for and put it to use. But given a hammer, nails, and lumber, they can also build more birdhouses for themselves, as well as doghouses, and even new things i never imagined that aren’t like birdhouses at all. In the long term, it’s data that makes new applications and capabilities possible (a familiar Blogos refrain). This is just the kind of information that will form the foundation of the Bible Knowledgebase (in fact, placenames are next on my development roadmap), and i’m thrilled this data has been made available.

    I incorporated an earlier, much more limited version of this kind of data, done by the Google Earth community, in my SBL talk last November: that covered about 80 New Testament place names (there were about 200 from the whole Bible), and the geocoding data were subsequently included in the last release of the New Testament Names database (unfortunately, OWL data is not very user-friendly). In fact, mapping applications were my key example of why we need semantically-organized data: how else can you distinguish Antioch (in Syria) from Antioch (in Pisidia), or know that the seas of Chinnereth/Chinneroth, Tiberias, Gennesaret, and Galilee all refer to the same body of water?

    A further step toward making this data both explicit and useful would be a slightly clearer notion of which areas include, or are included in, others. The technical terms here are holonym (from the whole to the parts) and meronym (from the parts to the whole): so the Areopagus is part of (that is, a meronym of) Athens, and Athens is part of Greece or Achaia. You can see that in the small subset of the data below, where Athens is the root form, and the holonyms are designated by the same latitude and longitude, prefixed with a greater than sign (the meronym, the Areopagus, gets a less than sign):

    Achaia Athens >37.98333333333333 >23.73333333333333
    Areopagus Athens <37.98333333333333 <23.73333333333333
    Athens   37.98333333333333 23.73333333333333
    Greece Athens >37.98333333333333 >23.73333333333333

    Part of this funniness is that we don’t really know either the “center” (which is represented in the data) or the exact boundaries (which are not represented) of the region formerly known as Achaia. But we do know that Athens was a city, not a region, and that cities in general are meronyms of regions: likewise, the Areopagus was a building or site within the city. Providing explicit semantic types for the places (Athens ISA City), and part/whole relationships (Athens subRegionOf Achaia) would advance this data even further. But this is already a great beginning, and hopefully a sign of more to come in the general endeavor of capturing Bible reference information in reusable ways.
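As a rough illustration of how the greater than / less than prefixes in the sample rows above could be turned into explicit part/whole relations, here is a minimal sketch. It assumes whitespace-separated, single-word place names and exactly the column layout shown; the function name `parse_place` and the relation labels are my own inventions, not part of the OpenBible.info data.

```python
def parse_place(line):
    """Parse one row of the sample data into a small dict.

    Assumed layout (inferred from the four rows shown, not a documented spec):
      name root lat lon   -> a place related to `root`
      name lat lon        -> the root place itself
    A '>' prefix marks a region containing the root; '<' marks a part of it.
    """
    tokens = line.split()
    if len(tokens) == 3:
        name, lat, lon = tokens
        root = name
    elif len(tokens) == 4:
        name, root, lat, lon = tokens
    else:
        raise ValueError(f"unexpected row: {line!r}")
    marker = lat[0] if lat[0] in "<>" else ""
    relation = {">": "contains", "<": "part_of", "": "same"}[marker]
    return {
        "name": name,
        "root": root,
        "relation": relation,
        "lat": float(lat.lstrip("<>")),
        "lon": float(lon.lstrip("<>")),
    }


rows = [
    "Achaia Athens >37.98333333333333 >23.73333333333333",
    "Areopagus Athens <37.98333333333333 <23.73333333333333",
    "Athens   37.98333333333333 23.73333333333333",
    "Greece Athens >37.98333333333333 >23.73333333333333",
]
places = [parse_place(row) for row in rows]
```

Note that this only recovers a relation to the root form: the data as given still can't say that Athens is a city or Achaia a region, which is the gap the explicit semantic types would fill.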

    March 25th, 2007

    Blogging for the Long Term

    Somebody from IVPress was kind enough to email me that a link in one of my posts to one of their titles was pointing to the wrong place, and would i please fix it (who knows for how many hundreds of others this is true!). This was a post from December 2003: since then, i’ve changed my blog hosting service, blogging software (Radio Userland to WordPress), page layout, my laptop, and who knows what else. So the only possible fix was to download the HTML file, edit the raw file, FTP it back, and hope i didn’t break anything else in the process. How quickly our recently past technology becomes brittle!
    What makes hyperlinks so easy and appealing — an immediate connection from one page to another — is of course also what makes them so fragile. Putting one layer of indirection in the middle — a link catalog, for example, that i own and maintain — would let me keep them pointing to the right place for as long as i cared to maintain it. And in some respects, that’s how my blog works: it’s my own catalog of things i once found interesting, and that (occasionally, as for example when prompted by others) i can update to point someplace new. It’s not general, it’s not flexible, but it mostly works.
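The layer of indirection described above might look something like this minimal sketch. Everything in it is hypothetical: the `LinkCatalog` class, the `{{link:key}}` placeholder syntax, and the example.com URLs are illustrations, not a description of how this blog actually works.

```python
import re


class LinkCatalog:
    """Map stable keys to current URLs, so posts never hard-code a target."""

    _PLACEHOLDER = re.compile(r"\{\{link:([\w-]+)\}\}")

    def __init__(self):
        self._targets = {}

    def register(self, key, url):
        # Registering an existing key again simply repoints it.
        self._targets[key] = url

    def render(self, text):
        # Replace each {{link:key}} with its current URL; unknown keys
        # are left untouched so the breakage stays visible.
        def substitute(match):
            return self._targets.get(match.group(1), match.group(0))

        return self._PLACEHOLDER.sub(substitute, text)


catalog = LinkCatalog()
catalog.register("ivp-title", "https://example.com/old-location")
post = "A good book: {{link:ivp-title}}"
before = catalog.render(post)

# When the publisher moves the page, one update fixes every post that uses the key.
catalog.register("ivp-title", "https://example.com/new-location")
after = catalog.render(post)
```

The trade-off is the one noted above: the links stay correct only for as long as somebody keeps maintaining the catalog.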
    Of course, there’s an inherent tension between the quick, informal nature of blogging and considerations of permanence: this isn’t “the scholarly literature”, after all. But why shouldn’t it become more and more that way, and why shouldn’t we try? Like Jon Udell, i’d like my web writing to remain valuable for as long as feasible, given my modest investment of time.

    March 22nd, 2007

    New Series: Building the Bible Knowledgebase

    Since coming to Logos Bible Software, i’ve had the wonderful opportunity to take what was previously a spare-time labor of love (mostly published at SemanticBible and discussed in this blog) and turn it into a full-time effort. Consequently, i’ve spent most of the last two months starting to build the infrastructure and foundations of a major new semantic web effort at Logos that i’m currently calling simply the Bible Knowledgebase (BK for short).

    I first got excited about these ideas more than 3 years ago, when i envisioned a semantic annotation layer on top of Scripture to provide meaning-based automated processing and integration with other resources. I started building this knowledgebase from the bottom up with New Testament Names (overview, 2006 SBL presentation), an OWL ontology and set of instance data describing each named thing in the New Testament. Logos has been working on similar kinds of resources for some time: their Biblical People feature is a rich set of information about named people in the entire Bible (both Old and New Testaments) that disambiguates other people with the same name, describes their family relations, and provides all the Scripture references where they are mentioned.

    Starting from these key digital resources (which, by the way, are virtually unique in the world of Biblical studies, and still rare in general), my goal is to build a machine-readable general knowledgebase of semantic reference information about the Bible. I’ll provide more details about what this means and why it matters in a follow-on post: but i’m finding this to be a tremendously exciting opportunity. It’s also a tremendous engineering challenge that will take significant infrastructure and long-term, incremental development.

    I didn’t come to this task empty-handed: i’ve worked with concepts, standards, and tools from the Semantic Web for several years now. I also had the privilege of working with several former colleagues at BBN Technologies who were part of the development of OWL, the Web Ontology Language, and expert in Semantic Web development. But there’s a big difference between a personal hobby project and a full-blown effort that will scale up by 100-1000 times and support future development of new Logos products and capabilities (and oh, by the way, it had better be industrial strength, maintainable, and extensible). While i’m learning a lot along the way, much of it comes from hands-on experience and trial-and-error learning: there isn’t a wealth of practical information out there about how to build large Semantic Web knowledgebases. So i wanted to share my experiences, to leave a trail for others who may come this way in the future. I also hope to hear from those whose experience may keep me from falling into the numerous traps that line the path: so please send me (that’s Sean) your email feedback, [myfirstname] at logos [the dot goes here] com with relevant comments and pointers.
    You should expect a lot of technical detail, but i plan to focus on practical implementation over theory. If you want to follow this series only, here’s the RSS feed for the Bible Knowledgebase category.

    March 6th, 2007

    Reading: Clock of the Long Now

    Stewart Brand has a long history of thinking ahead, going all the way back to the Whole Earth Catalog. He also has a taste for symbols that communicate powerful truths: one example is his involvement, with Danny Hillis (founder of Thinking Machines), in the Long Now Foundation, which aims to build a clock that will run for 10,000 years. Another founder, musician Brian Eno, had an experience with a friend in New York that demonstrated how small our frame of reference typically is:

    “I realized that the ‘here’ she lived in stopped at her front door … ‘Now’ meant ‘this week’. … No one had any investment in any kind of future except their own, conceived in the narrowest terms. I wrote in my notebook that December, ‘More and more I find I want to be living in a Big Here and a Long Now.’”

    Brand’s book is an eclectic mix of reflections on history, religion, what moves fast and slow in civilizations, digital permanence, book burnings, and Big Ben. It’s also chock-full of one-liners:

    • “Fast gets all our attention, slow has all the power.”
    • “The great problem of the future is that we die there.”
    • “The debt we cannot repay our ancestors we pay our descendants.”

    I recommend Brand’s book as a helpful and readable reflection on the importance of the Long Now. The way we think about our actions definitely changes when we consider the long-term impact on the world our children will inhabit. Joel 1:3 is one of many passages in the Old Testament enjoining a long-term view by passing information across the generations: Joel’s response to the horrific plague of locusts he described was

    Tell your children of it,
    and let your children tell their children,
    and their children to another generation. (Joel 1:3, ESV)

    Christians, of all people, ought to be invested in a Long Now: an eternity in which we have only begun to live.

    March 5th, 2007

    Same Spam, Different Day

    I had my previous email address for 19 years, long before spam had become the plague it is today: so i freely posted it to newsgroups and the like throughout the 90’s. Consequently my email address was sprinkled around the Web, and, no surprise, i got lots of junk email (happily our corporate systems for filtering it have also improved over those years).

    When i moved to Logos and got a brand spankin’ new email address, i figured i’d have the chance to leave that exposure behind, and adopt new policies better suited to the new realities of spam. I’m careful about what email i read, who i send to, and what i respond to. I have several junk addresses for high-exposure on-line signups. I’ve tried several tricks to keep spammers from harvesting my address off websites i control (my current favorite is “email to Sean Boisen, [my first name] @” followed by my domain name, figuring only a human would be able to reconstruct that).

    I’ve been at Logos for all of 6 weeks now, and today i got my first spam: the typical nondescript subject line with a .gif attachment, probably a virus. It was trapped by our corporate system, happily, and i wouldn’t have opened it anyway. There’s no way to know how it leaked out so soon (i suspect Hotmail/Messenger may be the culprit). But how sad …

    March 3rd, 2007

    The “Jesus Tomb” and Statistical Analysis

    Easter must be approaching: last Monday, James Cameron, director of the movie Titanic, held a news conference to publicize his documentary (showing this weekend) about the discovery of a tomb in Jerusalem with ossuaries (bone boxes) bearing names like “Jesus Son of Joseph”, “Judas son of Jesus”, Mary, etc. No doubt it will be full of dramatic music and probing questions: “could this be …”, “is it possible that …”, and so forth. The controversial allegation that they hope will draw in viewers, which they characterize as “what may be the most explosive archaeological discovery of all time”, is the claim that this could be the burial tomb of Jesus of Nazareth, a woman whose name can be connected to Mary Magdalene, and even members of his family.

    Here’s the thing: this archaeological discovery is more than 25 years old, and the interest in it from Biblical scholars since then has been, well, less than explosive. Ben Witherington has a good summary of the issues, and my colleague Mike Heiser has been following it closely: there’s also a collection of links at the Countercult Apologetics blog.

    What’s almost as annoying as the annual pre-Easter attempt to cash in on controversial religious claims is their strategy of alleging scientific respectability, rather than actually presenting evidence. That includes DNA tests (sounds scientific, right?) which demonstrate that the remains in the “Jesus” box aren’t genetically related to those in the “Mariamene e Mara” box. What else would you expect, if the remains are of a husband and a wife? There’s nothing the least bit interesting about that, unless you first accept the unsupported claim that this may actually be the tomb of Jesus of Nazareth.
    They go on with some statistical analysis about the names in the tomb, and how unlikely it is that the names found would match names of Jesus’ alleged family. Their website cites a study by Andrey Feuerverger, a professor of statistics and mathematics at the University of Toronto with a strong publication record: not the kind of guy to go off spouting nonsense. According to the Discovery Channel website, “he recently conducted a study addressing the probabilities that will soon be published in a leading statistical journal.” Without a careful reading of that yet-unavailable work (not releasing the study prior to airing of the documentary does seem a bit suspicious), it’s not possible to form a final judgement. But the statements on the website about the statistical analysis range from incautious to downright misleading. Here’s their bottom line:

    The study concludes that the odds are at least 600 to 1 in favor of the Talpiot Tomb being the Jesus Family Tomb. (emphasis mine)

    That’s just plain wrong. If Feuerverger’s statistical analysis is correct (more about that below), what it shows is the odds of finding five ossuaries in the same tomb with these particular names. The leap to “and therefore it’s highly likely that this is the family of Jesus of Nazareth” is simply that, an unsupported leap of fancy (or, more likely, sensationalism). Just because you found a tomb with the names John, Paul, and Gregory, that doesn’t mean they’re the Beatles, even if you concoct a story about how Gregory is really a variant of George. You need evidence (beyond the likelihood of names) to support that: merely alleging it doesn’t shift the burden to others to disprove their hypothesis. (More details here about the names by Richard Bauckham, a bona fide Biblical scholar, and why they argue against, rather than in favor of, the hypothesis).