Name Weights for Biblical Characters, Take 2

I had a few goofs in the code that computed the name weights in this post, including a parenthesis typo that caused an operator precedence problem and (egad!) having the weight factors for frequency and dispersion reversed.

Here’s a corrected graph, with a larger version linked behind it.

Top 50 Biblical Characters by Frequency and Dispersion (medium size), Take 2

Jeremiah and Joab have each climbed up 5 ranks, their (relative) frequency and dispersion being somewhat different from their neighbors. Otherwise, most of the top 50 are within one or two places of their previous (incorrect) rank.

I’ve left the original there to maintain my humility :-/

Name Weights for Biblical Characters

[Update (3/29/2007): there’s an important correction here.]

Too many projects chasing too little time means you have to prioritize (in fact, if you don’t have to prioritize, you ought to question whether you’re doing anything worthwhile at all!). I’ve been spending a lot of time lately looking at the Biblical People data in Logos Bible Software, getting ready to incorporate the work i’ve done on New Testament Names, and then to go beyond that to the Bible Knowledgebase.
While we’ve got a lot of interesting data about characters in the Bible, there’s still not as much as i’d like: so how to prioritize additional development? Well, one practical answer is to assign a weight to each one, then start at the top and work your way down, and stop when it’s time to stop (typically because money, enthusiasm, or both are exhausted). While this isn’t a perfect approach to resource development, it’s a pretty good one, and pretty good is often good enough when you’re in new territory anyway.

Since we’ve got the data that maps people to the passages that refer to them, it’s pretty easy to go through and count. Note an important detail: i really do mean references to people, not just strings. It’s not enough to find the string “Judah” in a verse: you want to know when it’s Judah the person, as opposed to a cover term for Israel or the Southern Kingdom. For hard cases like Judah, the only way to know that is go through verse by verse and decide. For many other cases, while the string is only used to refer to people, there are numerous people with the same name. Zechariah is the Big Daddy here: there are 30 distinct ones in our database (this author found 31, but i haven’t gone through them to determine where the discrepancy lies). So just counting occurrences of “Zechariah” doesn’t get it right either: you have to count all 30 cases differently (for those who like skipping to the end of murder mysteries, the prophet Zechariah, whose prophecies are recorded in the book of the same name, gets mentioned the most). You also have to know which names refer to the same person: Simon is Peter is Cephas, and any instance of those names (that refers to Jesus’ disciple, as opposed to one of the other Simon’s) counts. So you need some real data to be able to do a reasonable job with this computation.

There are a lot of different ways you could count: here’s one. Let frequency be a count of the number of verses that mention a given individual (only counting one for verses like Luke 22:31, “Simon, Simon, Satan has desired to sift you like wheat”, which shouldn’t really count as two observations of Simon’s significance as a Biblical character). Let dispersion be the number of books of the Bible that mention the individual. The intuition here is that, for two individuals with the same frequency, the one that’s mentioned in more books is probably more important, broadly speaking. Normalize each of these by their maximum values (max frequency is 1370, max # of books is 31) just to scale things a little more nicely. Then assign a weight to each of these factors (i used 0.6 for frequency and 0.4 for dispersion) and combine them to get a number between 1 and 0.

Here’s what the graph looks like for the top 50 (there are a total of 2987 men and women), ranked by the composite metric (the green line) that combines frequency (the blue line) and dispersion (the red line). The image is linked to a larger version where you can actually read the names.

Top 50 Biblical Characters by Frequency and Dispersion

While the top names (Jesus, David, Moses, Jacob, Abraham) are no surprise, there are some interesting observations to make from this. First, the composite metric really does change the rankings: Levi is #14 by this method, but #53 if you only ranked by frequency. Likewise, Shaul (King Saul) would be #52 if you only ranked by distribution, because he’s mentioned in just a few books: but he’s clearly one of the most important characters in those books, and so it seems fitting that incorporating frequency boosts him up to #15 in the composite metric rank. You see graphically from where the the red and blue lines approach the cases where frequency and distribution are more equal, and places where they’re farthest apart (Judah’s a good example) where they’re most skewed. Back to the previous point about counting genuine person name instances versus strings: only 99 of the approximately 780 occurrences of “Judah” actually refer to Jacob and Leah’s son, so counting strings would be pretty misleading.

Since names, like many linguistic phenomena, typically follow a Zipfian Distribution (sometimes called a “long tail” or power law distribution), it’s no surprise that the majority (1634 of the 2987) of these names occur exactly once in the Bible, and the 59 most frequent names account for about half of all the name mentions in the Bible. So clearly these top names deserve much more attention than the long tail.

Important disclaimer: i’m not making any claims here about theological or historical importance. That’s a subjective matter, and you’d get different answers depending on your perspective. For example, John the Baptist doesn’t even make the top 50, but by Jesus’ own words in Luke 7:28 “among those born of women none is greater than John.” (ESV) So clearly his importance isn’t measured particularly well by this approach. But as a general approximation to how important different names are across the Bible, this isn’t a bad start. To be completely thorough, you’d also want to count pronoun (“she”) and descriptive (“the woman”) references: but we don’t have data for those yet.

    New Series: Building the Bible Knowledgebase

    Since coming to Logos Bible Software, i’ve had the wonderful opportunity to take what was previously a spare-time labor of love (mostly published at SemanticBible and discussed in this blog) and turn it into a full-time effort. Consequently, i’ve spent most of the last two months starting to build the infrastructure and foundations of a major new semantic web effort at Logos that i’m currently calling simply the Bible Knowledgebase (BK for short).

    I first got excited about these ideas more than 3 years ago, when i envisioned a semantic annotation layer on top of Scripture to provide meaning-based automated processing and integration with other resources. I started building this knowledgebase from the bottom up with New Testament Names (overview, 2006 SBL presentation), an OWL ontology and set of instance data describing each named thing in the New Testament. Logos has been working on similar kinds of resources for some time: their Biblical People feature is a rich set of information about named people in the entire Bible (both Old and New Testaments) that disambiguates other people with the same name, describes their family relations, and provides all the Scripture references where they are mentioned.

    Starting from these key digital resources (which, by the way, are virtually unique in the world of Biblical studies, and still rare in general), my goal is to build a machine-readable general knowledgebase of semantic reference information about the Bible. I’ll provide more details about what this means and why it matters in a follow-on post: but i’m finding this to be a tremendously exciting opportunity. It’s also a tremendous engineering challenge that will take significant infrastructure and long-term, incremental development.

    I didn’t come to this task empty-handed: i’ve worked with concepts, standards, and tools from the Semantic Web for several years now. I also had the privilege of working with several former colleagues at BBN Technologies who were part of the development of OWL, the Web Ontology language, and expert in Semantic Web development. But there’s a big difference between a personal hobby project and a full-blown effort that will scale up by 100-1000 times and support future development of new Logos products and capabilities (and oh, by the way, it had better be industrial strength, maintainable, and extensible). While i’m learning a lot along the way, much of it comes from hands-on experience and trial-and-error learning: there isn’t a wealth of practical information out there about how to build large Semantic Web knowledgebases. So i wanted to share my experiences, to leave a trail for others who may come this way in the future. I also hope to hear from those whose experience may keep me from falling into the numerous traps that line the path: so please send me (that’s Sean) your email feedback, [myfirstname] at logos [the dot goes here] com with relevant comments and pointers.
    You should expect a lot of technical detail, but i plan to focus on practical implementation over theory. If you want to follow this series only, here’s the RSS feed for the Bible Knowledgebase category.