The Semantic Web as Data + Intelligence

Talking with Talis is rapidly becoming my favorite podcast source: Paul Miller has a lot of really interesting guests addressing topics at the intersection of libraries and the Semantic Web.

Today i listened to an interview with Dr. Jim Hendler, now at Rensselaer Polytechnic Institute, but previously at University of Maryland and a key figure in the establishment of OWL during his tenure at DARPA. My comments here are really just a rehash of some things he said much better, and with much more authority (given his history in the field) — but blame me, not him, for what i say below.

The concept of the Semantic Web brings together two different communities, along with their respective priorities and technologies. Many of the disagreements within what looks like a single community are just two sets of people talking about different things (but using similar terminology). The “semantic” part is mostly represented by the Artificial Intelligence community, with interests in careful ontology development, deep reasoning, theoretical correctness, and academic activities. The “web” community has been out there for more than a decade, building the World Wide Web with HTML and lots and lots of data, and is now looking for ways to make it more useful, connected, and extensible.

You can represent these two concerns as two axes on a graph, and many different endeavors tend strongly toward one side or the other, depending on whether they emphasize the “intelligence” dimension, or the “data” dimension. Just a few examples on the data side (that could be multiplied many times over):

  • Yahoo plans to start indexing RDFa content (i discussed this a bit in my post about Bibleref and RDFa). Because Yahoo is one of the major web players, this adds just a little more intelligence to a lot of data (potentially: users still have to create RDFa markup).
  • Freebase is harvesting data from Wikipedia and other sources, and then adding a modest amount of structured relations.
  • Talis has their own set of data from a long history of library applications.

On the “intelligence” side would be big ontology development efforts, and academics working on reasoning: Hendler also called out pharmaceutical companies as tending toward this dimension. Hendler’s own bet is that progress is more likely to come from data-side approaches than the hard-core intelligence side (and i think he’s right). He sees the combination of SPARQL and persistent identifiers as two recent developments that are likely to move the field ahead: these are things i’m looking at closely as well in Bible Knowledgebase development (more on the second one to come soon).

Countdown to BibleTech:2008

Things have been very quiet on Blogos for the last few weeks, as i’ve been cranking away on a prototype for my Zoomable Bible talk at BibleTech:2008. While i’ve always loved learning new things, over the last month i’ve been positively cramming on a multitude of totally new subjects to me:

  • programming in C# (i’ve been spoiled by Python)
  • Using Visual Studio as an IDE, including integration with MySQL databases
  • Basics of 2D graphics
  • Layout algorithms for treemaps (major kudos to the University of Maryland’s Human-Computer Interaction Lab for not only pioneering this area, but even providing open source implementations for people like me to learn from)
  • Using the excellent (but rich and hence challenging) Piccolo 2D toolkit for building zoomable user interfaces (also from the UMd HCIL group)
  • loading up a variety of Bible data (since visualization requires something to visualize!)

I also have a separate presentation about Bibleref: a Microformat for Bible References, and some related recent developments at Logos that will help make the world of on-line information about the Bible more searchable and usable than ever before. You’ll learn more at the conference about some of our plans in this area.

There will also be time Friday night for “birds of a feather” sessions to informally gather people around topics of common interest. I’m hoping to bring together people to talk about developing common naming conventions for people and places in the Bible. If you’ve been following my posts on the Bible Knowledgebase, you know an essential part of this work is simply identifying and disambiguating named people and places: which Judah, or Zechariah, or Gaius, or Jabneel, is which? I think some simple agreement on identifiers, and principles for constructing them, would make sharing such data much easier, and Logos is prepared to start by sharing our own sets of identifiers. So be sure to find me there if you’d like to talk more about how to make this happen. (By the way, i was tickled to see that my post on the most important person in the Bible was #7 in Logos’s list of the Top Ten Blog Posts for 2007 (most viewed)).

As i told one of the speakers in an email earlier today, i’m feeling a little giddy about what a great conference this promises to be. BibleTech, and the interesting and diverse group of people who are coming, really encompasses all the things that brought me to Logos in the first place, and that define my current professional endeavors as well as my personal interests.

It’s not too late! Come join us this Friday and Saturday at the SeaTac Hilton in Seattle (registration details).

Bible Knowledgebase Write-up at SemanticReport.com

SemanticReport is a relatively new digital newsletter about the commercial application of semantic technologies like RDF and OWL. Following on my presentation last May at Semantic Technology 2007, they asked me to write up a brief description of the Bible Knowledgebase (BK) project at Logos (other Blogos posts about BK).

They’ve just released their December edition, which includes my article on Building the Semantic Bible.

Hard-core semantic geeks will be interested to know that:

  • they’re annotating their articles with semantic meta-data, which you can view here in what looks like a Simile-derived viewer. Each article also gets linked to RDF data. However …
  • in what may be a telling point about the whole Semantic Web enterprise, their meta-data production seems to lag a bit. So there isn’t any for my article yet :-/

Libronix Links as Knowledge Resources

Wikipedia has proven to be a revolutionary development in online information systems, through features like user-produced content, the breadth of the subjects it addresses, the ability to rapidly update articles, and too many others to list. But one benefit that’s perhaps more subtle is the way that Wikipedia provides a standard set of targets for hyperlinked text.

I use this all the time for my blog posts: as in this recent example, rather than digressing to explain terms like Python and XPath, i just link these terms to their associated Wikipedia articles. Those who know what those terms mean don’t need to follow the links: those who don’t can go find out, if they choose to, or just plow ahead if they don’t want to bother. This hypertext writing style has become much more common in the last decade (thanks in part to the popularity of blogs), and has even spawned new approaches to written communication, like wikis and hypertext fiction.

Once you’ve begun writing (and reading!) hypertext like this, you don’t want to go back: it’s so much more useful to readers to have the additional resources integrated directly into the text. This leads naturally to hyperlinking other kinds of text: for example, i don’t ever write a Bible reference like Luke 11:2-4 without a hyperlink to the verse itself, usually in the English Standard Version (and if you write Bible references like this, you should go look at the bibleref page to see the right way to create these links!).

That’s all background to a realization i had this morning. I was chatting with Dr. Peter Flint, who directs the Dead Sea Scrolls Institute at Trinity Western University, about my work on the Bible Knowledgebase. Dr. Flint gave an excellent talk in the Logos Lecture series last night about his work on the scrolls, and he visited the office today to talk about other things we’re doing. He was reflecting on how, as a professor, he often provides Wikipedia links for his students as additional on-line resources, and wondered whether the Bible Knowledgebase might someday function that way.

In fact, you can use Libronix that way now, with a little knowledge about keylinks and resources. Here are some sample links (which naturally require that you have Libronix installed locally):

  • Biblical People data for Simon Peter (as opposed to Simeon from Luke 2:25). Since these Biblical People pages are disambiguated, and include links to Bible passages and related family members, they’re useful “hubs” for starting a study on an individual.
    • Link targets as text: libronixdls://report|name=ReportBiblicalPeople|page=ID%3APeter-1 and libronixdls://report|name=ReportBiblicalPeople|page=ID%3ASimeon2-1
  • Look up the English term “grace” in your default English dictionary (check your settings under Tools > Options > Keylink, for Data Type = English: by default, it’s probably the Merriam-Webster dictionary).
    • Link text: libronixdls://keylink|ref=[en]English:Grace
  • Look up the English term “grace” in the New Bible Dictionary.
    • Link text: libronixdls://keylink|ref=[en]English:Grace|res=LLS:NBD
  • and of course you can link to Bible passages, like this Libronix link to Luke 2:25 in the NIV.
    • Link text: libronixdls://keylink|ref=[en]bible:Luke.2.25|res=LLS:NIV
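If you're generating a lot of these links (say, from a script that builds a web page), the patterns above are regular enough to assemble in code. Here's a minimal Python sketch, inferred only from the sample links in this post: the function names `keylink` and `people_report` are made up for illustration, and this is not an official Libronix API or a complete description of the link syntax.

```python
from urllib.parse import quote


def keylink(data_type, ref, resource=None, lang="en"):
    """Build a keylink URL shaped like the samples in this post.

    data_type and ref mirror the "English:Grace" / "bible:Luke.2.25" part;
    resource is the optional short name after "LLS:" (e.g. "NBD", "NIV").
    """
    url = f"libronixdls://keylink|ref=[{lang}]{data_type}:{ref}"
    if resource is not None:
        url += f"|res=LLS:{resource}"
    return url


def people_report(person_id):
    """Build a Biblical People report URL for an ID such as "Peter-1"."""
    # The sample links percent-encode the colon in "ID:Peter-1" as %3A.
    page = quote(f"ID:{person_id}", safe="")
    return f"libronixdls://report|name=ReportBiblicalPeople|page={page}"


grace_default = keylink("English", "Grace")
grace_nbd = keylink("English", "Grace", resource="NBD")
luke_niv = keylink("bible", "Luke.2.25", resource="NIV")
peter = people_report("Peter-1")
```

The four values at the end reproduce the link texts listed above, which is about all this sketch can promise: other data types or report names may need different handling.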

If you’re publishing your own content (like a blog, web page, or wiki) and your readers might be Libronix users, this can make it very easy for them to get to the data you have in mind, enriching the value of your content, and making your readers happier at the same time! You can put Libronix links in MS Word documents as well. See this page on the Logos blog for more information about hyperlinks to Libronix, and about using a double-link style to combine web links with Libronix links. (note: I use “libronixdls://” instead of simply “libronixdls:” because otherwise WordPress mangles the reference: your mileage may vary, but in general either should work.)

(Disclaimer: there’s significant controversy in the academic community about the appropriate role of secondary sources like Wikipedia. While i’m not trying to open that worm-filled can, there’s no question that, as background resources, Wikipedia and other on-line content have changed the nature of information and education.)


Update (12/7/2007): Phil Gons of Logos has taken this idea further and spelled out lots of cool things you can do with it on the Logos blog.

Organizing Bible Place Names

(Post 3 in a continuing series on building the Bible Knowledgebase: this category, which has an RSS feed here, will track future posts on this subject.)

If you think about the world of the Bible, and limit the discussion to concrete entities in the physical world (not because theology isn’t important, but because it’s much harder to nail down), there’s no question that people are the most important category of things that are discussed therein. The entire history of salvation is defined in terms of God’s interaction with people, from Adam to Noah to Abraham to Moses, continuing on through David up to Jesus and his early followers. Logos Bible Software already ships with a great deal of semantically-organized information about individuals in its Biblical People feature (Tools > Bible Data > Biblical People). Biblical People lets you visualize the family relationships for every named person in the Bible, as well as organizing references to them, and providing other attributes like their roles or occupations. Since joining Logos, i’ve been integrating the previous work i’d done on named individuals in the New Testament to enlarge this knowledgebase with a richer set of interpersonal relationships, and there are still many more relationships we can add.

But now that we’ve got a great start on people, the next big category in organizing semantic information is clearly places. Some of the challenges here are like those for people: multiple places can share the same name (for example, there are two distinct places called Bethany in the New Testament), and multiple names can be used to refer to the same place.

Places have different attributes than people, of course, chief among them a physical location, represented by latitude and longitude. There are some unique challenges as well:

  • Because places last longer than people do, they’re more prone to changing names.
  • The nature (more formally, the semantic type) of a place can also change over time: villages grow into towns which grow into cities and become capitals.
  • Places can be either supernaturally located (obviously these don’t have latitude and longitude), or metaphorical. Zion is a well-known instance of the latter: sometimes it refers to a physical location, but in other cases it designates a spiritual place (“But you have come to Mount Zion and to the city of the living God, the heavenly Jerusalem …”, Heb 12:22). It may also signify the role of Jerusalem as the religious capital of Israel.

“Jabneel” is a random example from the book of Joshua that shows how complex this process can be. First, name ambiguity: there are two different places that go by this name, one in Josh 15:11, and a different one from Josh 19:33. The Anchor Bible Dictionary states (without further explanation) that the first one is “probably the same Philistine town (Heb yabnēh) conquered by Uzziah, king of Judah”, which is called Jabneh in 2 Chr 26:6. So there is name variation as well as ambiguity. (As often happens with this kind of task, Anchor puts most of its content in the sub-entry under Jabneel, while other resources have “See Jabneh” and put their content there.)

Though a few of the key places in the Bible either aren’t named or aren’t locatable (like the Garden of Eden in Gen.2.8), there’s still lots of useful information that’s been around a long time, but never organized into a machine-readable form. As with people, access to rich details about the places of the Bible is an important background resource for readers.

I’ve posted previously about the list of place names produced by OpenBible.info, which is the most comprehensive one that i know of that’s freely available. Starting from this list, as well as some other internally developed resources, we’re in the process of creating a master database of Biblical places that will include

  • Mappings between names and (physical) places (using an interesting representation that i’ll discuss in a future post)
  • A complete set of Biblical references
  • Types (cities, mountains, rivers, etc.)
  • Latitude and longitude data, so any place can be mapped and distances between places can be computed
  • Categorization of place names into different historical periods
  • Part-whole relationships: for example, the fact that Jerusalem is a part of Judea (in technical parlance, a kind of meronymy)
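To make the list above a little more concrete, here's a rough Python sketch of what one record in such a database might look like, plus the distance computation that latitude and longitude make possible. The field names, the identifiers, and the coordinates (approximate modern values) are illustrative assumptions, not the actual schema or data we're building.

```python
from dataclasses import dataclass, field
from math import asin, cos, radians, sin, sqrt
from typing import List, Optional

EARTH_RADIUS_KM = 6371.0  # mean radius, good enough for an estimate


@dataclass
class Place:
    place_id: str                  # unique identifier (hypothetical format)
    names: List[str]               # every name used for this physical place
    place_type: str                # city, mountain, river, ...
    lat: Optional[float] = None    # None for places that can't be located
    lon: Optional[float] = None
    references: List[str] = field(default_factory=list)
    periods: List[str] = field(default_factory=list)
    part_of: Optional[str] = None  # part-whole link to another place_id


def distance_km(a: Place, b: Place) -> float:
    """Great-circle distance between two located places (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


# Rough coordinates, for illustration only.
jerusalem = Place("Jerusalem.1", ["Jerusalem", "Zion"], "city",
                  lat=31.78, lon=35.22, part_of="Judea.1")
bethlehem = Place("Bethlehem.1", ["Bethlehem"], "city",
                  lat=31.70, lon=35.20, part_of="Judea.1")
print(round(distance_km(jerusalem, bethlehem), 1), "km")
```

Keeping `lat` and `lon` optional matters: unlocatable or metaphorical places still need a record, they just can't be mapped.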

Each physical place will have a unique identifier (akin to a URI), and once the data is complete we’ll turn it into RDF and incorporate it into the emerging semantic database we call the Bible Knowledgebase. We don’t have a timeline yet for when this data will get turned into a feature and included in the Logos product: that will take some time, and will likely unfold incrementally. But i’m excited about starting this next major step in the Bible Knowledgebase. An added bonus is that my wife Donna is working on the data with me!

Bible Knowledgebase: What, Why, How

(Post 2 in a series on building the Bible Knowledgebase, unfortunately delayed by a plague of web hosting problems)

The What: BK (Bible Knowledgebase) is reference information about the world of the Bible using Semantic Web standards and tools. The Semantic Web refers to moving from a world of networked pages displayed for humans (HTML, the vast majority of the current World Wide Web), to semantically-characterized information that is machine-readable (and therefore supports a variety of uses like search, browsing, visualization, etc.). Tim Berners-Lee likes to describe it as moving from a web of documents (meant to be read by humans) to a web of data (meant to be read by computers).

Initially, the scope is every named thing in the Bible (people and places are the bulk of the cases, but there are also languages, ethnic groups, holidays, and numerous others). Eventually i hope to extend this to unnamed but described entities: for example, the Samaritan woman of John 4 is never named, but we know her ethnicity, where she lived, some people she interacted with, and other facts.

The Why: the Bible Knowledgebase will support

  • Knowledge exploration and discovery: just as hyperlinked web pages lead you to new information, linked facts about individuals will lead to other individuals or resources about them.
  • Smart (semantic) indexing: for a given passage, you’ll know which John/Mary/James is referred to, not just the collection of individuals who share that name. Searching will provide more precisely targeted results, because reference material will be disambiguated.
  • Visualization: rich data sets support graphic displays that give an overview of information that would otherwise be scattered across numerous different passages.

The How:

  • I’ve designed an OWL ontology that captures an initial set of entity types and relationships between them
  • Information from Logos’ Biblical People feature and New Testament Names has been merged into the initial dataset
  • Both the ontology and the instance data will be extended to incorporate additional information. There’s no principled stopping point, but i expect to grow BK from its current size of ~100k RDF triples to perhaps 100-1000 times that size.

In the (perhaps unlikely) event that you’re in the intersection between

  • Blogos readers, and
  • attendees at the Semantic Technology Conference in San Jose next week

i’ll be giving a presentation about this work Thursday morning.

The Importance of Being Ambiguous

(with apologies to Oscar Wilde)

My colleague Rick Brannan has done some recent posts on the Logos Blog about some ambiguities in James 4:5-6. Given the work i’m doing on knowledge representation applied to Bible information, his latest post got me thinking about the general problem, and what best practices to use for representing ambiguity.

Ambiguity is a bad thing, and we should always try to get rid of it, right? Well, it’s always nice to have a clear understanding of things, and a lot of information technology, from language compilers to air traffic controllers to traffic lights, only functions well in its absence (though we’d lose a lot of our humor by banishing ambiguity altogether). But i’d claim the following (which i’ll glorify With Capital Letters) as the Best Practices for Representing Ambiguity:

  • sort out and resolve any ambiguity when you can
  • don’t create any spurious ambiguity when you can avoid it
  • where there’s genuine ambiguity, preserve it

It’s this last practice that i want to address here: don’t guess, don’t arbitrarily pick one of several alternatives, instead find a way to represent ambiguity when it’s real.

One major knowledge representation project that provides a good model here is the Penn Treebank, a million-word corpus of English annotated with grammatical analysis. Early on, the creators recognized that syntactic theories are all over the map, and by tying the Treebank too closely to one theory or its conclusions, they’d risk being ignored by the others. So they adopted some basic practices to ensure a theory-neutral representation, including tags specifically designed to indicate “there’s an ambiguity here, and we’re not resolving it”.

Things are never black-and-white in the absence of complete information: how to decide the tricky cases? A fair rule of thumb would be to imagine you and several equally brilliant friends reviewing the case (imagine an even number of friends, so you can be the tiebreaker). Would the decision about how to resolve a given ambiguity be unanimous? Then go with the consensus and consider it settled (even though you might imagine some nameless Bible scholar somewhere disagreeing about it). Would the vote be close? Then see if there’s a way to capture the other possibility, rather than forcing a decision.

Here’s one way this works out in practice for creating a detailed knowledgebase about people in the Bible. One of the fundamental differences between semantic search and word-based search is that you have the chance to resolve ambiguity by deciding which references to people are the same, and which are different. Take Gaius as an example: this name occurs 5 times in the New Testament, twice in narrative passages in Acts (Acts.19.29 and Acts.20.4), Rom.16.23, 1Cor.1.14, and in the opening address of 3John.1.1. Assigning the mentions in Romans and 1Cor to the same person is pretty solid (this implies Paul was in Corinth when he wrote the letter to the Romans). But otherwise, the contexts leave open the possibility that there might be four different individuals here that just happen to share the same name (which was a common one). For example, Gaius in Acts 19 is Macedonian, but Acts 20 says Gaius of Derbe. While it’s possible some of these other mentions are the same person (and it’s always interesting to speculate on possible connections), in the absence of any solid evidence the best practice is to represent them as separate individuals.

So i’m modeling this in the Bible Knowledgebase as follows:

  • when different mentions seem clearly to be the same individual (e.g. Simon Peter and Cephas), standardize on a single URI for all their properties
  • when there are clearly different individuals, give them each a unique URI
  • when they might be the same but the evidence is weak (for example, the other Gaius mentions), treat them as different individuals (with unique URIs), and then use the property possiblySameAs to represent this hypothetical linkage.

It’s much easier to join things later if you decide they’re the same: the OWL property owl:sameAs exists to accomplish that with a single assertion that two entities are definitely the same. But splitting involves sorting out all their properties and reassigning them to the correct URI, which can be tricky business. So when in doubt, they’re left separate, with indicators where they might be the same as another.
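Here's a small Python sketch of why joining is the cheap direction. It uses plain tuples as stand-ins for RDF triples rather than a real RDF library, and the `bk:` prefix, the `bk:mentionedIn` property, and the `Gaius.N` identifiers are hypothetical names for illustration, not the actual Bible Knowledgebase vocabulary.

```python
OWL_SAME_AS = "owl:sameAs"
POSSIBLY_SAME_AS = "bk:possiblySameAs"   # hypothetical prefix
MENTIONED_IN = "bk:mentionedIn"          # hypothetical property

# When in doubt, keep individuals separate and record the weak link.
triples = {
    ("bk:Gaius.1", MENTIONED_IN, "Acts.19.29"),
    ("bk:Gaius.2", MENTIONED_IN, "Acts.20.4"),
    ("bk:Gaius.3", MENTIONED_IN, "Rom.16.23"),
    ("bk:Gaius.3", MENTIONED_IN, "1Cor.1.14"),
    ("bk:Gaius.4", MENTIONED_IN, "3John.1.1"),
    ("bk:Gaius.1", POSSIBLY_SAME_AS, "bk:Gaius.2"),
}


def same_individuals(triples, uri):
    """All URIs asserted identical to uri via owl:sameAs, in either direction."""
    group, frontier = {uri}, [uri]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p != OWL_SAME_AS:
                continue
            for a, b in ((s, o), (o, s)):
                if a == current and b not in group:
                    group.add(b)
                    frontier.append(b)
    return group


def mentions(triples, uri):
    """Passages mentioning uri or anything asserted to be the same individual."""
    group = same_individuals(triples, uri)
    return sorted(o for s, p, o in triples if s in group and p == MENTIONED_IN)


before = mentions(triples, "bk:Gaius.1")   # possiblySameAs merges nothing
# Joining later is one added assertion; no properties need reassigning.
joined = triples | {("bk:Gaius.1", OWL_SAME_AS, "bk:Gaius.2")}
after = mentions(joined, "bk:Gaius.1")
```

Going the other way (splitting `bk:Gaius.3` into two people, say) would mean deciding, triple by triple, which URI each property belongs to, which is exactly the tricky business described above.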

There’s another extreme to this general principle, which is to be so afraid of making any commitments that you leave everything ambiguous. Obviously this doesn’t work well either (you wind up saying nothing), and deciding which is which is as much art as science. But as a generalization, i’d always prefer to err on the side of conservatism and joining later rather than risk losing ambiguity when it’s real.

Name Weights for Biblical Characters, Take 3

(originally posted 2007-04-02, but then a victim of repeated web site hosting problems: i’m trying again …)
Looking further at the numbers i previously discussed for estimating the ranked importance of Biblical characters by how often and where they’re mentioned, there’s a refinement of the dispersion factor that i like better. It came from comparing the rank of Ishmael.1 (Abraham’s son by Hagar) to Ishmael.2 (who assassinated Gedaliah, the Babylonian-appointed ruler of Judah discussed in 2 Kings 25 and Jer 40-41). In my first ranking, Ishmael.2 (who i didn’t even remember) was ranked slightly higher than Ishmael.1, contrary to my intuitions (and those of every Bible dictionary i’ve checked, measured by the number of sentences used to describe each).

Quantifying your ideas gives you a way to measure how they match your intuitions, and, when they don’t, think about why. In this case, it was immediately obvious: though Ishmael.2 is mentioned a few more times, those mentions are highly concentrated, in a total of 3 chapters (across 2 books). Ishmael.1 is also mentioned in two books, but in 6 different chapters. By incorporating the number of distinct chapters a name occurs in (just a more fine-grained measure of dispersion), their rank comes out more like what i’d expect. Specifically, given weights of

  • .6 for frequency (as before)
  • .2 for chapter dispersion
  • .2 for book dispersion (so the total dispersion weight is still .4, just refined a bit)

Ishmael.1 comes out at #257, versus #285 for Ishmael.2. Here’s the top 50 chart using this metric:

Top 50 Biblical Characters by Frequency and Dispersion (medium size), Take 3
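For those who'd rather see the arithmetic, here's a minimal Python sketch of the refined metric as a weighted sum of normalized counts. The maxima for verses (1370) and books (31) come from the original post; the chapter maximum and the verse counts in the example are placeholders invented for illustration, since the real figures aren't given here. Only the chapter and book counts (6 versus 3 chapters, 2 books each) follow the Ishmael comparison above.

```python
def composite_weight(counts, maxima, weights):
    """Weighted sum of counts, each normalized by its maximum value."""
    return sum(weights[k] * counts[k] / maxima[k] for k in weights)


TAKE_1 = {"verses": 0.6, "books": 0.4}                  # original weights
TAKE_3 = {"verses": 0.6, "chapters": 0.2, "books": 0.2}  # with chapter dispersion

# 1370 and 31 are from the original post; 200 is a made-up placeholder.
maxima = {"verses": 1370, "chapters": 200, "books": 31}

# Verse counts are invented: "concentrated" is mentioned a few more times,
# but in fewer chapters.
spread_out = {"verses": 20, "chapters": 6, "books": 2}
concentrated = {"verses": 23, "chapters": 3, "books": 2}

for label, weights in (("take 1", TAKE_1), ("take 3", TAKE_3)):
    a = composite_weight(spread_out, maxima, weights)
    b = composite_weight(concentrated, maxima, weights)
    print(label, "spread_out ranks higher:", a > b)
```

With these illustrative numbers the more concentrated individual wins under the original weights and loses once chapter dispersion is included, which is the same kind of reversal described above.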

Here’s a graphic to show more clearly how the rankings change with this metric. Red markers above the blue line are names that have moved up in rank with the revised metric: for example, John the Baptist (John.1) moves from #50 to #30, which seems appropriate. Those below the line are ranked less highly under the new metric (e.g. Jesse.1, who moves from #18 down to #24).

Biblical People weights, with and without chapter dispersion (medium size)

Some other factors that might improve the estimate even further (and remember, this is just an estimate):

  • As suggested above, external sources (like Bible dictionaries) are a rich and quantifiable source of judgments about importance: the more words or sentences used to describe an individual, the more important they’re likely to be. By using several dictionaries, you’re not held captive to the biases of an individual work or editorial slant. The key feature here is making the connection between the described individual (often in a numbered paragraph) and the Biblical character: but given a map from individuals to passages (which we have), that ought to be possible with a bit of programming at better than 90% accuracy.
  • Though the whole of Scripture is authoritative and inspired, there’s a sense in which certain sections are broader in their implications. For example, anyone mentioned in the first chapters of Genesis should probably get an extra measure of importance: these are the foundational stories of Hebrew and Christian history (and this is another way in which Ishmael.1 is surely the most important one).

Postscript: after drafting the above, but before publishing (making and uploading the charts is still a bit painful), i saw this post at OpenBible.info, suggesting some alternative approaches (thanks for playing!). The first (rank some chapters higher than others) is similar in spirit to my second suggestion above: we both agree, as i’m sure do others, that some parts of the Bible ought to add more weight to this metric. The second suggestion there also proposes a valuable refinement, using association with important people (approximated by co-occurrence in a chapter) to lend importance. I’ll look into incorporating that figure as well.

Name Weights for Biblical Characters, Take 2

I had a few goofs in the code that computed the name weights in this post, including a parenthesis typo that caused an operator precedence problem and (egad!) having the weight factors for frequency and dispersion reversed.

Here’s a corrected graph, with a larger version linked behind it.

Top 50 Biblical Characters by Frequency and Dispersion (medium size), Take 2

Jeremiah and Joab have each climbed up 5 ranks, their (relative) frequency and dispersion being somewhat different from their neighbors. Otherwise, most of the top 50 are within one or two places of their previous (incorrect) rank.

I’ve left the original there to maintain my humility :-/

Name Weights for Biblical Characters

[Update (3/29/2007): there’s an important correction here.]

Too many projects chasing too little time means you have to prioritize (in fact, if you don’t have to prioritize, you ought to question whether you’re doing anything worthwhile at all!). I’ve been spending a lot of time lately looking at the Biblical People data in Logos Bible Software, getting ready to incorporate the work i’ve done on New Testament Names, and then to go beyond that to the Bible Knowledgebase.
While we’ve got a lot of interesting data about characters in the Bible, there’s still not as much as i’d like: so how to prioritize additional development? Well, one practical answer is to assign a weight to each one, then start at the top and work your way down, and stop when it’s time to stop (typically because money, enthusiasm, or both are exhausted). While this isn’t a perfect approach to resource development, it’s a pretty good one, and pretty good is often good enough when you’re in new territory anyway.

Since we’ve got the data that maps people to the passages that refer to them, it’s pretty easy to go through and count. Note an important detail: i really do mean references to people, not just strings. It’s not enough to find the string “Judah” in a verse: you want to know when it’s Judah the person, as opposed to a cover term for Israel or the Southern Kingdom. For hard cases like Judah, the only way to know that is to go through verse by verse and decide. For many other cases, while the string is only used to refer to people, there are numerous people with the same name. Zechariah is the Big Daddy here: there are 30 distinct ones in our database (this author found 31, but i haven’t gone through them to determine where the discrepancy lies). So just counting occurrences of “Zechariah” doesn’t get it right either: you have to count all 30 cases separately (for those who like skipping to the end of murder mysteries, the prophet Zechariah, whose prophecies are recorded in the book of the same name, gets mentioned the most). You also have to know which names refer to the same person: Simon is Peter is Cephas, and any instance of those names (that refers to Jesus’ disciple, as opposed to one of the other Simons) counts. So you need some real data to be able to do a reasonable job with this computation.

There are a lot of different ways you could count: here’s one. Let frequency be a count of the number of verses that mention a given individual (only counting one for verses like Luke 22:31, “Simon, Simon, Satan has desired to sift you like wheat”, which shouldn’t really count as two observations of Simon’s significance as a Biblical character). Let dispersion be the number of books of the Bible that mention the individual. The intuition here is that, for two individuals with the same frequency, the one that’s mentioned in more books is probably more important, broadly speaking. Normalize each of these by their maximum values (max frequency is 1370, max # of books is 31) just to scale things a little more nicely. Then assign a weight to each of these factors (i used 0.6 for frequency and 0.4 for dispersion) and combine them to get a number between 0 and 1.
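As a rough sketch of that computation in Python (not the code actually used, which isn't shown here): the input is assumed to be a list of disambiguated mentions, one `(person_id, book, chapter, verse)` tuple per mention, and the person identifiers in the example are made up. The maxima are computed from whatever data is passed in, so on the full data set they'd come out as the 1370 and 31 quoted above.

```python
from collections import defaultdict


def name_weights(mentions, w_freq=0.6, w_disp=0.4):
    """Score each person by normalized verse frequency and book dispersion.

    mentions: iterable of (person_id, book, chapter, verse) tuples, already
    resolved to people rather than name strings.
    """
    verses = defaultdict(set)   # distinct verses per person
    books = defaultdict(set)    # distinct books per person
    for person_id, book, chapter, verse in mentions:
        # A set counts a verse once, even for "Simon, Simon" in Luke 22:31.
        verses[person_id].add((book, chapter, verse))
        books[person_id].add(book)
    if not verses:
        return {}
    max_freq = max(len(v) for v in verses.values())
    max_books = max(len(b) for b in books.values())
    return {
        pid: w_freq * len(verses[pid]) / max_freq
             + w_disp * len(books[pid]) / max_books
        for pid in verses
    }


# Tiny illustrative sample with hypothetical identifiers.
sample = [
    ("Peter.1", "Luke", 22, 31),
    ("Peter.1", "Luke", 22, 31),   # second "Simon" in the same verse
    ("Peter.1", "John", 1, 42),    # Simon, Cephas, and Peter are one person
    ("Simeon.2", "Luke", 2, 25),
]
weights = name_weights(sample)
ranked = sorted(weights, key=weights.get, reverse=True)
```

The hard part isn't this function, of course: it's producing the disambiguated `mentions` list in the first place.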

Here’s what the graph looks like for the top 50 (there are a total of 2987 men and women), ranked by the composite metric (the green line) that combines frequency (the blue line) and dispersion (the red line). The image is linked to a larger version where you can actually read the names.

Top 50 Biblical Characters by Frequency and Dispersion

While the top names (Jesus, David, Moses, Jacob, Abraham) are no surprise, there are some interesting observations to make from this. First, the composite metric really does change the rankings: Levi is #14 by this method, but #53 if you only ranked by frequency. Likewise, Shaul (King Saul) would be #52 if you only ranked by dispersion, because he’s mentioned in just a few books: but he’s clearly one of the most important characters in those books, and so it seems fitting that incorporating frequency boosts him up to #15 in the composite metric rank. You can see graphically where the red and blue lines approach each other (the cases where frequency and dispersion are more equal), and where they’re farthest apart (the cases where they’re most skewed: Judah’s a good example). Back to the previous point about counting genuine person name instances versus strings: only 99 of the approximately 780 occurrences of “Judah” actually refer to Jacob and Leah’s son, so counting strings would be pretty misleading.

Since names, like many linguistic phenomena, typically follow a Zipfian Distribution (sometimes called a “long tail” or power law distribution), it’s no surprise that the majority (1634 of the 2987) of these names occur exactly once in the Bible, and the 59 most frequent names account for about half of all the name mentions in the Bible. So clearly these top names deserve much more attention than the long tail.

Important disclaimer: i’m not making any claims here about theological or historical importance. That’s a subjective matter, and you’d get different answers depending on your perspective. For example, John the Baptist doesn’t even make the top 50, but by Jesus’ own words in Luke 7:28 “among those born of women none is greater than John.” (ESV) So clearly his importance isn’t measured particularly well by this approach. But as a general approximation to how important different names are across the Bible, this isn’t a bad start. To be completely thorough, you’d also want to count pronoun (“she”) and descriptive (“the woman”) references: but we don’t have data for those yet.