The title of this post is a deliberate take-off from a recent post at OpenBible.info entitled “What Are the Most Popular Verses in the Bible? It Depends Whom You Ask”. That post combines data from an earlier ESV analysis of search results, TopVerses.com, a BibleGateway (internal) study, and OpenBible data to present a list of 278 verses, each of which occurs in at least one source’s “top 100” list. It’s interesting to see both how much disparity there is (only 13% occur in at least three of the four lists) and how uneven the distribution is. As one commenter points out, it’s somewhat surprising that there are no verses from Revelation, and Old Testament narrative in particular is largely absent except for Genesis. John’s gospel has about as many popular verses as all the other gospels combined: there are only four verses from Mark (two of them from the often-questioned ending). Less surprisingly, perhaps, there are none from the shortest NT books (Philemon, Jude, 2-3 John). Altogether it’s an interesting study.
The larger question this raises for me is how we might come up with a comprehensive, global score for verses to indicate their importance for a variety of purposes. As the OpenBible post suggests, this depends on the source of the data, on your purpose, and on what you mean by “important” (which is certainly different from “popular”, though not completely unrelated).
One useful purpose is ranking verses to present them in response to searches: TopVerses.com is explicitly organized this way, as indicated in this news article about the site. They don’t go into much detail about how they gathered their data, though the scope (37M references scoured from the web) is impressive. But there’s a subtle disparity here: their data is based on counting mentions (citations) in published web pages, but their use case is prioritizing search results, and these may be out of sync. The fact that a given verse is frequently published on the web doesn’t necessarily mean it’s the one you want at the top of the list when you’re doing a word-based search, for example. The other three sources seem perhaps better matched to ranking search results, since they’re derived from searches themselves.
Another key hitch in these endeavors is how to handle range references, both in processing source data and (for search purposes) in handling queries. For example, many Bible dictionaries frequently reference ranges of verses, sometimes extensive, multi-chapter ones. If you’re going to count these, you need to think carefully about how you do the counting so you don’t introduce bias (or, better, you select the bias that’s best suited to your purposes).
For example, in the TopVerses.com ranking John 3.1 is #26, despite the rather plain descriptive content with little obvious spiritual impact.
Now there was a Pharisee, a man named Nicodemus who was a member of the Jewish ruling council. (John 3.1, NIV)
While i can’t be sure, i strongly suspect this high rank is an unintended consequence of disaggregating ranges and whole-chapter references from John 3. In fact, scanning top verses by chapter from John, the first verse in each chapter is very often the highest or second-highest ranked, and nearly always among the top ten. This probably says more about the counting methodology than the significance of those verses in particular. The Bible Gateway study focuses on ranges of no more than three verses explicitly to mitigate this problem.
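A minimal sketch of the suspected effect, using made-up citation data for John 3. How TopVerses.com actually counts isn't documented, so `count_verses`, the `max_len` cap, and the `citations` list are all illustrative assumptions rather than anyone's real method or data:

```python
from collections import Counter

def count_verses(ranges, max_len=None):
    """Credit one mention to every verse in each cited range.

    ranges is a list of (first_verse, last_verse) pairs within one chapter.
    Ranges longer than max_len (when given) are skipped, approximating a
    "short ranges only" policy.
    """
    counts = Counter()
    for first, last in ranges:
        verses = range(first, last + 1)
        if max_len is not None and len(verses) > max_len:
            continue
        counts.update(verses)
    return counts

# Hypothetical citations of John 3: a few short ones around verse 16,
# plus several long ranges that all happen to start at verse 1.
citations = [(16, 16), (16, 16), (16, 17), (1, 21), (1, 15), (1, 8), (1, 36)]

naive = count_verses(citations)              # expand every range
capped = count_verses(citations, max_len=3)  # ignore ranges over 3 verses

print(naive[1], naive[16])    # verse 1 rides along with every long range
print(capped[1], capped[16])  # with the cap, verse 1 drops out
```

In this toy data every count for verse 1 comes from a long range that merely starts there, yet under naive expansion it lands just behind verse 16; with the three-verse cap it receives no credit at all.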
Other Measures of Importance
Moving from popularity to importance, i can imagine several different factors that might be combined to produce a more general importance score:
- citation frequency (based on some corpus). In the TopVerses.com approach, these are web pages, which provides a very large set of observations. A number of other digital text collections would also suit this purpose, and even allow segmentation by genre: for example, you get a very different ranking from the Anchor Yale Bible Dictionary compared to Easton’s (and neither has John 3.16 at the top of the list). See below for more about this.
- search frequency, the basis for the other three sources in the OpenBible.info post. This could be refined further given data on follow-up activities. For example, depending on your application, verse searches whose results are then expanded into a chapter view or followed to the next verse might get a boost compared to those with no further action (this seems like a variant of “click through” rates used in search engine advertising).
- content analysis (context-independent): this could have several different flavors.
- word count: though John 11:35 gets mentioned more than you’d expect precisely because it’s the shortest verse in the (English) Bible, in general longer verses are more likely to be important. This could be refined further given a metric for important words (but now we’ve introduced a new problem: where does that data come from?), which could be used for weighting the counts.
- We could do even better if, instead of counting words, we count concepts (and weight them). Assuming we think the concept of HUMILITY is important, we’d want verses expressing that concept to rank more highly, regardless of whether they used a more common word like “humility”, or a less common one like “lowly”. Converting words to concepts is a difficult challenge, however.
- Connections to other data also affect importance. In some sense, every verse that reports words of Jesus is probably more important to a Christian than one whose importance is otherwise comparable, which is why we have the convention of printing Bibles with the words of Christ in red (a binary system for visualizing importance).
- We might even consider negative factors: a lower rank for unfamiliar, hard-to-pronounce names, or “taboo” words.
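One simple way to combine factors like these is a weighted average with an optional penalty for the negative factors. The sketch below is only an illustration: the function name `importance`, the factor names, the weights, and the sample values are all hypothetical, and it assumes each factor has already been normalized to the range 0 to 1.

```python
def importance(factors, weights, penalty=0.0):
    """Weighted average of factor scores, minus a penalty, clamped to [0, 1].

    factors maps factor names to scores assumed to lie in [0, 1];
    weights maps the same names to non-negative weights.
    Factors missing from a verse are treated as 0.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(w * factors.get(name, 0.0) for name, w in weights.items())
    return min(1.0, max(0.0, weighted / total_weight - penalty))

# Hypothetical weights and factor values for a single verse.
weights = {"citation": 0.4, "search": 0.4, "word_count": 0.2}
verse = {"citation": 0.9, "search": 0.8, "word_count": 0.5}

score = importance(verse, weights)                   # about 0.78
penalized = importance(verse, weights, penalty=0.1)  # about 0.68
```

Choosing the weights is where the purpose comes in: a search application might weight search frequency heavily, while a study tool might favor citation frequency.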
Unlike TopVerses.com, i don’t see a particular need to provide a unique rank for each verse. If each verse has a score (to simplify the math, a decimal between 0 and 1 is a common approach), you can simply pick the top n verses that fit your purpose, and then order any ties canonically.
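As a sketch of that selection step: the function below picks the top n verses by score and breaks ties in canonical order. Keying verses by a `(book_index, chapter, verse)` tuple, where `book_index` is the book's position in the canon, is an assumption made for illustration, and the scores are made up.

```python
def top_verses(scores, n):
    """Return the n highest-scoring verses, ordering ties canonically.

    scores maps (book_index, chapter, verse) tuples to a score in [0, 1].
    Sorting on (-score, reference) puts higher scores first and falls back
    to canonical order when scores are equal.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [ref for ref, _ in ranked[:n]]

# Hypothetical scores: John 3:16, Romans 8:28, Genesis 1:1, Psalm 23:1.
scores = {
    (43, 3, 16): 0.95,
    (45, 8, 28): 0.90,
    (1, 1, 1): 0.90,
    (19, 23, 1): 0.70,
}

best = top_verses(scores, 3)
# best == [(43, 3, 16), (1, 1, 1), (45, 8, 28)]
```

Genesis 1:1 and Romans 8:28 tie at 0.90 here, so Genesis comes first simply because it is earlier in the canon.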
Comparing Dictionary Reference Citations
I did a small experiment to compare the most frequent reference citations in seven Bible dictionaries that are incorporated in Logos’s software (so this is citation frequency, not search frequency). I extracted and counted all the references, and then aggregated the counts across all seven: the top 20 references are shown below, along with how many “votes” they received in the OpenBible.info list. In the case of whole chapter references (four of the top ten), i’ve indicated with yes/no whether any verse from that chapter occurs in the OpenBible list.
There’s relatively little overlap between the two lists: only seven of these are in the OpenBible list. Many of these make sense given the different purposes of reference works: for example, Is 61.1 is a key messianic text. The high rank for 2 Ki 15.29 is initially puzzling, but probably results from being commonly cited in discussions of the conquests of Tiglath-Pileser and the Babylonian exile. Overall, this is probably much too small a sample to show the correspondences: i presume we’d find much more overlap in the top few hundred.
| Reference | Aggregate Count | Count in OpenBible.info List |
|---|---|---|
| 2 Ki 15:29 | 165.2 | 0 |
| 1 Pe 2:9 | 126.3 | 0 |
| 1 Sa 1:1 | 121.5 | 0 |
- The dictionaries used were the Anchor Yale Bible Dictionary, Baker Encyclopedia of the Bible, Eerdmans Dictionary of the Bible, Eerdmans Bible Dictionary, International Standard Bible Encyclopedia (ISBE), New Bible Dictionary, and Tyndale Bible Dictionary.
- Like the OpenBible.info approach, i took range references with 3 or fewer verses and decomposed them into individual verses, splitting the counts (which is why the aggregate counts are floats rather than integers). Larger ranges were left atomic, which confuses the results further: for example, Ge 1:26 probably ought to be even higher, since the high-ranking chapter reference Ge 1 includes it.
- Some references are undercounted because this method distinguishes BHS and LXX references, but i doubt this materially affects the results.
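The counting rule in the second note can be sketched roughly as follows. This assumes that “splitting the counts” means each verse in a short range gets an equal fraction of one citation; the function `add_reference`, its parameters, and the string keys are hypothetical, not the code actually used for the table.

```python
from collections import Counter

def add_reference(counts, book, chapter, first, last, max_split=3):
    """Record one citation of book chapter:first-last.

    Ranges of max_split verses or fewer are decomposed, with each verse
    receiving an equal share of the single citation (an assumption about
    how the split works). Longer ranges are kept as one atomic key.
    """
    n = last - first + 1
    if n <= max_split:
        for verse in range(first, last + 1):
            counts[f"{book} {chapter}:{verse}"] += 1 / n
    else:
        counts[f"{book} {chapter}:{first}-{last}"] += 1

counts = Counter()
add_reference(counts, "Ge", 1, 26, 26)  # single verse: full credit
add_reference(counts, "Ge", 1, 26, 27)  # two verses: half credit each
add_reference(counts, "Ge", 1, 1, 31)   # long range: left atomic

# counts["Ge 1:26"] == 1.5, counts["Ge 1:27"] == 0.5, counts["Ge 1:1-31"] == 1
```

The last call shows the limitation mentioned above: the long range contributes nothing to Ge 1:26, even though it includes that verse.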
None of this is meant as criticism of the particular sites mentioned above. I strongly believe that any user-oriented, empirically-based data set is better than nothing, and in most endeavors like this, “the best data is more data”. * But with more data comes more complexity, and i’ve only scratched the surface here in considering some of the different factors.
The key point is this: if we want to measure something, we need to be clear up front about exactly what it is, and also what purpose we hope it will serve. I never stop being amazed at how often “obvious” approaches to data problems produce surprising results.
* In my recollection, this quote is attributed to Bob Mercer, a leading researcher in statistical language processing who was part of the IBM research group in the 1990s. I haven’t been able to verify a real source, however.