Matt Dabbs has an interesting series of posts about using Google’s blog search to determine how frequently different chapters of the Bible are referred to by bloggers, starting with the most blogged Scriptures (with some follow-ups on the least blogged Scriptures).

This data would be really interesting to nail down, but (ever the data quibbler) i have some questions about how this works. While i don’t know the details of Matt’s methodology, i expect some typical keyword search problems take their toll here too:

  • There are multiple ways to write a reference (“John 3”, “Jn 3”): a query for just one form misses the others, which reduces recall.
  • Some matches for a given reference may not actually be Scripture references: for example, the 12th match i found for “John 3” is actually a roster for a Civil War infantry regiment containing the phrase “Adams, John: 3/9/1864”. It’s only one case, but i wonder how many more are lurking.
  • A similar problem occurs with Scripture references themselves: “John 3” also matches “1 John 3”! I wonder if that helps account for the popularity of John 1, 2, and 3, all of which made the top 15. In these cases, you could correct the figure by subtracting the count for “1 John 3” from the count for “John 3”.
  • I wondered why only John’s and Matthew’s Gospels made the top 15, but queries for Mark (or Mk) return results full of non-Biblical acronyms and other misses, and i’ll bet Luke does too.
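Two of the problems above are mechanical enough to sketch in code: expanding abbreviated book names so variant forms count toward the same reference, and subtracting hits for longer references (like “1 John 3”) that a substring query for “John 3” would sweep up. This is only an illustration, not Matt’s actual method — the abbreviation table and the hit counts below are made up for the example.

```python
import re

# Hypothetical abbreviation table; a real one would cover all 66 books.
ABBREVIATIONS = {"Jn": "John", "Mt": "Matthew", "Mk": "Mark", "Lk": "Luke"}

def normalize(ref):
    """Expand a short book abbreviation ('Jn 3') to its full form ('John 3')."""
    m = re.match(r"(\d?)\s*(\w+)\s+(\d+)", ref.strip())
    if not m:
        return ref
    prefix, book, chapter = m.groups()
    book = ABBREVIATIONS.get(book, book)
    return f"{prefix} {book} {chapter}".strip()

def corrected_count(reference, counts):
    """Subtract hits for longer references that end with this one,
    since e.g. any search for 'John 3' also matches '1 John 3'."""
    total = counts.get(reference, 0)
    for other, hits in counts.items():
        if other != reference and other.endswith(reference):
            total -= hits
    return total

# Hypothetical hit counts, not real search results.
raw = {"John 3": 5000, "1 John 3": 800, "2 John 3": 40}
print(normalize("Jn 3"))               # John 3
print(corrected_count("John 3", raw))  # 5000 - 800 - 40 = 4160
```

The suffix test is crude — it would also subtract a book whose name merely ends with another’s — but for chapter references in this format the only collisions are the numbered epistles, which is exactly the case we want to correct.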

None of this is to put down Matt’s efforts: even noisy data is more instructive than silence. But this kind of counting is tricky business. The ESV blog discussed this topic a while back, based on Bible searches on their site.
Given the full text of all these posts, i’ll bet the vocabulary is distinct enough that a statistical text classifier could be trained to determine with high reliability which posts actually refer to Biblical discussion and which don’t.
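To make the classifier idea concrete, here is a toy multinomial Naive Bayes sketch, one common statistical text classifier. The training snippets are entirely made up for illustration; a real system would train on a sizable corpus of actual blog posts and would likely use an established library rather than this minimal implementation.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes text classifier with add-one smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)   # per-label word frequencies
        self.label_counts = Counter(labels)       # per-label document counts
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        scores = {}
        for label in self.label_counts:
            total = sum(self.word_counts[label].values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.label_counts[label] / sum(self.label_counts.values()))
            for word in doc.lower().split():
                score += math.log((self.word_counts[label][word] + 1)
                                  / (total + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

# Made-up training snippets, purely illustrative.
docs = [
    "John 3 teaches that God so loved the world",
    "faith grace salvation gospel sermon scripture",
    "Adams John 3 9 1864 infantry regiment roster",
    "civil war soldiers enlisted company muster",
]
labels = ["bible", "bible", "other", "other"]

clf = NaiveBayes().fit(docs, labels)
print(clf.predict("sermon on the gospel of John"))   # bible
print(clf.predict("infantry regiment muster roster"))  # other
```

Even with a handful of training snippets, the distinctive vocabulary of each domain does the work — which is exactly the intuition above: religious-discussion posts and Civil War rosters simply don’t use the same words.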