(with apologies to Oscar Wilde)
My colleague Rick Brannan has recently written several posts on the Logos Blog about ambiguities in James 4:5-6. Given the work i’m doing on knowledge representation applied to Bible information, his latest post got me thinking about the general problem, and what best practices to use for representing ambiguity.
Ambiguity is a bad thing, and we should always try to get rid of it, right? Well, it’s always nice to have a clear understanding of things, and a lot of information technology, from language compilers to air traffic controllers to traffic lights, only functions well in its absence (though we’d lose a lot of our humor by banishing ambiguity altogether). But i’d claim the following (which i’ll glorify With Capital Letters) as the Best Practices for Representing Ambiguity:
- sort out and resolve any ambiguity when you can
- don’t create any spurious ambiguity when you can avoid it
- where there’s genuine ambiguity, preserve it
It’s this last practice that i want to address here: don’t guess, don’t arbitrarily pick one of several alternatives, instead find a way to represent ambiguity when it’s real.
One major knowledge representation project that provides a good model here is the Penn Treebank, a million-word corpus of English annotated with grammatical analysis. Early on, the creators recognized that syntactic theories are all over the map, and by tying the Treebank too closely to one theory or its conclusions, they’d risk being ignored by the others. So they adopted some basic practices to ensure a theory-neutral representation, including tags specifically designed to indicate “there’s an ambiguity here, and we’re not resolving it”.
Things are never black-and-white in the absence of complete information: how to decide the tricky cases? A fair rule of thumb would be to imagine you and several equally brilliant friends reviewing the case (imagine an even number of friends, so you can be the tiebreaker). Would the decision about how to resolve a given ambiguity be unanimous? Then go with the consensus and consider it settled (even though you might imagine some nameless Bible scholar somewhere disagreeing about it). Would the vote be close? Then see if there’s a way to capture the other possibility, rather than forcing a decision.
Here’s one way this works out in practice for creating a detailed knowledgebase about people in the Bible. One of the fundamental differences between semantic search and word-based search is that you have the chance to resolve ambiguity by deciding which references to people are the same, and which are different. Take Gaius as an example: this name occurs five times in the New Testament: twice in narrative passages in Acts (Acts.19.29 and Acts.20.4), and once each in Rom.16.23, 1Cor.1.14, and the opening address of 3John.1.1. Assigning the mentions in Romans and 1Cor to the same person is pretty solid (this implies Paul was in Corinth when he wrote the letter to the Romans). But otherwise, the contexts leave open the possibility that there might be four different individuals here who just happen to share the same name (which was a common one). For example, Gaius in Acts 19 is Macedonian, but Acts 20 says Gaius of Derbe. While it’s possible some of these other mentions are the same person (and it’s always interesting to speculate on possible connections), in the absence of any solid evidence the best practice is to represent them as separate individuals.
So i’m modeling this in the Bible Knowledgebase as follows:
- when different mentions seem clearly to be the same individual (e.g. Simon Peter and Cephas), standardize on a single URI for all their properties
- when there are clearly different individuals, give them each a unique URI
- when they might be the same but the evidence is weak (for example, the other Gaius mentions), treat them as different individuals (with unique URIs), and then use the property possiblySameAs to represent this hypothetical linkage.
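As a rough illustration of the three rules above, here is a minimal Python sketch that stores the Gaius mentions as plain subject-predicate-object tuples. Everything except the property name possiblySameAs is an assumption made for the example: the namespace, the `Gaius.N` identifiers, the `mentionedIn` property, and the choice of which pairs get a possiblySameAs link are all hypothetical and not taken from the actual Bible Knowledgebase.

```python
# Hypothetical namespace and identifiers, for illustration only.
BK = "http://example.org/bk/"


def uri(name):
    """Build a full URI for a hypothetical knowledgebase entity."""
    return BK + name


triples = {
    # Romans and 1 Corinthians: confidently the same person, so one URI.
    (uri("Gaius.1"), "mentionedIn", "Rom.16.23"),
    (uri("Gaius.1"), "mentionedIn", "1Cor.1.14"),
    # The remaining mentions stay separate, each with its own URI.
    (uri("Gaius.2"), "mentionedIn", "Acts.19.29"),
    (uri("Gaius.3"), "mentionedIn", "Acts.20.4"),
    (uri("Gaius.4"), "mentionedIn", "3John.1.1"),
    # Weak evidence is recorded as a hypothesis, not as a merge.
    # Which pairs are linked here is an arbitrary example.
    (uri("Gaius.2"), "possiblySameAs", uri("Gaius.3")),
    (uri("Gaius.1"), "possiblySameAs", uri("Gaius.4")),
}


def mentions(person):
    """Return the references where this URI is mentioned, sorted."""
    return sorted(o for s, p, o in triples if s == person and p == "mentionedIn")


def possible_matches(person):
    """Return URIs linked to `person` by possiblySameAs, in either direction."""
    found = set()
    for s, p, o in triples:
        if p != "possiblySameAs":
            continue
        if s == person:
            found.add(o)
        elif o == person:
            found.add(s)
    return sorted(found)


if __name__ == "__main__":
    print(mentions(uri("Gaius.1")))
    print(possible_matches(uri("Gaius.3")))
```

The sketch treats possiblySameAs as symmetric when querying, so a link only needs to be asserted once; whether the real property behaves that way is not stated here.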
It’s much easier to join things later if you decide they’re the same: the OWL property owl:sameAs exists to accomplish that with a single assertion that two entities are definitely the same. But splitting involves sorting out all their properties and reassigning them to the correct URI, which can be tricky business. So when in doubt, they’re left separate, with indicators where they might be the same as another.
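The asymmetry between joining and splitting can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `bk:` identifiers, the `mentionedIn` and `from` properties, and the `merge`, `split`, and `owner_of` helpers are all hypothetical, and a real triple store would more likely keep an owl:sameAs assertion than rewrite URIs in place.

```python
def merge(triples, keep, drop):
    """Join two entities: mechanically rewrite every `drop` URI to `keep`."""
    def swap(term):
        return keep if term == drop else term

    rewritten = {(swap(s), p, swap(o)) for s, p, o in triples}
    # A possiblySameAs link between the two is now a self-link; discard it.
    return {
        (s, p, o)
        for s, p, o in rewritten
        if not (p == "possiblySameAs" and s == o)
    }


def split(triples, merged_uri, owner_of):
    """Undo a join: `owner_of` must decide the owner of every single triple.

    Only the subject position is handled in this sketch.
    """
    return {
        (owner_of(p, o) if s == merged_uri else s, p, o)
        for s, p, o in triples
    }


# Hypothetical data: two Gaius entities kept separate, with a tentative link.
separate = {
    ("bk:Gaius.2", "mentionedIn", "Acts.19.29"),
    ("bk:Gaius.2", "from", "Macedonia"),
    ("bk:Gaius.3", "mentionedIn", "Acts.20.4"),
    ("bk:Gaius.3", "from", "Derbe"),
    ("bk:Gaius.2", "possiblySameAs", "bk:Gaius.3"),
}

# Joining needs one decision and no knowledge of the individual properties.
joined = merge(separate, keep="bk:Gaius.2", drop="bk:Gaius.3")


def owner_of(predicate, obj):
    """Splitting needs outside knowledge about each property's true owner."""
    return "bk:Gaius.3" if obj in {"Acts.20.4", "Derbe"} else "bk:Gaius.2"


restored = split(joined, "bk:Gaius.2", owner_of)
```

Note that `merge` never looks at what the properties mean, while `split` cannot work without `owner_of`, which encodes exactly the per-property judgment that makes splitting tricky.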
There’s another extreme to this general principle, which is to be so afraid of making any commitments that you leave everything ambiguous. Obviously this doesn’t work well either (you wind up saying nothing), and deciding which is which is as much art as science. But as a generalization, i’d always prefer to err on the side of conservatism and joining later rather than risk losing ambiguity when it’s real.