This is mostly a working note for myself of a sequence of steps i frequently go through in looking at a dataset and trying to understand its characteristics. Jon Udell gets credit (in my consciousness, at least) for the notion of a “software idiom” (a series of repeated steps to accomplish some purpose), and also for encouraging the narration of the work we do as a means of transmitting knowledge to others. I happen to use Python in the examples that follow, though they would work equally well in Perl or any number of other languages. Python works especially well, though, because the interpreter makes the process highly interactive: the meta-process for Python development i tend to use is

  • play with the individual steps until you get them right
  • tie them together into small bundles (this is made easier when you can retrieve them with a history mechanism in your IDE)
  • if you might use the whole package again, tidy it up and persist it as code in your library.

I haven’t turned the data exploration idioms here into code because they’re always just a little different, and they always seem simple enough that it’s not worth trying to generalize them (though doubtless many others have, and i’ve just never stumbled over their code). But i want to narrate my process with this post, if only to help me remember it (and if it helps you too, so much the better).

The Data

In the basic data exploration scenario, you’ve got a list of items, a variable-length list of attributes for each item, and you’d like to summarize it in some fashion. For this discussion, i’ll use my most recent experience with our Biblical People data: there are about 3000 people, with anywhere from one to hundreds of references to them in the Biblical text. But this is a very simple and general data model that fits lots of different cases. For this example, the data looks something like this:

Person ID    References
Aaron        Ex 4:14, Ex 4:27, Ex 4:28, Ex 4:29, (and many many more)
Abagtha.1    Esth 1:10
Abda.1       1Kgs 4:6
Abda.2       Neh 11:17

and so forth (the numbers on the end of the Person IDs are for convenience in referring to different people who share the same name; they could just as easily be bp1, bp2, etc.).

By far the most flexible way to represent this kind of data is in a Python dictionary whose keys are the person IDs (this assumes they’re unique), and whose values are the list of references (assume for this purpose those are strings like “Ex 4:14”, though other representations are possible). The rest of my discussion assumes you’ve got such a dictionary, named allrefs (you’re on your own for constructing it). Already you’ve got one useful summary of the data in the dictionary itself: the length of the dictionary is the number of keys (which, because of the way dictionaries work, are all distinct), 2986 in all.

>>> len(allrefs)
2986
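
If your source data happens to live in something simple like a tab-separated file (one person per line, with a comma-separated list of references), constructing allrefs only takes a few lines. The file name and column layout here are just placeholders for whatever you actually have:

>>> allrefs = {}
>>> for line in open('biblical_people.tsv'):
...     # hypothetical format: person ID, a tab, then a comma-separated reference list
...     person, refs = line.strip().split('\t')
...     allrefs[person] = refs.split(', ')
...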

Idiom #1: Find Singletons

A number of scholars (notably Pareto and Zipf) have contributed to the general observation that many events in the human sphere, rather than being normally distributed (the traditional bell curve), have a power law distribution, with a few very high frequency items, and then (to use Chris Anderson’s term) a long tail of very low frequency items. A rule of thumb is that roughly half the items occur just once (“singletons”) in such distributions. Identifying the singletons (and counting them) is often an important part of exploring the data.

For this task, the singletons are those with only a single reference, i.e. those keys for which the length of the value list is 1.

>>> singletons = filter(lambda (k,v): len(v) == 1, allrefs.items())
>>> len(singletons)
1635

Each item of singletons is now a (key, value) tuple, restricted to those whose value list holds just a single reference, and we find there are about as many as our rule of thumb would predict (1635 is roughly half of the total of 2986). This is an important subclass of the dataset: these items dominate the data in quantity (more than half), though each individual item has little to add (only a single reference).
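
If the tuple-unpacking lambda looks opaque, a list comprehension produces the same singletons list; this is purely a stylistic alternative, not a different technique:

>>> singletons = [(k, v) for (k, v) in allrefs.items() if len(v) == 1]
>>> len(singletons)
1635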

Idiom #2: Invert the Dictionary

allrefs is organized by people IDs: so it’s easy to ask questions about a given ID, but you can’t explore what’s true of a given value (reference). So a typical next step would be to turn the data around (invert it) and organize it by reference instead. First i define a utility function extendDictToList for populating a dictionary whose values are a list rather than a single item, without having to first initialize the empty list separately. This can be done directly in one line of Python, but somehow i find my rephrasing of it here easier to grasp and use correctly.

>>> def extendDictToList(dict, key, value):
...     """Extend DICT as necessary to include KEY, adding VALUE to the
...     list of values (so KEY must be hashable)."""
...     dict.setdefault(key, []).append(value)
...
>>> foo = {}
>>> extendDictToList(foo, 'bar', 'value 1')
>>> extendDictToList(foo, 'bar', 'value 2')
>>> foo
{'bar': ['value 1', 'value 2']}

This makes it easy to accumulate several values under the same key. With this in hand, we can invert allrefs by individual references:

>>> allrefsinv = {}
>>> for (person, references) in allrefs.items():
...     for ref in references:
...         extendDictToList(allrefsinv, ref, person)
...

So now allrefsinv is a dictionary inverting allrefs, with references as keys, and a list of the people mentioned in that verse as the value for each key. This data has different dimensions from allrefs, of course: there are 9617 references, each including one or more people. You could go back to idiom #1 at this point and find the singletons in this dictionary (that is, references to verses that only mention one person) if that were useful.
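
If you’d rather not define a helper at all, the standard library’s collections.defaultdict accomplishes the same thing; here’s a sketch of the equivalent inversion (the contents come out the same, it’s just a different mechanism):

>>> from collections import defaultdict
>>> allrefsinv = defaultdict(list)
>>> for (person, references) in allrefs.items():
...     for ref in references:
...         allrefsinv[ref].append(person)   # missing keys start out as an empty list
...
>>> len(allrefsinv)
9617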

Idiom #3: Bin by Frequency

By looking at singletons, we’ve started down the road of examining the frequency of data items, and how the frequencies are distributed (noting that the special case of keys with only a single value accounts for roughly half of all the keys). While the individual singletons themselves might be interesting, it’s also useful to see how the sizes of all the values are distributed. To do this, we construct a frequency distribution from the original allrefs dictionary.

While you could do this directly using dictionaries, i find it more convenient to just use the FreqDist class in the probability module of the Natural Language Toolkit (NLTK). Digression: NLTK is a very useful package for a wide variety of language-related processing tasks (in fact, i’ll be giving a presentation about it April 26th – 27th, 2008 at LinuxFest Northwest (Bellingham Technical College, Bellingham WA)). Though it’s clearly overkill for this small problem, reusing well-defined code is a good practice. If you don’t want to mess with the whole package, just look at the source for probability.py: it looks like you could just pull out FreqDist, though i haven’t tried.

Let’s bin the inverted dictionary allrefsinv, to get a sense of how often verses refer to one or more people. Once you’ve got the FreqDist class imported (as in the code below), the steps are

  1. create an empty instance of a frequency distribution (fd)
  2. go through the value list (the people) for each key (the reference) and count how many values
  3. increment the bin count for that number

>>> from nltk.probability import FreqDist
>>> fd = FreqDist()
>>> for (ref, people) in allrefsinv.items():
...     fd.inc(len(people))
...
>>> len(fd)
15

It’s easy to get confused about this point: the keys in fd (which is just a dictionary with some additional methods) aren’t individual data items, but bins of a given size (there happen to be 15 of them). The counts for each bin represent the number of keys that have a value whose length is the bin size. So all the keys for which len(people) == 1 are counted by fd[1], the values with length 2 are in fd[2], etc. In this sample there aren’t any cases with binned size 0 (though in other data there might be: if we had that data here, it would be the verses that reference no people at all).
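
And if you don’t want to bring in NLTK at all, the plain-dictionary version mentioned earlier is nearly as short; a minimal sketch that builds the same bins:

>>> bins = {}
>>> for (ref, people) in allrefsinv.items():
...     size = len(people)                  # number of people this reference mentions
...     bins[size] = bins.get(size, 0) + 1  # one more reference falls into that bin
...
>>> len(bins)
15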

Now we can look at the counts for individual bins (fd[1] is the (inverted) singletons, references for verses that only refer to a single person), look at the maximum value for the entire distribution (it happens to be the singleton bin), and in fact print out the entire distribution.

>>> fd.samples()
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16]
>>> fd[1]
5459
>>> for sample in fd.samples():
...     print "%d\t%d" % (sample, fd[sample])
...
1 5459
2 2341
3 949
4 424
5 209
6 102
7 60
8 40
9 12
10 7
11 7
12 2
13 2
14 2
16 1

Note there’s a gap in the sequence: there didn’t happen to be any references to 15 people, so you’d have to use a different iteration construct if you want to show those zeros (a sketch of that follows below). Since this is aggregate data, you can’t directly find which verse is the whopper with 16 names in it, but filtering allrefsinv tells you it’s 1Chr 25:4.

>>> from pprint import pprint
>>> pprint(filter(lambda (k,v): len(v) == 16, allrefsinv.items()))
[('1Chr 25:4',
  ['Heman.2',
   'Bukkiah.1',
   'Mallothi.1',
   'Mahazioth.1',
   'Hanani.2',
   'Eliathah.1',
   'Hothir.1',
   'Uzziel.4',
   'Romamtiezer.1',
   'Giddalti.1',
   'Mattaniah.1',
   'Joshbekashah.1',
   'Hananiah2.3',
   'Shebuel.2',
   'Jerimoth.3',
   'Jerimoth.5'])]
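
As for the gap at 15: since fd is just a dictionary underneath, one easy way to show the zero bins is to iterate over the whole range of sizes yourself rather than over fd.samples() (this sketch assumes the ordinary dictionary get method works on fd, which follows from its being a dict with extra methods):

>>> [(n, fd.get(n, 0)) for n in range(1, 17)]
[(1, 5459), (2, 2341), (3, 949), (4, 424), (5, 209), (6, 102), (7, 60), (8, 40), (9, 12), (10, 7), (11, 7), (12, 2), (13, 2), (14, 2), (15, 0), (16, 1)]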

Of course, this is only the starting point: there’s a lot more you could do beyond this to characterize the data, and maybe your head is already swimming from the data permutations. But i’ve found this 3-step process very helpful on lots of occasions where i want to understand a data set better. Gathering these few characteristics provides a very helpful overview to get started.

PS: Jon Udell’s been encouraging the accumulation of what he terms “unexamined software idioms” under the del.icio.us tag unexamined-software-idioms. I’m not completely sure this qualifies, but i’m tagging it that way just in case.

PPS: if you’ve been following the Bibleref meme, you might notice i didn’t annotate the references above. That’s because here i’m using them as data items, not real references. But thanks for noticing (if you did) …