God’s Word | our words
meaning, communication, & technology
following Jesus, the Word made flesh
February 23rd, 2008

New Life for Old Books, and Feeding Your Book Habit

Bibliophiles have two interlocking problems:

  1. Given all my interests, how do i get more books without going broke?
  2. What do i do later with the pile of things generated by #1?

If you read a lot, just managing your list of things you’d like to read becomes an information technology challenge all its own. Amazon wishlists work alright for this purpose (and help you remember just how interested you were, if you use the priority feature), and of course Amazon makes it very easy to buy them! (in case you’re feeling generous, our Amazon wishlist is here)

When it comes to a task like software development that i’ll invest personal time in, purchasing a new book is an incredible value: an hour or two saved nearly always justifies the cost of the book. Nevertheless, i try not to go overboard, so my regular routine is

  1. check my local library (unless it’s something i need to own or use for a long period of time)
  2. check a bookswap site like PaperBackSwap or BookMooch
  3. only if those fail me, cough up the money and buy it, typically from Amazon

But i don’t yet have quite the information technology i need to make this work as smoothly as i’d like. First of all, it means i have to check multiple places. Links and browser bookmarklets make this somewhat easier: looking at a book on Amazon, i can check my local library with one click using a library lookup bookmarklet (though it sometimes misses if there’s a different edition), and check PaperBackSwap with another (i posted here about the PBS bookmarklet i created).

More recently, i’ve been trying LibraryThing as my starting point for searches: once you’ve located the book of interest, they make it easy to get to Amazon, as well as your local library (through WorldCat), and they’ll even tell you if any of their associated bookswap sites have it available (kudos to BookMooch for participating in this: boo on PaperBackSwap for not playing). Furthermore, their Universal Import lets you import your Amazon wishlist directly into LibraryThing (at which point you can go down the list and check other sources). I’ve been thinking lately about writing a little Python application to query my Amazon wishlist and then do the work for me of identifying any items that are currently available from a non-purchase source.

What about problem #2, getting rid of books you don’t want anymore? We’ve had some success lately with a hybrid strategy:

  • First, estimate whether there’s continued interest in purchasing this book by checking the used market at Amazon. Powell’s makes this even easier: you can type a lengthy list of ISBNs into their page, and they’ll tell you which ones they’ll buy (which is a pretty good estimate of whether there’s still a market for the book). If there’s a market for the book, you have nothing to lose other than a little effort by listing it on Amazon’s site. Once you’ve set up a seller account, you can list them for 60 days for free. They get a cut if you sell it, but any sale means more money in the kitty for future book purchases!
  • If there’s no market for the book, then we list it on a book swap site. When people want them, we have to pay to mail them, but then we get credits so we can request future swaps. It averages out to a couple of bucks a book, which is a steal if you can find ones you want.
  • If nobody wants it on the swap site and it’s not out of date, your local library might be interested in a donation.
  • If you absolutely positively have to throw away a book (sniff), at least make sure you recycle it!

(This post was motivated by Phil Gon’s latest post at the Logos blog on building your digital library by reselling your paper books. He describes a great way to finance your purchase of Logos software, and you’ll be much happier with a digital reference library next time you have to move it!)

February 21st, 2008

Software Idioms for Data Exploration in Python

This is mostly a working note for myself of a sequence of steps i frequently go through in looking at a dataset and trying to understand its characteristics. Jon Udell gets credit (in my consciousness, at least) for the notion of a “software idiom” (a series of repeated steps to accomplish some purpose), and also for encouraging the narration of the work we do as a means of transmitting knowledge to others. I happen to use Python in the examples that follow, though they would work equally well in Perl or any number of other languages. Python works especially well, though, because the interpreter makes the process highly interactive: the meta-process for Python development i tend to use is

  • play with the individual steps until you get them right
  • tie them together into small bundles (this is made easier when you can retrieve them with a history mechanism in your IDE)
  • if you might use the whole package again, tidy it up and persist it as code in your library.

I haven’t turned the data exploration idioms here into code because they’re always just a little different, and they always seem simple enough that it’s not worth trying to generalize them (though doubtless many others have, and i’ve just never stumbled over their code). But i want to narrate my process with this post, if only to help me remember it (and if it helps you too, so much the better).

The Data

In the basic data exploration scenario, you’ve got a list of items, a variable-length list of attributes for each item, and you’d like to summarize it in some fashion. For this discussion, i’ll use my most recent experience with our Biblical People data: there are about 3000 people, with anywhere from one to hundreds of references to them in the Biblical text. But this is a very simple and general data model that fits lots of different cases. For this example, the data looks something like this:

Person ID    References
Aaron        Ex 4:14, Ex 4:27, Ex 4:28, Ex 4:29, (and many many more)
Abagtha.1    Esth 1:10
Abda.1       1Kgs 4:6
Abda.2       Neh 11:17

and so forth (the numbers on the end of the Person IDs are for convenience in referring to different people who share the same name, but they could just as easily be bp1, bp2, etc.).

By far the most flexible way to represent this kind of data is in a Python dictionary whose keys are the person IDs (this assumes they’re unique), and whose values are the list of references (assume for this purpose those are strings like “Ex 4:14”, though other representations are possible). The rest of my discussion assumes you’ve got such a dictionary, named allrefs (you’re on your own for constructing it). Already you’ve got one useful summary of the data in the dictionary itself: the length of the dictionary is the number of keys (which, because of the way dictionaries work, are all distinct), 2986 in all.

>>> len(allrefs)
2986
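
(In case it helps to see one way such a dictionary might get built, here’s a minimal sketch. It assumes a hypothetical tab-delimited file, people.txt, with a person ID, a tab, and a comma-separated list of references on each line; your source data will almost certainly look different, so treat this as an illustration only.)

>>> allrefs = {}
>>> # each line looks like "Abda.1<TAB>1Kgs 4:6, ..." (hypothetical format)
>>> for line in open('people.txt'):
...     personid, refs = line.rstrip('\n').split('\t')
...     allrefs[personid] = [ref.strip() for ref in refs.split(',')]
...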

Idiom #1: Find Singletons

A number of scholars (notably Pareto and Zipf) have contributed to the general observation that many events in the human sphere, rather than being normally distributed (the traditional bell curve), have a power law distribution, with a few very high frequency items, and then (to use Chris Anderson’s term) a long tail of very low frequency items. A rule of thumb is that roughly half the items occur just once (“singletons”) in such distributions. Identifying the singletons (and counting them) is often an important part of exploring the data.

For this task, the singletons are those with only a single reference, i.e., those keys for which the length of the value list is 1.

>>> singletons = filter(lambda (k,v): len(v) == 1, allrefs.items())
>>> len(singletons)
1635

Each item of singletons is now a (key, value) tuple, but only those with a single value, and we find there are about as many as our rule of thumb would predict (1635 is roughly half of the total of 2986). This is an important subclass of the dataset: they dominate the data in quantity (more than half), though each individual item has little to add (only a single reference).

Idiom #2: Invert the Dictionary

allrefs is organized by people IDs: so it’s easy to ask questions about a given ID, but you can’t explore what’s true of a given value (reference). So a typical next step would be to turn the data around (invert it) and organize it by reference instead. First i define a utility function extendDictToList for populating a dictionary whose values are a list rather than a single item, without having to first initialize the empty list separately. This can be done directly in one line of Python, but somehow i find my rephrasing of it here easier to grasp and use correctly.

>>> def extendDictToList(dict, key, value):
...     """Extend DICT as necessary to include KEY, adding VALUE to the
...     list of values (so KEY must be hashable)."""
...     dict.setdefault(key, []).append(value)
...
>>> foo = {}
>>> extendDictToList(foo, 'bar', 'value 1')
>>> extendDictToList(foo, 'bar', 'value 2')
>>> foo
{'bar': ['value 1', 'value 2']}

This makes it easy to accumulate several values under the same key. With this in hand, we can invert allrefs by individual references:

>>> allrefsinv = {}
>>> for (person, references) in allrefs.items():
...     for ref in references:
...         extendDictToList(allrefsinv, ref, person)
...

So now allrefsinv is a dictionary inverting allrefs, with references as keys, and a list of the people mentioned in that verse as the value for each key. This data has different dimensions from allrefs, of course: there are 9617 references, each including one or more people. You could go back to idiom #1 at this point and find the singletons in this dictionary (that is, references to verses that only mention one person) if that were useful.
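
As a quick illustration, the same filter from idiom #1 works unchanged on the inverted dictionary; the count it gives (references naming exactly one person) agrees with the size-1 bin we’ll see in idiom #3 below.

>>> singlepersonrefs = filter(lambda (k,v): len(v) == 1, allrefsinv.items())
>>> len(singlepersonrefs)
5459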

Idiom #3: Bin by Frequency

By looking at singletons, we’ve started down the road of examining the frequency of data items, and how the frequencies are distributed (noting that the special case of keys with only a single value accounts for roughly half of all the keys). While the individual singletons themselves might be interesting, it’s also useful to see how the sizes of all the values are distributed. To do this, we construct a frequency distribution from the original allrefs dictionary.

While you could do this directly using dictionaries, i find it more convenient to just use the FreqDist class in the probability module of the Natural Language Toolkit (NLTK). Digression: NLTK is a very useful package for a wide variety of language-related processing tasks (in fact, i’ll be giving a presentation about it April 26th – 27th, 2008 at LinuxFest Northwest (Bellingham Technical College, Bellingham WA) ). Though it’s clearly overkill for this small problem, reusing well-defined code is a good practice. If you don’t want to mess with the whole package, just look at the source for probability.py: it looks like you could just pull out FreqDist, though i haven’t tried.
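
In case you’d rather skip NLTK entirely, here’s a minimal dictionary-only sketch of the same binning (just an illustration, not NLTK’s implementation); the number of bins it finds matches what FreqDist reports below.

>>> bins = {}
>>> for (ref, people) in allrefsinv.items():
...     size = len(people)                   # how many people this reference names
...     bins[size] = bins.get(size, 0) + 1   # count one more reference of that size
...
>>> len(bins)
15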

Let’s bin the inverted dictionary allrefsinv, to get a sense of how often verses refer to one or more people. Once you’ve got the FreqDist class defined, the steps are

  1. create an empty instance of a frequency distribution (fd)
  2. go through the value list (the people) for each key (the reference) and count how many values
  3. increment the bin count for that number

>>> from nltk.probability import FreqDist
>>> fd = FreqDist()
>>> for (ref, people) in allrefsinv.items():
...     fd.inc(len(people))
...
>>> len(fd)
15

It’s easy to get confused about this point: the keys in fd (which is just a dictionary with some additional methods) aren’t individual data items, but bins of a given size (there happen to be 15 of them). The counts for each bin represent the number of keys that have a value whose length is the bin size. So all the keys for which len(people) == 1 are counted by fd[1], the values with length 2 are in fd[2], etc. In this sample there aren’t any cases with binned size 0 (though in other data there might be: if we had that data here, it would be the verses that reference no people at all).
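
One quick sanity check: every reference lands in exactly one bin, so the bin counts should add back up to the 9617 references we started with.

>>> sum(fd[sample] for sample in fd.samples())
9617
>>> len(allrefsinv)
9617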

Now we can look at the counts for individual bins (fd[1] is the (inverted) singletons, references for verses that only refer to a single person), look at the maximum value for the entire distribution (it happens to be the singleton bin), and in fact print out the entire distribution.

>>> fd.samples()
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16]
>>> fd[1]
5459
>>> for sample in fd.samples():
...     print "%d\t%d" % (sample, fd[sample])
...
1 5459
2 2341
3 949
4 424
5 209
6 102
7 60
8 40
9 12
10 7
11 7
12 2
13 2
14 2
16 1

Note there’s a gap in the sequence: there didn’t happen to be any references to 15 people, so you’d have to use a different iteration construct if you want to show those zeros (there’s a sketch of one after the next example). Since this is aggregate data, you can’t directly find which verse is the whopper with 16 names in it, but filtering allrefsinv tells you it’s 1Chr 25:4.

>>> from pprint import pprint
>>> pprint(filter(lambda (k,v): len(v) == 16, allrefsinv.items()))
[('1Chr 25:4',
  ['Heman.2',
   'Bukkiah.1',
   'Mallothi.1',
   'Mahazioth.1',
   'Hanani.2',
   'Eliathah.1',
   'Hothir.1',
   'Uzziel.4',
   'Romamtiezer.1',
   'Giddalti.1',
   'Mattaniah.1',
   'Joshbekashah.1',
   'Hananiah2.3',
   'Shebuel.2',
   'Jerimoth.3',
   'Jerimoth.5'])]
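
Back to that gap at 15: assuming fd really does behave like a plain dictionary (as noted above), so that .get works, one sketch of a zero-filling iteration is

>>> for size in range(1, max(fd.samples()) + 1):
...     print "%d\t%d" % (size, fd.get(size, 0))
...

which prints the same table as before, with an added zero row for 15.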

Of course, this is only the starting point: there’s a lot more you could do beyond this to characterize the data, and maybe your head is already swimming from the data permutations. But i’ve found this 3-step process very helpful on lots of occasions where i want to understand a data set better. Gathering these few characteristics provides a very helpful overview to get started.

PS: Jon Udell’s been encouraging the accumulation of what he terms “unexamined software idioms” under the del.icio.us tag unexamined-software-idioms. I’m not completely sure this qualifies, but i’m tagging it that way just in case.

PPS: if you’ve been following the Bibleref meme, you might notice i didn’t annotate the references above. That’s because here i’m using them as data items, not real references. But thanks for noticing (if you did) …
