Data Mining Your Tweets

If you revel in data (and who doesn’t! ;-), check out Technology Review’s article on How to Use Twitter for Personal Data Mining. Even if (like me) you’re not a twitterer (or do you call that a twit?), this is a nice introduction to data mining using nothing more complex than text editing and a simple visualization tool.
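
For flavor, here’s a minimal sketch of the same kind of exercise in Python: tally the words you use most in an exported tweet archive, then feed the counts to whatever visualization tool you like. (The filename and column name are my assumptions, not the article’s recipe.)

```python
# A minimal sketch, not the article's exact recipe: count word frequencies
# in an exported tweet archive. "tweets.csv" and the "text" column are
# assumptions; adjust them to whatever export format you actually have.
import csv
from collections import Counter

counts = Counter()
with open("tweets.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        words = (w.strip(".,!?:;\"'()").lower() for w in row["text"].split())
        counts.update(w for w in words if w)

# the top 25 terms are a decent starting point for a word cloud or bar chart
for word, n in counts.most_common(25):
    print(n, word)
```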

The techniques described here are applicable to lots of other data sets too (including perhaps the promised history of your Facebook wall posts: my account doesn’t have this feature yet, so i don’t know what this will look like).

Even if it seems silly to analyze what you yourself have said, it’s worth thinking about the data tracks we all leave around now, and what we (and others!) might do with them.

Skills vs. Education

Great piece from Michael Schrage at the Harvard Business Review blog: Higher Education is Overrated. Skills Aren’t. I’m no basher of formal credentials: i’ve even got a couple myself. But i too have been frustrated (multiple times) by computer science majors who aren’t effective programmers, to take one example.

His bottom line is that skills and accomplishment are really the coinage of business, and there’s no guarantee they follow directly from formal education (valuable though it might be).

Some other quotable quotes:

  • “Knowledge may be power, but ‘knowledge from college’ is neither predictor nor guarantor of success.”
  • “Treating education as the best proxy for human capital is like using patents as your proxy for measuring innovation.”
  • “Academic and classroom markets are profoundly different than business and workplace markets. Why should anyone be surprised that serious knowledge/skill gaps dominate those differences?”

Information Moving To The Web

We all know there’s a massive shift of information onto the Internet, with Google Books scanning whole libraries, more content being born digital, the transformation of digital libraries, and tera-peta-exa-zeta-yotta-yadayadayada-bytes of data going online. But somehow, those abstract notions don’t have quite the same tangible impact as actual physical artifacts (like books) with their connections to our personal histories. Here’s how this hit home for me today.

I first got interested in computational linguistics around 1979, when i was finishing up my degree at Occidental College (an independent major combining linguistics and anthropology) and playing around with computers. Later, as a graduate student in linguistics at UCLA, i attended my first academic conference in the field: COLING 84 at Stanford, a combined gathering of the 10th International Conference on Computational Linguistics and the 22nd Annual Meeting of the Association for Computational Linguistics. It was a pretty heady experience for this young man: i still remember playing with the bit-mapped graphics on what i think was a Xerox Star, one of the earliest commercial systems with many of the display and interface innovations that are commonplace today.

I brought back the proceedings, a hefty volume about 3cm thick. Later i joined the Association for Computational Linguistics, which included getting the journal Computational Linguistics, and over the course of my 19 years with BBN Technologies i attended many annual meetings and other workshops, collecting proceedings all the time (they started distributing them on CDs around 2000). I have a nearly complete collection of the journal spanning many years (dozens of volumes). I count 16 proceedings volumes, typically several cm each. All told, these were taking up about a meter of shelf space in my office, as they have for the last 10 years or so (the last one i have is from 2000, which is about when i got more involved in management and had a harder time justifying these kinds of technical conferences).

Today, casting about for a place to put some new books i’d acquired, i looked at these journals and proceedings, and had an epiphany. I googled a few articles: sure enough, they were all on-line. In fact, the journal became open access in 2009, and they’ve put all the back issues on the web as well. The ACL Anthology hosts thousands of computational linguistics papers, and they’ve provided digital versions of all the proceedings i have (and many many others). So all of a sudden, i realized i had a meter of useless paper volumes on my bookshelf.

You might wonder what took me so long. I do too: I guess one answer is simply inertia. I’ve had these volumes on my shelves for so long i hadn’t gotten around to reconsidering whether i really needed them. I’m also an information omnivore, so i’ve always been reluctant to just give them up (though i couldn’t tell you the last time i actually cracked the cover on one). Another reason, i suppose, is that having a shelf of professional journals and proceedings makes me feel smarter (silly though that sounds when said out loud): it’s evidence of many years of commitment to the field. In the digital age, these markers of industriousness are becoming as scarce as the artifacts themselves.

Some of these volumes have moved with me many times, from Los Angeles to Massachusetts when i took my first research position with BBN (1987), through various office moves there, when we moved to Maryland in 2000 (and more internal moves there), and when we moved to the northwest to work for Logos in 2007. That first COLING volume has been on my office bookshelf as long as i’ve had an office with bookshelves! But, with ever more information on-line (and much more findable and useful there), new books that need to find a home, and doubtless other office moves ahead … it’s time to let go and continue the march into the digital future.

Irresponsible Retirement

A colleague recently described to me a professional meeting he attended for an industry that’s experiencing tremendous market pressures due to changes in technology. He characterized the attitudes of many old-school, late-career executives (who have been living in denial of the fundamental challenges) as “I just hope I can prop things up and keep them running for another 5 years so I can retire.”

Using retirement as an excuse for ignoring a challenge to your business is bad stewardship. If you’re in that kind of industry, you ought to either work to revive or redirect it (until the day you retire for the right reasons), or just be honest and quit now. It’s one thing to come to the end of your working career and retire because it’s time for you personally to do so. Industries change and die, and those kinds of transitions are normal too (though traumatic): maybe you need to acknowledge that and start moving your company to whatever comes next. But if you work for a company with customers, assets, and shareholders, you owe it to them to do the best you can with what’s been entrusted to you. Riding the train up to a washed-out bridge, knowing that you can jump off at the last minute (even though all the other passengers are going down), is just plain irresponsible.

My Django Talk at LinuxFest

Apparently i neglected to let Blogos readers know that i was speaking at LinuxFest Northwest this past weekend: my bad! My talk was a basic practical intro to Django, the Python-based web application framework, entitled “From 0 to Website in 60 Minutes – with Django”. Since Django is touted (rightly in my view) as a highly productive way to do web development, what better way to demonstrate that than to actually build a functioning database-backed website in the course of the talk?

It was a pretty ambitious goal, and i had to take a few shortcuts to pull it off (like starting past the boring stuff, with Python/Django/MySQL already installed, and data ready to go). But i think i can fairly claim to have delivered what i promised. We walked through an application that’s been a side project for the Whatcom Python Users Group, a web version of Sustainable Connections’ Food and Farm Finder brochure. It’s a nice simple learning example, well suited to tutorial purposes. I’d say there were at least 40 or so in attendance, many the kind of beginners i was trying to focus on. And even though the time slot turned out to only be 45 minutes, I finished with several minutes to spare (in retrospect, i could have gone a little slower).
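
To give a flavor of why “0 to website” is even plausible, here’s a minimal sketch in the spirit of that demo (the app name, model, fields, and template path are all made up for illustration, not the actual Food and Farm Finder code):

```python
# models.py -- a made-up model in the spirit of the demo
from django.db import models

class Farm(models.Model):
    name = models.CharField(max_length=100)
    products = models.TextField(blank=True)

# views.py -- hand every Farm row to a template; the ORM writes the SQL
from django.shortcuts import render_to_response  # Django 1.x-era shortcut
from farms.models import Farm                    # "farms" is an assumed app name

def farm_list(request):
    return render_to_response("farms/list.html", {"farms": Farm.objects.all()})
```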

Slides are here on the main page for the talk, along with the data you need to follow them. I have audio of the talk that i’ll post in the next day or two once i’ve cleaned it up a bit: then it will be almost like being there (though without the ability to make sense of the “skeleton” joke). I was glad to have the opportunity to shine a little light on Django and repay a tiny portion of the debt of gratitude i owe its creators, since it’s been a major productivity boost in my work at Logos.

Here’s another reason why i give talks whenever i get the chance: you always learn more when you teach others. As a concrete example, i was reminded while prepping the talk that Django’s template framework, while primarily designed around HTML generation, is quite general and therefore capable of generating other data formats as well. At work, i’d built up an entire module of custom code around serializing Bible Knowledgebase data as XML for internal hand-off to our developers. Re-reading the Django book (The Definitive Guide to Django) gave me the idea of using Django templates to do this instead. In fairly short order, i was able to replace my test example’s 80 lines of custom code with a single clean template and 20 much simpler lines.
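
Here’s a minimal sketch of the idea (with invented template and field names, not the actual Bible Knowledgebase code): the template system is perfectly happy to emit XML.

```python
# templates/people.xml (invented) might contain:
#   <people>{% for p in people %}
#     <person id="{{ p.id }}"><name>{{ p.name }}</name></person>
#   {% endfor %}</people>
from django.template.loader import render_to_string

def people_to_xml(people):
    # inside a configured Django project, the template loader doesn't care
    # that the output is XML rather than HTML
    return render_to_string("people.xml", {"people": people})
```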

A Python Interface for api.Biblia.com

Last week at BibleTech, Logos announced a public API for their new website, Biblia.com. Of course, i want to wave the flag for my employer. But i’m also interested as somebody who’s dabbled in Bible web services in the past, most notably with the excellent ESV Bible web service (many aspects of which are mirrored in the Biblia API: some previous posts around this can be found here at Blogos in the Web Services category). Dabblers like me face a perennial problem: the translations people most want to read are typically not the most accessible via API, or come with various other limitations.

So i’m happy with the other announcement from BibleTech last week: Logos is making the Lexham English Bible available under very generous terms (details here). The LEB is in the family of “essentially literal” translations, which makes it a good choice for tasks where the precise wording matters. And the LEB is available through the API (unlike most other versions you’re likely to want, at least until we resolve some other licensing issues).

I don’t want to do a review of the entire API here (and it will probably continue to evolve). But here are a couple of things about it that excite me:

  • The most obvious one is the ability to retrieve Bible text given a reference (the content service). Of the currently available Bible versions, the LEB is the one that interests me the most here (i hope we’ll have others in the future).
  • Another exciting aspect for me is the tag service. You provide text which may include Bible references: the service identifies any references embedded in it, and then inserts hyperlinks for them to enrich the text. So this is like RefTagger on demand (not just embedded in your website template). You can also supply a URL and tag the text that’s retrieved from it. One caveat with this latter functionality: if you want to run this on HTML, plan to do some pre-processing first (see the sketch after this list) rather than treating it all as one big string. Otherwise random things (like “XHTML 1.0” in a DOCTYPE declaration) wind up getting tagged in strange ways (like <a href="http://ref.ly/Mal1">ML 1.0</a>).
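
Here’s a minimal sketch of the kind of pre-processing i mean, using BeautifulSoup (my choice for illustration, not something the API requires): hand the tagger only the visible text, not the markup.

```python
# A minimal sketch: keep only the human-readable text nodes so strings like
# "XHTML 1.0" in a DOCTYPE never reach the reference tagger.
from bs4 import BeautifulSoup
from bs4.element import Comment, Doctype

def visible_text_chunks(html):
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(string=True):
        if isinstance(node, (Comment, Doctype)):
            continue                              # skip comments and the DOCTYPE
        if node.parent.name in ("script", "style"):
            continue                              # skip code and CSS
        text = node.strip()
        if text:
            yield text                            # tag each chunk, then reassemble
```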

I’ve just started working through the Biblia API today, but since i’m a Pythonista, developing a Python interface seemed like the way to go. This is still very much a work in progress, but you can download the code from this zip file and give it a whirl. Caveats abound:

  • I’ve only implemented three of the services so far: content() (retrieves Bible content for a reference), find() (lists available Bibles and metadata), and tag() (finds references in text and enhances it with hyperlinks); a usage sketch follows this list. And even with these three services, i haven’t supported all the parameters (maybe i will, maybe i won’t).
  • This is my first stab at creating a Python interface to an API, so there may be many stylistic shortcomings.
  • Testing has also gotten very little attention, and bugs doubtless remain.
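
To give the flavor of using it, here’s a hypothetical session (the module and class names, and the exact parameters, are illustrative rather than necessarily what’s in the zip file):

```python
# A hypothetical session; the module and class names (and parameters) are
# illustrative, not necessarily what's in the zip file.
from biblia import Biblia  # assumed import

api = Biblia(key="YOUR_API_KEY")
print(api.find())                                   # available Bibles and their metadata
print(api.content("John 3:16", bible="LEB"))        # the LEB text for a reference
print(api.tag("A note on Rom 8:28 and Phil 4:13"))  # the same text with hyperlinked references
```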

If you’re interested and want to play along, let me know: we can probably set up a Google group or something for those who want to improve this code further.

Holy Week Visualization

If you’re thinking through the events of Holy Week, let me know what you think about this visualization that i created last year (but apparently failed to tie into the SemanticBible navigation, so you might not easily find it otherwise).  Here’s my previous Blogos post on this. I’m really interested in presentations like this that enable browsing by content rather than having to know the reference in advance.

To recap some of the features:

  • Colored blocks are grouped together by pericope so the presentation is organized by the events, rather than the order of texts themselves. The size of the block indicates how many words are associated with the pericope, and the colors indicate which Gospel provided the material. This helps you immediately see things like the fact that all four Gospels provide quite a bit of detail about the triumphal entry, though only Luke includes Jesus’ sorrow over Jerusalem.
  • The blocks are grouped by day, though the chronology is uncertain in several places, so this is an approximation at best.
  • Clicking on the pericope title takes you to the Composite Gospel page (though apparently some of the indexes are off). Clicking on the colored block takes you to the source text at bible.logos.com (and a tooltip indicates the reference). As i recall, i couldn’t figure out a way to use RefTagger to actually display the text more directly in a popup.
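
If you’re curious what drives a display like this, here’s a minimal sketch of the kind of record each colored block boils down to (the field names and numbers are illustrative, not the actual SemanticBible data):

```python
# Each colored block is essentially one Gospel's contribution to one pericope;
# the values here are illustrative, not the actual SemanticBible data.
pericope_blocks = [
    {
        "day": "Sunday",                 # blocks are grouped by day on the page
        "pericope": "Triumphal Entry",   # and by event within the day
        "gospel": "Luke",                # determines the block's color
        "reference": "Luke 19:28-40",    # target of the click-through link
        "word_count": 233,               # determines the block's size
    },
    # ... one entry per Gospel per pericope ...
]
```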

BibleTech:2010 Debrief

The BibleTech conference is an annual highlight for those of us who work at the intersection of Bible stuff and technology, and last week’s meeting in San Jose was no exception. This was the third BibleTech — i’ve been fortunate to have attended (and presented at) them all — and there’s always a great mix of new ideas, updates on ongoing projects, and lots of interesting people to talk to. (Some other reviews: Rick Brannan, Mike Aubrey, Trey Gourley.)

Some of the talks i liked best this year:

  • I was already interested in Pinax before hearing James Tauber’s talk on Using Django and Pinax for Collaborative Linguistics: now i’m itching to get started!
  • Stephen Smith had a nice analysis of the most frequently tweeted Bible passages (though the evidence of vast swaths of Scripture that get very little attention was perhaps a bit depressing).
  • Neil Rees showed Concordance Builder, a program that lets you use a Swahili concordance to bootstrap one for Welsh (or any other pair of languages) with no linguistic knowledge. Building on the Paratext tool, it leverages the verse indexes along with approximate string matching and statistical glossing (technical paper by J D Riding) to produce results that are about 90-95% correct out of the box; a toy sketch of the co-occurrence idea follows this list. This can reduce concordance development to a matter of weeks rather than years.
  • There were several talks related to semantics in addition to mine: Randall Tan talked about more automated methods and fleshed them out relative to the higher-level structure of Galatians, and Andi Wu gave what looked like a really interesting presentation on semantic search based on syntax and cross-language correspondence (alas, i missed it).
  • Weston Ruter talked about APIs they’re developing at OpenScriptures.org (and brought in the Linked Data idea). Logos also unveiled their new API for Biblia.
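
Here’s a toy sketch of the co-occurrence idea behind that kind of statistical glossing (my own illustration, not Concordance Builder’s actual algorithm): score candidate translation pairs by how often the two words show up in the same verse.

```python
# A toy sketch, not Concordance Builder's actual algorithm: score candidate
# glosses by how often a source word and a target word occur in the same verse.
from collections import Counter, defaultdict

def gloss_scores(aligned_verses):
    """aligned_verses: iterable of (source_tokens, target_tokens) pairs,
    one pair per verse, already tokenized."""
    src_count, tgt_count = Counter(), Counter()
    pair_count = defaultdict(Counter)
    for src_tokens, tgt_tokens in aligned_verses:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        for s in src_set:
            pair_count[s].update(tgt_set)
    # Dice coefficient: near 1.0 when two words almost always co-occur
    return {s: {t: 2.0 * n / (src_count[s] + tgt_count[t])
                for t, n in partners.items()}
            for s, partners in pair_count.items()}
```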

I felt my talks went well and i got some good feedback. My slides are now posted (if you wrote down URLs at the conference, i didn’t get them quite right 🙁 but here they’re correct):

(As with some previous talks, i did my presentation with Slidy (previous post): i feel like it’s going a little more smoothly each time.)

Shoutout for Audacity (and FOSS)

Since so much of my day involves pressing against intransigent data problems until they (or i) yield, i love it when things “just work”. I had such an experience a few months back with Audacity, an open-source audio recorder/editor. So i want to give a little back to the Free and Open Source Software (FOSS) movement with some well-deserved praise.

I’ve used Audacity before for some home recording projects, and it’s one of the most popular projects on SourceForge, so nobody who knows about it is likely to be surprised. My task this time was to find a way to convert more than 6000 audio files in WAV format to MP3 (so they take less space: if you’re a Logos customer, you’ll be hearing more about these — literally — in a future update). I really did not want to do

  1. open file
  2. select export
  3. fiddle with parameters
  4. open save dialog
  5. pick a filename
  6. hit save

times 6000!

A quick Google showed that the current Audacity beta provides a batch processing feature. I downloaded it without a hitch. The download page helpfully pointed out i also needed to get an MP3 encoder library (which i remembered from a previous install): also no problems.

First hitch: the documentation here is a little off; there is no Batch tab on the Preferences dialog. Another 60 seconds of searching on the Wiki site found this page with the correct information: you do File > Edit Chains to set up the processing sequence, and then Apply Chain to apply it.

I tried a few files, and it seemed to work okay. When i audaciously tried to do all 6667 files in one go, there was some problem (but that really seemed like too big a bite anyway). So i backed off to groups of a thousand or so. I hadn’t even noticed there were some non-audio files in the directory: Audacity understandably barfed on these, and i had to restart the process after their failures. There were a few other glitches with temp files that couldn’t be saved, but i just kept restarting things.
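
(For what it’s worth, the same conversion can also be scripted; here’s a minimal sketch, which is not what i actually did, assuming the lame command-line encoder is installed. Skipping files that already have an MP3 makes restarting after a failure cheap.)

```python
# A minimal sketch of scripting the conversion instead of Audacity's chains;
# not what i actually did, and it assumes lame is on your PATH.
import subprocess
from pathlib import Path

for wav in sorted(Path("wav_files").glob("*.wav")):   # non-audio files are ignored
    mp3 = wav.with_suffix(".mp3")
    if mp3.exists():
        continue                                      # already converted, so restarts are cheap
    subprocess.run(["lame", "--preset", "standard", str(wav), str(mp3)], check=True)
```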

Was it perfect? No. But come on … conversion of 6000 files took maybe an hour, and cost me nothing. How can you not like that?

Human Internet Proxies

The MIT Technology Review echoes an AP story about how, despite the proliferation of smart phones (and the digerati’s consequent obsession with them), “most wireless use is still centered on laptops”. So what do people do when they’re on the road and need something? They call a friend and ask them to look it up/book it/etc., as a human internet proxy.

Donna and i do this all the time: we don’t have web-connected phones, so if i’m driving and lost, i call her. She’s very likely to be either sitting at or within 50 feet of an Internet-connected computer, so she can relay the information back to me. Maybe not quite as cool as having my own pocket Internet, but very workable, a whole lot cheaper (no data plan), and it reinforces our relationship at the same time.

Technology Review: Info on the go for travelers without smart phones.