God’s Word | our words
meaning, communication, & technology
following Jesus, the Word made flesh
October 13th, 2010

Data Mining Your Tweets

If you revel in data (and who doesn’t! ;-), check out Technology Review’s article on How to Use Twitter for Personal Data Mining. Even if (like me) you’re not a twitterer (or do you call that a twit?), this is a nice introduction to data mining using nothing more complex than text editing and a simple visualization tool.

The techniques described here are applicable to lots of other data sets too (including perhaps the promised history of your Facebook wall posts: my account doesn’t have this feature yet, so i don’t know what this will look like).

Even if it seems silly to analyze what you yourself have said, it’s worth thinking about the data tracks we all leave around now, and what we (and others!) might do with them.

April 27th, 2010

My Django Talk at LinuxFest

Apparently i neglected to let Blogos readers know that i was speaking at LinuxFest Northwest this past weekend: my bad! My talk was a basic practical intro to Django, the Python-based web application framework, entitled “From 0 to Website in 60 Minutes – with Django”. Since Django is touted (rightly in my view) as a highly-productive way to do web development, what better way to demonstrate that than to actually build a functioning database-backed website in the course of the talk?

It was a pretty ambitious goal, and i had to take a few shortcuts to pull it off (like starting past the boring stuff, with Python/Django/MySQL already installed, and data ready to go). But i think i can fairly claim to have delivered what i promised. We walked through an application that’s been a side-project for the Whatcom Python Users Group, a web version of Sustainable Connection’s Food and Farm Finder brochure. It’s a nice simple learning example, well-suited to tutorial purposes. I’d say there were at least 40 or so in attendance, many the kind of beginners i was trying to focus on. And even though the time slot turned out to only be 45 minutes, I finished with several minutes to spare (in retrospect, i could have gone a little slower).

Slides are here, along with the data you need to follow them on the main page for the talk. I have audio of the talk that i’ll post in the next day or two once i’ve cleaned it up a bit: then it will be almost like being there (though without the ability to make sense of the “skeleton” joke). I was glad to have the opportunity to shine a little light on Django and repay a tiny portion of the debt of gratitude i owe its creators, since it’s been a major productivity boost in my work at Logos.

The Definitive Guide to Django

Here’s another reason why i give talks whenever i get the chance: you always learn more when you teach others. As a concrete example, i was reminded while prepping the talk that Django’s template framework, while primarily designed around HTML generation, is quite general and therefore capable of generating other data formats as well. At work, i’d built up an entire module of custom code around serializing Bible Knowledgebase data as XML for internal hand-off to our developers. Re-reading the Django book gave me the idea of using Django templates to do this instead. In fairly short order, i was able to rewrite my test example, 80 lines of custom code, with a single clean template and 20 much simpler lines instead.

April 2nd, 2010

A Python Interface for api.Biblia.com

Last week Logos announced a public API for their new website, Biblia.com, at BibleTech. Of course, i want to wave the flag for my employer. But i’m also interested as somebody who’s dabbled in Bible web services in the past, most notably the excellent ESV Bible web service (many aspects of which are mirrored in the Biblia API: some previous posts around this can be found here at Blogos in the Web Services category). Dabblers like me often face a perennial problem: the translations people most want to read are typically not the most accessible via API, or have various other limitations.

So i’m happy with the other announcement from BibleTech last week: Logos is making the Lexham English Bible available under very generous terms (details here). The LEB is in the family of “essentially literal” translations, which makes it a good choice for tasks where the precise wording matters. And the LEB is available through the API (unlike most other versions you’re likely to want, at least until we resolve some other licensing issues).

I don’t want to do a review of the entire API here (and it will probably continue to evolve). But here are a couple of things about it that excite me:

  • The most obvious one is the ability to retrieve Bible text given a reference (the content service). Of the currently available Bible versions, the LEB is the one that interests me the most here (i hope we’ll have others in the future).
  • Another exciting aspect for me is the tag service. You provide text which may include Bible references: the service identifies any references embedded in it, and then inserts hyperlinks for them to enrich the text. So this is like RefTagger on demand (not just embedded in your website template). You can also supply a URL and tag the text that’s retrieved from it. One caveat with this latter functionality: if you want to run this on HTML, you should plan to do some pre-processing first, rather than treating it all as one big string. Otherwise random things (like “XHTML 1.0” in a DOCTYPE declaration) wind up getting tagged in strange ways (like <a href="http://ref.ly/Mal1">ML 1.0</a>).
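
The pre-processing mentioned in the last bullet could look something like the following. This is only a minimal sketch in Python 3 using the standard html.parser module: the class and function names are made up for illustration, and it stops at pulling out the readable text (it doesn't call the tag service itself, and putting tagged text back into the page is left out).

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect only the human-readable text nodes of an HTML document."""

    SKIP = {"script", "style"}  # element contents that are never prose

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # DOCTYPE declarations, tags, and attributes never arrive here,
        # so they can't be mistaken for Bible references later on.
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def text_chunks(html):
    """Return the list of text chunks that would be worth sending for tagging."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Each chunk could then be tagged separately, so the DOCTYPE and the markup itself are never sent.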

I’ve just started working through the Biblia API today, but since i’m a Pythonista, developing a Python interface seemed like the way to go. This is still very much a work in progress, but you can download the code from this zip file and give it a whirl. Caveats abound:

  • I’ve only implemented three of the services so far: content() (retrieves Bible content for a reference), find() (lists available Bibles and metadata), and tag() (finds references in text and enhances it with hyperlinks). And even with these three services, i haven’t supported all the parameters (maybe i will, maybe i won’t).
  • This is my first stab at creating a Python interface to an API, so there may be many stylistic shortcomings.
  • Testing has also gotten very little attention, and bugs doubtless remain.

If you’re interested and want to play along, let me know: we can probably set up a Google group or something for those who want to improve this code further.

March 23rd, 2010

Shoutout for Audacity (and FOSS)

Since so many of my days involve pressing against intransigent data problems until they (or i) yield, i love it when things “just work”. I had such an experience a few months back with Audacity, an open-source audio recorder/editor. So i want to give a little back to the Free and Open Source Software (FOSS) movement with some well-deserved praise.

I’ve used Audacity before for some home recording projects, and it’s one of the most popular projects on SourceForge, so nobody who knows about it is likely to be surprised. My task this time was to find a way to convert more than 6000 audio files in WAV format to MP3 (so they take less space: if you’re a Logos customer, you’ll be hearing more about these — literally — in a future update). I really did not want to do

  1. open file
  2. select export
  3. fiddle with parameters
  4. open save dialog
  5. pick a filename
  6. hit save

times 6000!

A quick Google showed that the current Audacity beta provides a batch processing feature. I downloaded it without a hitch. The download page helpfully pointed out i also needed to get an MP3 encoder library (which i remembered from a previous install): also no problems.

First hitch: the documentation here is a little off, since there is no Batch tab on the Preferences dialog. Another 60 seconds of search on the Wiki site found this page with the correct information: you do File > Edit Chains to set up the processing sequence, and then Apply Chain to apply it.

I tried a few files, seemed to work okay. When i audaciously tried to do all 6667 files in one go, there was some problem (but that really seemed like too big a bite anyway). So i backed off to groups of a thousand or so. I hadn’t even noticed there were some non-audio files in the directory: Audacity understandably barfed on these, and i had to restart the process after their failures. There were a few other glitches with temp files that couldn’t be saved, but i just kept restarting things.
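
In hindsight, both problems (stray non-audio files and too big a bite) could be headed off before Audacity ever sees the files. Here's a minimal Python 3 sketch of that idea; the function name and the batch size are assumptions for illustration, and it only groups filenames (moving each group into its own folder for Apply Chain is left to you).

```python
from pathlib import PurePath


def wav_batches(filenames, batch_size=1000):
    """Return lists of at most batch_size WAV filenames, skipping everything else."""
    wavs = sorted(
        name for name in filenames
        if PurePath(name).suffix.lower() == ".wav"
    )
    return [wavs[i:i + batch_size] for i in range(0, len(wavs), batch_size)]
```

Something like wav_batches(os.listdir(folder)) would give groups of a thousand or so, with anything that isn't a .wav file quietly left out.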

Was it perfect? No. But come on … conversion of 6000 files took maybe an hour, and cost me nothing. How can you not like that?

February 11th, 2010

Bookmarklets Redux

Time spent on the web can be oh-so tedious if you’re constantly cutting things from one page and pasting them elsewhere just to get to another, related page. Someday Linked Data may make this all better, but until then, we all get by with helpful tricks.

Bookmarklets are one essential weapon in the arsenal of the web-info-warrior. Usually they’re little JavaScript programs stored as a bookmark in Firefox, providing one-click access to some simple functionality like looking things up elsewhere, resizing your window, etc. I’ve blogged previously about bookmarklets to find local library sources for a book on an Amazon page (or PaperBackSwap).

I dusted off my bookmarklet skills this past week and came up with some nifty tools that i wanted to share.

First off, imagine you’re looking at a website with Bible references whose benighted author somehow failed to include RefTagger. So rather than a nice pop-up with the text of the reference, or even a helpful link to that text on some Bible site, you’re just looking at an inanimate, unlinked string: boo. The Bible Reference Bookmarklet to the rescue! Simply select the text of the reference, click the bookmarklet, and you’ll be whisked off to that reference at Bible.Logos.com. If you haven’t selected any text first, you get a dialogue box asking for it.

To get this goodie in Firefox, first make sure the Bookmarks Toolbar is showing (View > Toolbars > Bookmarks Toolbar must be checked). I’d love to give you a link to just drag onto the toolbar, but i don’t seem to be able to get the code past WordPress. So go to Bookmarks > Organize Bookmarks, and select Organize > New Bookmark. Give it a useful name like “Bible Reference Lookup”, and paste the code below in the Location field.

javascript:(function(){%20function%20getSearchString%20(promptString)%20{%20s%20=%20null;
if%20(document.selection%20&&%20document.selection.createRange)%20{%20s%20=document.selection.createRange().text;%20}%20
else%20if%20(document.getSelection)%20{%20s=%20document.getSelection();%20}%20
if%20(!%20(s%20&&%20s.length))%20{%20s%20=prompt(promptString,'');%20}
%20return%20s;%20}%20searchString%20=%20getSearchString('Bible%20Reference%20to%20look%20up%20:');%20
if%20(searchString%20!=%20null)%20{%20if(searchString.length)%20{%20location%20='http://bible.logos.com/#ref='+escape(searchString);%20}%20
else%20{%20location%20='http://bible.logos.com/';%20}%20}%20%20})();

After you’ve clicked ok, you should see it on your toolbar.

You can do similar tricks for a wide variety of strings that you just want to look up elsewhere (i discovered one here while writing this post that lets you look up articles on Wikipedia). This isn’t fundamentally different from copying the string into a search box: but sometimes it’s more convenient.

Descending into more esoteric purposes (to give you ideas for your own bookmarklets): as part of an earlier post on Tools for Personal Knowledge Management, i mentioned my use of TiddlyWiki for quick organization of hyperlinked notes. Like other wiki software, TiddlyWiki has its own link syntax, that looks like

[[Link text | URL]]

When linking to lots of other web pages, i was getting tired of copying the URL, pasting that in, then typing the square brackets, link text, vertical bar, and more square brackets, all in the right format. Wouldn’t it be more convenient to just construct this expression from the title of the page and its URL, rather than having to type it myself? YES! and the TiddlyWiki Page Link bookmarklet does just that, putting the result in a little pop-up window where a triple-click selects the whole thing, ready to copy and paste into your tiddlywiki (and tailor as desired: the title isn’t always what you want, but it’s often easier to edit and throw things out rather than type afresh). This one you can just drag to your bookmarks toolbar and use right away.

TiddlyWiki Page Link
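
The same construction is easy to do outside the browser too, for instance when turning a list of titles and URLs into notes. A minimal Python 3 sketch follows; the function name is made up, the spaces around the vertical bar are left out as a choice of this sketch, and the character swapping is an assumption about what would otherwise confuse the link syntax.

```python
def tiddlywiki_link(title, url):
    """Format a page title and URL in the [[Link text|URL]] pattern."""
    # Square brackets and bars inside the title would likely collide with
    # the link delimiters, so swap them for harmless characters first.
    safe_title = (
        title.replace("[", "(").replace("]", ")").replace("|", "-").strip()
    )
    return "[[%s|%s]]" % (safe_title, url)
```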

Also, i’ve switched to a much better library lookup bookmarklet (and a service to help you create one for your local library) from WorldCat. Among other things, it generates the list of all the different ISBNs that might exist for a title (which can be very long indeed), and when there are many, it provides links for alternate searches in case the first group comes up empty handed.

Some other cool bookmarklets in my collection include:

  • CiteULike Popup Post and kin to make it easy to add (certain kinds of) articles to your reading list management. Adds more value for sources whose structure it understands.
  • Show del.icio.us citations of the current URL (you can find it there)
  • Resize your browser window to 1024 x 768 (if you want to see how a page will look on a smaller monitor or projector): the bookmarklet follows, just drag to your toolbar. 1024 x 768
  • A CSS validator for the current page: see Pete Freitag’s page.

November 1st, 2009

Bible Chatbots

Suppose you had a database listing authorship and reported speech in the Bible, so that, for each set of words, you know who said or wrote them (the ESV folks did this using Amazon’s Mechanical Turk a few years back, and Jim Albright’s Dramatizer has similar data embedded in it). I assume the speakers have standardized identifiers.

Now imagine a matching algorithm (there are lots of candidates out there) that, when provided with either a question or a list of words, and optionally a speaker, retrieves passages that best match the input.

Example: “why does God allow evil?” might return

  • Eliphaz the Temanite: Job 15:14-16
  • the woman of Tekoa: 2 Sam 14:14
  • the apostle John: 1 John 3:11-17

Querying about “what about God and evil?” with speaker=Jesus might (in the best case) give answers like

  • Matt 5:45
  • Matt 12:35

Apart from how accurate such answers might be (that depends on the sophistication of the matching algorithm), you’ve now got the engine for a chatbot that gives Biblical “answers”. Aside from perhaps being an interesting hack, would this be useful? Lazyweb, are you listening?
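
To make the idea concrete, here is about the crudest matching algorithm imaginable, as a Python 3 sketch: it ranks passages by how many query words they share with a keyword list. The references come from the examples above, but the speaker identifiers, keyword lists, and stopwords are all invented for illustration; real speaker data and a real matcher would be far richer.

```python
import re

# Hypothetical sample data: the keyword lists are made up, not real annotations.
PASSAGES = [
    {"ref": "Job 15:14-16", "speaker": "Eliphaz",
     "keywords": {"man", "pure", "evil", "god"}},
    {"ref": "2 Sam 14:14", "speaker": "WomanOfTekoa",
     "keywords": {"die", "god", "banished"}},
    {"ref": "Matt 5:45", "speaker": "Jesus",
     "keywords": {"sun", "rain", "evil", "good"}},
    {"ref": "Matt 12:35", "speaker": "Jesus",
     "keywords": {"good", "evil", "treasure"}},
]

STOPWORDS = {"why", "does", "what", "about", "and", "the", "a"}


def match(query, speaker=None, passages=PASSAGES):
    """Rank passages by shared query words, optionally filtered by speaker."""
    words = set(re.findall(r"[a-z]+", query.lower())) - STOPWORDS
    scored = []
    for passage in passages:
        if speaker is not None and passage["speaker"] != speaker:
            continue
        overlap = len(words & passage["keywords"])
        if overlap:
            scored.append((overlap, passage["ref"]))
    # Highest overlap first; sorted() is stable, so ties keep their input order.
    return [ref for overlap, ref in sorted(scored, key=lambda pair: -pair[0])]
```

With this toy data, match("what about God and evil?", speaker="Jesus") returns the two Matthew references, which says more about the hand-picked keywords than about the algorithm.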

June 16th, 2009

http://ref.ly for Bible References

My colleagues at Logos have launched http://ref.ly, a URL shortening service for Bible references: see this blog post. It provides the convenience of TinyURL (turning long unreadable URLs into something much more manageable), but unlike that service also provides readable, understandable content. Once you get past the prefix, you won’t have any trouble figuring out what verse http://ref.ly/Mk4.9 is referring to.

If you’re a Twitter person trying to shoehorn your message into 140-character tweets, you’ll like the fact that this gives you a brief and unambiguous way to both specify a Bible reference and link to the content behind it (the references resolve to the actual verse text at bible.logos.com). Since addressability matters, this is a good thing.

But it has precisely the same utility even if you’re not a Twitterhead (i’m not):

  • it clearly marks a string of characters as a Bible reference
  • it also normalizes the reference into a form that can be automatically processed

While it’s not quite a microformat, it’s really only a small step away from things like bibleref. In particular, if lots of people start using ref.ly references, it will be possible to process that content and understand things like what verses are most popular.
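
As a rough illustration of that kind of processing, here is a Python 3 sketch that pulls ref.ly references out of a pile of text and counts them. The pattern is an assumption based only on the two forms seen on this blog (like Mk4.9 and Mal1); it makes no attempt at ranges or whatever else the service accepts.

```python
import re
from collections import Counter

# Assumed pattern: "http://ref.ly/" followed by a compact reference such as Mk4.9.
REFLY = re.compile(r"http://ref\.ly/([A-Za-z0-9]+(?:\.[0-9]+)*)")


def count_refly(texts):
    """Count how often each ref.ly reference appears across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(REFLY.findall(text))
    return counts
```

Calling count_refly(tweets).most_common(10) on a list of tweet strings would then give a first cut at the most popular verses.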

What’s more, editors that recognize and automatically link URLs (like MS Outlook for HTML-based email, and MS Word) will now automatically make Bible links for you (like RefTagger does for blog posts), as long as you’re willing to tack on “http://ref.ly/” and live with the slightly non-traditional format. You don’t need to know anything about how to make a hyperlink in HTML: just a little extra syntax (14 characters, to be precise) moves these references toward much greater usefulness.

June 12th, 2009

Reading Tab-Delimited Data in Python with csv

I had a head-slapper this morning when i realized i’d been using custom code for a long time to do something that’s in a standard Python module. Here’s the sorry tale, in hopes of saving others from a similar fate.

I regularly use tab-delimited files for data wrangling: it’s a nice, lightweight format for table-structured data, and Excel makes a good enough editor for non-programmers to change things without messing up the format. Here’s a simple example, with a set of identifiers in the first column: a typical use case would be that somebody is editing the second column so you can map old identifiers to new ones.

Old New
Aphek1 AphekOfAsher
Aphek2 AphekOfSharon
Aphek3 AphekOfAram

It’s also very easy to read and write this kind of data in Python:

for row in open('somefile.txt', 'rb'):
    # strip the line ending so it doesn't stick to the last column
    old, new = row.rstrip('\r\n').split('\t')
    # do something useful here

So i have a little utility reader module doing only a little more than this, stripping out comment lines, returning a list or a dict, etc., and i use this code all over the place. Then i recently needed to read some CSV (comma separated values) files, and stopped to ask The Question, which every programmer should ask before writing new code:

Hasn’t somebody else solved this problem already?

In the case of reading and writing CSV files, the answer was a quick and clear “yes”: there’s a standard Python module called csv that does just that, and nicely. So, reformatting the earlier data example as CSV would look like this:

"Old","New"
"Aphek1","AphekOfAsher"
"Aphek2","AphekOfSharon"
"Aphek3","AphekOfAram"

and there’s a nice DictReader class that (assuming your columns are unique and your first row identifies them) makes working with this data even easier.

import csv
reader = csv.DictReader(open('somefile.csv', 'rb'))
for row in reader:
    # do something more useful here
    # keys come from the header row and are case-sensitive: 'New', not 'new'
    print row.get('New')

If the first row doesn’t contain column headers, you can supply them to DictReader. This looks like overkill for this simple problem, but once you have multiple columns, need to check values or map them onto something else, or add other logic and processing, life is just much easier with a dictionary structure (for one thing, you get rid of meaningless mystery indexes and stop asking “what the heck is in row[1]”?).

Now comes the embarrassing part: i quickly breezed through the documentation, accomplished my immediate task, and moved on, missing one important detail that i just now (a month later!) figured out. Tab-delimited files are just a special case of a CSV file. My original, tab-delimited file works just the same way, once i construct the reader with tabs (rather than the default of commas) as the delimiter.

import csv
reader = csv.DictReader(open('somefile.txt', 'rb'), delimiter='\t')
for row in reader:
    # do something more useful here
    # keys come from the header row and are case-sensitive: 'New', not 'new'
    print row.get('New')

There are a few other gotchas, the most important of which for me is that csv (at least in the Python 2 used here) doesn’t handle Unicode. So if you have to read Unicode data, you’re back to reading the data directly, splitting lines on tabs, etc.
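
For comparison, in Python 3 the csv module reads from text-mode files, so Unicode data goes through the same reader. A minimal sketch, assuming UTF-8 input; the function name is made up, and an in-memory string stands in for the file so the example is self-contained.

```python
import csv
import io


def read_tab_delimited(stream):
    """Return the rows of a tab-delimited text stream as a list of dicts."""
    return list(csv.DictReader(stream, delimiter="\t"))


# A small in-memory stand-in for a file; with a real file you would use
# open('somefile.txt', encoding='utf-8', newline='') instead.
sample = io.StringIO(
    "Old\tNew\nAphek1\tAphekOfAsher\nAphek2\tAphekOfSharon\n", newline=""
)
rows = read_tab_delimited(sample)
print(rows[0]["New"])  # prints AphekOfAsher
```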

The best code is usually the code you didn’t write and don’t have to maintain. No matter how many times i stop and ask The Question, i still don’t do it enough.

May 22nd, 2009

The Most Important Verses? It Depends What You Mean

The title of this post is a deliberate take-off from a recent post at OpenBible.info entitled “What Are the Most Popular Verses in the Bible? It Depends Whom You Ask”. That post combines data from an earlier ESV analysis of search results, TopVerses.com, a BibleGateway (internal) study, and OpenBible data to present a list of 278 verses, each of which occurs in at least one source’s “top 100” list. It’s interesting to see both how much disparity there is (only 13% occur in at least three of the four lists) and how uneven the distribution is. As one commenter points out, it’s somewhat surprising that there are no verses from Revelation, and Old Testament narrative in particular is largely absent except for Genesis. John’s gospel has about as many popular verses as all the other gospels combined: there are only four verses from Mark (two of them from the often-questioned ending). Less surprisingly, perhaps, there are none from the shortest NT books (Philemon, Jude, 2-3 John). Altogether it’s an interesting study.

The larger question this raises for me is how we might come up with a comprehensive, global score for verses to indicate their importance for a variety of purposes. As the OpenBible post suggests, this depends not only on what the source of the data is, but also on what your purpose is and what you mean by “important” (which is certainly different from “popular”, though not completely unrelated).

One useful purpose is ranking verses to present them in response to searches: TopVerses.com is explicitly organized this way, as indicated in this news article about the site. They don’t go into much detail about how they gathered their data, though the scope (37M references scoured from the web) is impressive. But there’s a subtle disparity here: their data is based on counting mentions (citations) in published web pages, but their use case is prioritizing search results, and these may be out of sync. The fact that a given verse is frequently published on the web doesn’t necessarily mean it’s the one you want at the top of the list when you’re doing a word-based search, for example. The other three sources seem perhaps better matched to ranking search results, since they’re derived from searches themselves.

Another key hitch in these endeavors is how to handle range references, both in processing source data and (for search purposes) in handling queries. For example, many Bible dictionaries frequently reference ranges of verses, sometimes extensive, multi-chapter ones. If you’re going to count these, you need to think carefully about how you do the counting so you don’t introduce bias (or, better, you select the bias that’s best suited to your purposes).

For example, in the TopVerses.com ranking John 3.1 is #26, despite the rather plain descriptive content with little obvious spiritual impact.

Now there was a Pharisee, a man named Nicodemus who was a member of the Jewish ruling council. (John 3.1, NIV)

While i can’t be sure, i strongly suspect this high rank is an unintended consequence of dis-aggregating ranges and whole chapter references from John 3. In fact, scanning top verses by chapter from John, the first verse in each chapter is very often the highest or second-highest ranked, and nearly always among the top ten. This probably says more about the counting methodology than the significance of those verses in particular. The Bible Gateway study focuses on ranges of no more than three verses to explicitly mitigate this problem.
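
A toy example shows how much the counting rule matters. This Python 3 sketch compares two policies for a range citation: giving every verse in the range full credit, or sharing one citation equally among its verses. Both policies and the sample citations are assumptions for illustration, not a description of how any of the sites above actually count.

```python
from collections import Counter


def count_citations(citations, fractional=False):
    """Tally verse counts from (chapter, first_verse, last_verse) citations.

    With fractional=False every verse in a range gets full credit; with
    fractional=True the single citation is shared equally among its verses.
    """
    counts = Counter()
    for chapter, first, last in citations:
        verses = range(first, last + 1)
        credit = 1.0 / len(verses) if fractional else 1.0
        for verse in verses:
            counts["%s:%d" % (chapter, verse)] += credit
    return counts


# Hypothetical citations: John 3:16 cited twice on its own, plus two ranges
# that both happen to start at John 3:1.
cites = [("John 3", 16, 16), ("John 3", 16, 16),
         ("John 3", 1, 4), ("John 3", 1, 2)]
full = count_citations(cites)
shared = count_citations(cites, fractional=True)
```

With full credit, John 3:1 ties John 3:16 at 2 apiece, though nobody cited verse 1 for its own sake; with shared credit it drops to 0.75 while John 3:16 stays at 2.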

Other Measures of Importance

Moving from popularity to importance, i can imagine several different factors that might be combined to produce a more general importance score:

  • citation frequency (based on some corpus). In the TopVerses.com approach, these are web pages, which provides a very large set of observations. A number of other digital text collections would also suit this purpose, and even allow segmentation by genre: for example, you get a very different ranking from the Anchor Yale Bible Dictionary compared to Easton’s (and neither has John 3.16 at the top of the list). See below for more about this.
  • search frequency, the basis for the other three sources in the OpenBible.info post. This could be refined further given data on follow-up activities. For example, depending on your application, verse searches whose results are then expanded into a chapter view or followed to the next verse might get a boost compared to those with no further action (this seems like a variant of “click through” rates used in search engine advertising).
  • content analysis (context-independent): this could have several different flavors.
    • word count: though John 11:35 gets mentioned more than you’d expect precisely because it’s the shortest verse in the (English) Bible, in general longer verses are more likely to be important. This could be refined further given a metric for important words (but now we’ve introduced a new problem: where does that data come from?), which could be used for weighting the counts.
    • We could do even better if, instead of counting words, we count concepts (and weight them). Assuming we think the concept of HUMILITY is important, we’d want verses expressing that concept to rank more highly, regardless of whether they used a more common word like “humility”, or a less common one like “lowly”. Converting words to concepts is a difficult challenge, however.
    • Connections to other data also affect importance. In some sense, every verse that reports words of Jesus is probably more important to a Christian than one whose importance is otherwise comparable, which is why we have the convention of printing Bibles with the words of Christ in red (a binary system for visualizing importance).
    • We might even consider negative factors: a lower rank for unfamiliar, hard-to-pronounce names, or “taboo” words.

Unlike TopVerses.com, i don’t see a particular need to provide a unique rank for each verse. If each verse has a score (to simplify the math, a decimal between 0 and 1 is a common approach), you can simply pick the top n verses that fit your purpose, and then order any ties canonically.
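
Here’s one way that might look, as a minimal Python 3 sketch: a weighted average of per-factor scores, followed by a top-n pick with ties ordered canonically. The factor names, weights, and scores are all hypothetical, and a weighted average is just one simple way to combine factors.

```python
def importance(factors, weights):
    """Combine per-factor scores (each 0..1) into one weighted score in 0..1."""
    total = sum(weights.values())
    return sum(weights[name] * factors.get(name, 0.0) for name in weights) / total


def top_n(scores, n, canonical_order):
    """Pick the n highest-scoring verses, ordering ties by canonical position."""
    position = {ref: i for i, ref in enumerate(canonical_order)}
    ranked = sorted(scores, key=lambda ref: (-scores[ref], position[ref]))
    return ranked[:n]


# Hypothetical weights and factor scores for a single verse.
weights = {"citation": 2, "search": 1, "words_of_jesus": 1}
score = importance({"citation": 0.5, "search": 1.0}, weights)  # 0.5
```

Because a missing factor counts as zero, a verse with no data for some factor is penalized rather than ignored; whether that’s the right bias depends (again) on your purpose.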

Comparing Dictionary Reference Citations

I did a small experiment to compare the most frequent reference citations in seven Bible dictionaries that are incorporated in Logos’s software (so this is citation frequency, not search frequency). I extracted and counted all the references, and then aggregated the counts across all seven: the top 20 references are shown below, along with how many “votes” they received in the OpenBible.info list. In the case of whole chapter references (four of the top ten), i’ve indicated with yes/no whether any verse from that chapter occurs in the OpenBible list.

There’s relatively little overlap between the two lists: only seven of these are in the OpenBible list. Many of these make sense given the different purposes of reference works: for example, Is 61.1 is a key messianic text. The high rank for 2 Ki 15.29 is initially puzzling, but probably results from being commonly cited in discussions of the conquests of Tiglath-Pileser and the Babylonian exile. Overall, this is probably much too small a sample to show the correspondences: i presume we’d find much more overlap in the top few hundred.

Reference     Aggregate Count   Count in OpenBible List
Jn 1:14       169.5             1
2 Ki 15:29    165.2             0
Is 61:1       159.8             0
Ac 1:13       151.7             0
Ge 1          150.0             yes
Ac 15         143.0             no
Ge 2:7        142.3             no
Ge 46:21      139.3             no
Jn 3:16       137.8             4
Ge 1:26       135.2             3
Is 7:14       134.3             1
Mt 28:19      130.2             3
Da 7:13       130.0             0
Ps 2:7        129.8             0
1 Pe 2:9      126.3             0
Ac 20:4       124.3             0
Lk 3:1        123.8             0
Mk 10:45      123.7             0
1 Sa 1:1      121.5             0
Ac 1:8        120.8             3

Conclusions

None of this is meant as criticism of the particular sites mentioned above. I strongly believe that any user-oriented, empirically-based data set is better than nothing, and in most endeavors like this, “the best data is more data”. * But with more data comes more complexity, and i’ve only scratched the surface here in considering some of the different factors.

The key point is this: if we want to measure something, we need to be clear up front about exactly what it is, and also what purpose we hope it will serve. I never stop being amazed at how often “obvious” approaches to data problems produce surprising results.


* In my recollection, this quote is attributed to Bob Mercer, a leading researcher in statistical language processing who was part of the IBM research group in the 1990s. I haven’t been able to verify a real source, however.

March 5th, 2009

Using GATE for Simple Text Mining

The Punditry Propagation Principle (P3)

If you say enough, long enough and loud enough, people start to believe you know something.

In accordance with the P3, people occasionally ask me things, apparently because they’ve been fooled into thinking i know something. But these conversations sometimes produce useful blog posts, so it’s not a total loss (for me, i mean: i can’t say for them).

In yesterday’s example of P3, an acquaintance of a Logos colleague had read some press release about my hiring, thought i knew something, and wanted to talk with me about a task he was performing for a client. He was charged with using Word to search documents for keywords from a particular subject domain, looking for occurrences of some information of interest. Then he would mark those instances with some special code, and in a subsequent second pass, somebody else with more domain expertise would go through, find the special codes, and pull out some bits of information. These would then get pasted into an Excel spreadsheet.

While the technically sophisticated might scoff at this low-tech approach to the problem of information extraction, I have no doubt that there are people all over the business world doing similar things. There’s simply too much information locked up in prose inside documents, created with no thought about how you might later extract structured data from them, and only the crudest of tools are easily available to help with the task.

Well, in this case, i actually did know something about the subject: i was an active researcher in the field of information extraction for quite a few years at BBN Technologies. We chatted for a while, and i gave him a number of caveats as to why this approach might be too heavyweight or otherwise inappropriate for his task, and then recommended he check out the General Architecture for Text Engineering (GATE), developed over several years by natural language processing researchers at the University of Sheffield. While GATE is not the most sophisticated of information extraction tools, it has several features to recommend it:

  • it provides important basic capabilities right out of the box, including I/O in standard formats and integration of a number of other useful tools
  • it includes a visual development interface
  • best of all, it’s open-source and freely available for Linux, Windows, and Mac OS X

I’m not going to provide a tutorial for using GATE (they’ve got plenty of documentation available already), but here’s a high-level overview of the steps, to help you decide if GATE might be a good fit for your task.

  1. Download and install. There are a lot of different pieces, most of which you don’t need for the simplest tasks.
  2. Get the data you want to process in some structured form. GATE can process XML, HTML, SGML, email, and plain text.
  3. Define a new Language Resource for your document. At this point you can also view it in the GUI, and if there were existing XML annotations, you can view them.
  4. Select one or more existing Processing Resources: out of the box, there are sentence splitters, tokenizers, and even ANNIE, a basic information extraction system.
  5. Create an Application consisting of one or more processing steps, select your document, and run it.
  6. Now when you go back to view your document, you’ll see additional annotations have been added.
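
If the vocabulary above seems abstract, the underlying idea is simple: a document, a sequence of processing steps, and annotations that accumulate as each step runs. The toy Python 3 sketch below illustrates only that idea. It is emphatically not GATE’s API (GATE is Java); every name in it is made up, and the sample sentence and name list are placeholders.

```python
import re


def tokenizer(doc):
    """Add a Token annotation (type, start, end) for every run of word characters."""
    for m in re.finditer(r"\w+", doc["text"]):
        doc["annotations"].append(("Token", m.start(), m.end()))


def make_name_tagger(names):
    """Build a step that adds a Name annotation for each token found in names."""
    def name_tagger(doc):
        tokens = [a for a in doc["annotations"] if a[0] == "Token"]
        for _, start, end in tokens:
            if doc["text"][start:end] in names:
                doc["annotations"].append(("Name", start, end))
    return name_tagger


def run_pipeline(text, steps):
    """Create a document, run each processing step in order, return the result."""
    doc = {"text": text, "annotations": []}
    for step in steps:
        step(doc)
    return doc


# Placeholder input: a made-up sentence and a two-entry name list.
doc = run_pipeline("Greet Prisca and Aquila.",
                   [tokenizer, make_name_tagger({"Prisca", "Aquila"})])
```

Note that the name tagger depends on the tokenizer having run first: order matters in a pipeline like this, which is one reason assembling the steps into an explicit application is useful.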

You can also use GATE as a manual annotation tool, and save the results in various formats. GATE’s written in Java and includes a plug-in architecture: so Java programmers can add new capabilities.

This screen shot shows a portion of Romans 16 from the World English Bible (one of the few modern Bible texts that’s freely available in XML form), annotated (imperfectly) using names from the Bible Knowledgebase (salmon), and highlighting the original notes (green).

GATE processing example

If any of you are still with me, let me emphasize this caveat: GATE is not a tool for casual use, and you should bring some technical expertise along with the expectation that you’ll have to invest a fair amount of time in figuring out how things work. But if these are the kinds of capabilities you need, it may make a lot more sense to start from GATE than to reinvent them yourself.

Postlude

This whole experience was a good example for me of the Principle of Reciprocity that Jesus teaches in Mark 4:24-25.

“Consider carefully what you hear,” he continued. “With the measure you use, it will be measured to you—and even more. Whoever has will be given more; whoever does not have, even what he has will be taken from him.”

I didn’t expect this initial conversation to produce anything useful for me: i just did it to help a friend of a friend. But it wound up “giving me more”, not only this blog post, but renewing my interest in GATE as a possible tool for related future work.