Using GATE for Simple Text Mining

The Punditry Propagation Principle (P3)

If you say enough, long enough and loud enough, people start to believe you know something.

In accordance with the P3, people occasionally ask me things, apparently because they’ve been fooled into thinking i know something. But these conversations sometimes produce useful blog posts, so it’s not a total loss (for me, i mean: i can’t say for them).

In yesterday’s example of P3, an acquaintance of a Logos colleague had read some press release about my hiring, thought i knew something, and wanted to talk with me about a task he was performing for a client. He was charged with using Word to search documents for keywords from a particular subject domain, looking for occurrences of some information of interest. Then he would mark those instances with some special code, and in a subsequent second pass, somebody else with more domain expertise would go through, find the special codes, and pull out some bits of information. These would then get pasted into an Excel spreadsheet.

While the technically sophisticated might scoff at this low-tech approach to the problem of information extraction, I have no doubt that there are people all over the business world doing similar things. There’s simply too much information locked up in prose inside documents, created with no thought about how you might later extract structured data from them, and only the crudest of tools are easily available to help with the task.
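To make the two-pass workflow concrete, here is a minimal sketch of the first pass as a script rather than a Word search: find each keyword, record where it occurs along with a little surrounding context, and write the hits out as CSV that Excel can open. The keyword list, the sample sentence, and the function names are all invented for illustration; the actual task almost certainly had details this glosses over.

```python
import csv
import io
import re

# Hypothetical domain keywords; a real list would come from a subject-matter expert.
KEYWORDS = ["warranty", "indemnity"]


def find_keyword_hits(text, keywords, window=30):
    """Return one record per keyword occurrence, with some surrounding context."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        start, end = match.span()
        context = text[max(0, start - window):end + window]
        hits.append({
            "keyword": match.group(1).lower(),
            "offset": start,
            # Collapse runs of whitespace so each hit fits on one spreadsheet row.
            "context": " ".join(context.split()),
        })
    return hits


def hits_to_csv(hits):
    """Serialize the hits as CSV text, ready to be saved and opened in Excel."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["keyword", "offset", "context"])
    writer.writeheader()
    writer.writerows(hits)
    return buffer.getvalue()


# Made-up sample text standing in for a real document.
sample = "The seller provides a limited Warranty. No indemnity is offered."
hits = find_keyword_hits(sample, KEYWORDS)
print(hits_to_csv(hits))
```

A script like this only replaces the keyword-marking pass. Pulling the right bits of information out of each hit still takes the domain expert, or a real information extraction tool.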

Well, in this case, i actually did know something about the subject: i was an active researcher in the field of information extraction for quite a few years at BBN Technologies. We chatted for a while, and i gave him a number of caveats as to why this approach might be too heavyweight or otherwise inappropriate for his task, and then recommended he check out the General Architecture for Text Engineering (GATE), developed over several years by natural language processing researchers at the University of Sheffield. While GATE is not the most sophisticated of information extraction tools, it has several features to recommend it:

  • it provides important basic capabilities right out of the box, including I/O in standard formats and integration of a number of other useful tools
  • it includes a visual development interface
  • best of all, it’s open-source and freely available for Linux, Windows, and Mac OS X

I’m not going to provide a tutorial for using GATE (they’ve got plenty of documentation available already), but here’s a high-level overview of the steps, to help you decide if GATE might be a good fit for your task.

  1. Download and install. There are a lot of different pieces, most of which you don’t need for the simplest tasks.
  2. Get the data you want to process in some structured form. GATE can process XML, HTML, SGML, email, and plain text.
  3. Define a new Language Resource for your document. At this point you can also view it in the GUI, and if there were existing XML annotations, you can view them too.
  4. Select one or more existing Processing Resources: out of the box, there are sentence splitters, tokenizers, and even ANNIE, a basic information extraction system.
  5. Create an Application consisting of one or more processing steps, select your document, and run it.
  6. Now when you go back to view your document, you’ll see additional annotations have been added.
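The shape of that workflow (a document, some processing steps, an application that runs them in order, and annotations added along the way) can be sketched in a few lines of Python. To be clear, this is not GATE's API, which is Java and far richer. Every class and function name below is hypothetical, the name list is a tiny stand-in for a real gazetteer, and the sample text is made up, loosely echoing the greetings in Romans 16.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A stand-off annotation: a label plus character offsets into the text."""
    kind: str
    start: int
    end: int


@dataclass
class Document:
    """Plays the role of a language resource: text plus accumulated annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def spans(self, kind):
        """Return the text covered by every annotation of the given kind."""
        return [self.text[a.start:a.end] for a in self.annotations if a.kind == kind]


def tokenizer(doc):
    """Naive tokenizer: every run of word characters becomes a Token."""
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("Token", m.start(), m.end()))


def sentence_splitter(doc):
    """Naive splitter: a sentence runs up to the next '.', '!' or '?'."""
    for m in re.finditer(r"\S[^.!?]*[.!?]?", doc.text):
        doc.annotations.append(Annotation("Sentence", m.start(), m.end()))


def make_gazetteer(names, kind="Person"):
    """Build a processing step that annotates exact matches from a name list."""
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")

    def gazetteer(doc):
        for m in pattern.finditer(doc.text):
            doc.annotations.append(Annotation(kind, m.start(), m.end()))

    return gazetteer


def run_application(doc, steps):
    """Run each processing step over the document, in order."""
    for step in steps:
        step(doc)
    return doc


# Made-up sample text and a hypothetical three-name list.
doc = Document("Greet Prisca and Aquila. Greet Mary, who labored much for us.")
steps = [tokenizer, sentence_splitter, make_gazetteer(["Prisca", "Aquila", "Mary"])]
run_application(doc, steps)
print(doc.spans("Person"))
print(len(doc.spans("Sentence")))
```

Keeping annotations as offsets alongside the text, rather than editing the text itself, is what lets several independent steps pile their results onto the same document without stepping on each other.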

You can also use GATE as a manual annotation tool, and save the results in various formats. GATE’s written in Java and includes a plug-in architecture, so Java programmers can add new capabilities.

This screen shot shows a portion of Romans 16 from the World English Bible (one of the few modern Bible texts that’s freely available in XML form), annotated (imperfectly) using names from the Bible Knowledgebase (salmon), and highlighting the original notes (green).

GATE processing example

If any of you are still with me, let me emphasize this caveat: GATE is not a tool for casual use, and you should bring some technical expertise along with the expectation that you’ll have to invest a fair amount of time in figuring out how things work. But if these are the kinds of capabilities you need, it may make a lot more sense to start from GATE than to reinvent them yourself.

Postlude

This whole experience was a good example for me of the Principle of Reciprocity that Jesus teaches in Mark 4:24-25.

“Consider carefully what you hear,” he continued. “With the measure you use, it will be measured to you—and even more. Whoever has will be given more; whoever does not have, even what he has will be taken from him.”

I didn’t expect this initial conversation to produce anything useful for me: i just did it to help a friend of a friend. But it wound up “giving me more”: not only this blog post, but also a renewed interest in GATE as a possible tool for related future work.