If my blog had aural feedback, no doubt i’d hear a few snickers for saying this out loud, but here goes:

i use Emacs.

There, i said it. With all the tightly integrated development environments available these days (Visual Studio, Eclipse, etc.), you may wonder why anybody would use such an old-school tool. In fact, i’ve been using Emacs in various flavors for more than 20 years now: this request for an Emacs mode for an exotic programming language called Icon might even be my oldest extant trace on the Internet (though we didn’t call it that back then, kids). I’m pretty stale now, but at one point i considered Emacs Lisp one of the programming languages i’d put on a resume (if that sentence doesn’t make sense to you, it shows that you don’t know enough about Emacs to snicker at it).

Sure, it’s got a steep learning curve, it’s really geeky, and it’s not the hammer for every nail. I don’t write UI tools in it anymore, though for a while it was a pretty good choice for that. But there are still things i can do easily in Emacs that i don’t know how to do elsewhere without a lot more work: that’s one definition of what makes a good tool.

One use case i encounter a lot when groveling through data is progressive refinement. Typically that means a large data set (thousands or more), where i need several steps to filter out certain values (that i don’t know in advance: that’s one reason an editing environment is a good choice). For example, my current task is finding funky Unicode characters encoded as XML character entities, and replacing them with ASCII equivalents (i know that’s not good form, but for this particular string-matching task, it’s good enough). I’ve got a few tens of thousands of lines of data, and i want to find all the different &#-encoded values so i can create a mapping table.
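
To make that concrete, the mapping table i’m building is just a list of entity-to-ASCII pairs. A minimal Emacs Lisp sketch of its shape (the variable name is mine, and every entry except ’ is only illustrative):

    ;; Each XML numeric character reference found in the data, paired with
    ;; an ASCII stand-in.  Only ’ comes from the real data; the other
    ;; entries are made up to show the shape of the table.
    (defvar my-entity-map
      '(("’" . "'")     ; right single quotation mark -> apostrophe
        ("“" . "\"")    ; left double quotation mark  -> straight quote
        ("”" . "\""))   ; right double quotation mark -> straight quote
      "XML character references mapped to ASCII replacements.")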

A simple pattern match for &# returns some 800 hits, and i don’t really want to look through all of them (particularly when there’s a typical 80/20 distribution: i’ll get the major ones, but miss some long-tail cases once i start scanning quickly). So here’s the easy trick in Emacs i find myself using a lot:

  1. M-x occur creates another buffer (called *Occur*) with all the lines that match a given regexp (i just use &#)
  2. I scan the first page of results to see what looks like the most common value (in this case, ’, equivalent to an apostrophe), and add it to my map
  3. Here’s where it gets cool: the *Occur* buffer is a filtered view of my data, and i can work on it directly (once i toggle its read-only status, a minor annoyance). So i switch to the *Occur* buffer, and then do M-x flush-lines for the value i just captured. This removes all the lines matching that case (about 400 of them for this first example), without damaging my original data (i’m in a different buffer).
  4. I go back to step 2 for a new value and repeat. Each time i’m capturing some large percentage of my data and then excluding that value from further consideration, getting a narrower and narrower view.
  5. Eventually the view is narrowed down to a dozen or two lines; i capture any remaining cases (all now in plain view), and i’m done. (There’s a rough Lisp sketch of this whole loop just after the list.)
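
For the curious, here’s roughly what that loop amounts to if you spell it out in Emacs Lisp, assuming the data is in the current buffer (the function name and arguments are mine; occur and flush-lines are the stock commands):

    ;; Rough, non-interactive sketch of the loop above.  REGEXP is the
    ;; broad pattern (&# here) and HANDLED is the list of values already
    ;; captured in the map.
    (defun my-occur-refine (regexp handled)
      "Run `occur' for REGEXP, then flush lines matching anything in HANDLED."
      (occur regexp)
      (with-current-buffer "*Occur*"
        (read-only-mode -1)                    ; step 3's minor annoyance
        (dolist (val handled)
          (goto-char (point-min))
          (flush-lines (regexp-quote val)))))  ; narrow the view further

    ;; e.g. (my-occur-refine "&#" '("’"))
    ;; leaves *Occur* showing only the cases not yet handled.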

This is completely interactive, the possibilities are always in plain sight so i can make decisions about where to go next, and i don’t have to go hunting around. If i make a mistake, i can undo, or just back up and start over. And the values are right there for easy cut-and-paste into another buffer where i’m writing my code. (Caveats: this approach really only works with line-oriented data, multiple matches per line make it more complicated, and of course you need to figure out suitable regexps.) Most of the time, i find that 10 or so cycles are enough to find all the values i care about, out of an original set of thousands.
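
Once the table feels complete, the replacement pass itself is short. Another sketch, reusing the made-up my-entity-map from earlier:

    ;; Walk the map and replace each character reference in the data buffer
    ;; with its ASCII stand-in (literal string matches, no regexps needed).
    (defun my-replace-entities ()
      "Replace each mapped character reference with its ASCII equivalent."
      (interactive)
      (dolist (pair my-entity-map)
        (goto-char (point-min))
        (while (search-forward (car pair) nil t)
          (replace-match (cdr pair) t t))))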

Can your editor do that?