In Praise of Python

In computer science, you have to learn new languages and frameworks on a regular basis, because the field changes so quickly. I’ve learned my share of languages over the years, well over a dozen last time i counted (most of which i don’t use anymore). Perl was the last one that i really invested time in at BBN, and it was my main language for scripting and data processing for the last half dozen years or so (as a manager, this was the only kind of programming i could get away with :-)).

A couple of years ago, i made the decision that my next language would be Python. My reasoning was based on a pretty simple kind of social networking: over and over, the bloggers i read, colleagues i respected, and projects of interest i discovered kept talking about Python. I figured if this many smart people were using it, there must be a good reason. When i started work at Logos this year, i had more time to focus on programming, and finally got to make good on my intention. I haven’t been sorry, and having done a lot of Python coding over the last 6 months (the only way you really learn a language), i’m really loving the language.

Here are some of the things i’m finding that are great about Python:

  • Interactive evaluation: how many times have you embedded print statements inside your Perl code so you can figure out what’s going on? Python lets you short-circuit that because you can evaluate expressions directly, inspect the results, and work on it until you get it right. It has the side-benefit of encouraging modularization of your code, simply because that makes it easier to test interactively. For me, interactive evaluation provides an enormous productivity boost.
  • List processing: I’m not embarrassed to admit that i still think Lisp is one of the best languages i’ve ever coded in. When i did a little recreational programming last year, i went back to my Lisp roots, because some problems just cry out for list-based solutions. Perl has lists, lists of lists, etc. but there’s just enough friction in working with them that it always feels hard to me. Python brings back a rich list-based environment with mapping, filtering, and other useful features. Though some people find lambda functions intimidating (the name doesn’t help), they’re a very powerful feature. Python also has set operations (intersection, union, etc.) which are very helpful for data cleanup.
  • Introspection: you can ask the environment what objects it knows about, and you can ask an object what its methods and attributes are. This enables powerful kinds of meta-programming capabilities.
  • Django, a web application framework (like the popular Ruby on Rails) that makes it very easy to build rich, data-driven, web-based systems. I’ve been using Django to build a thesaurus development interface for in-house use, and i strongly recommend it (i hope to start a side project soon using Django for publishing genealogy information).
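
As a rough illustration of the list processing, set operations, and introspection mentioned in the list above, here is a minimal sketch. The word lists are invented for the example; everything else is standard Python.

```python
# Hypothetical word lists, invented for this sketch.
old_terms = ["grace", "faith", "hope", "Faith", " mercy "]
new_terms = ["faith", "hope", "love"]

# Mapping: normalize every term (strip whitespace, lowercase).
cleaned = [t.strip().lower() for t in old_terms]

# Filtering with a lambda: keep only terms of four letters or fewer.
short = list(filter(lambda t: len(t) <= 4, cleaned))

# Set operations, handy for data cleanup.
old_set, new_set = set(cleaned), set(new_terms)
shared = old_set & new_set        # intersection
only_old = old_set - new_set      # difference
everything = old_set | new_set    # union

# Introspection: ask the list type which public methods it has.
list_methods = [name for name in dir(list) if not name.startswith("_")]
```

Typed line by line at the interactive prompt, each of these names can be inspected right away, which is the workflow the first bullet describes.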

I’m not trolling for flamewars here (i’ve managed to do so inadvertently in the past), and i still like Perl. But from now on, i’m a Python guy.

Other reading:

  • JoelOnSoftware has a characteristically insightful post about why the question “what’s the best language to use?” isn’t really meaningful. He points out the value of the language’s ecosystem as a key criterion: that was one reason Python made it to the top of my list.
  • I’ve been playing with the Natural Language Toolkit, which is written in Python. It’s still in flux, but has a lot of interesting capabilities, including a good WordNet interface.

Can Your Editor Do This?

If my blog had aural feedback, no doubt i’d hear a few snickers for saying this out loud, but here goes:

i use Emacs.

There, i said it. With all the tightly integrated development environments available these days (Visual Studio, Eclipse, etc.), you may wonder why anybody would use such an old-school tool. In fact, i’ve been using Emacs in various flavors for more than 20 years now: this request for an Emacs mode for an exotic programming language called Icon might even be my oldest extant trace on the Internet (though we didn’t call it that back then, kids). I’m pretty stale now, but at one point i considered Emacs Lisp one of the programming languages i’d put on a resume (if that sentence doesn’t make sense to you, it shows that you don’t know enough about Emacs to snicker at it).

Sure, it’s got a steep learning curve, it’s really geeky, and it’s not the hammer for every nail. I don’t write UI tools in it anymore, though for a while it was a pretty good choice for that. But there are still things i can do easily in Emacs that i don’t know how to do elsewhere without a lot more work: that’s one definition of what makes a good tool.

One use case i encounter a lot when groveling through data is progressive refinement. Typically that means a large data set (thousands of items or more), where i need several steps to filter out certain values (that i don’t know in advance: that’s one reason an editing environment is a good choice). For example, my current task is finding funky Unicode characters encoded as XML character entities, and replacing them with ASCII equivalents (i know that’s not good form, but for this particular string-matching task, it’s good enough). I’ve got a few tens of thousands of lines of data, and i want to find all the different &#-encoded values so i can create a mapping table.

A simple pattern match for &# returns some 800 hits, and i don’t really want to look through all of them (particularly when there’s a typical 80/20 distribution: i’ll get the major ones, but miss some long-tail cases once i start scanning quickly). So here’s the easy trick in Emacs i find myself using a lot:

  1. M-x occur creates another buffer (called *Occur*) with all the lines that match a given regexp (i just use &#)
  2. I scan the first page of results to see what looks like the most common value (in this case, the entity for ’, equivalent to an apostrophe), and add it to my map
  3. Here’s where it gets cool: the *Occur* buffer is a filtered view of my data, and i can work on it directly (once i toggle its read-only status, a minor annoyance). So i switch to the *Occur* buffer, and then do M-x flush-lines for the value i just captured. This removes all the lines matching that case (about 400 of them for this first example), without damaging my original data (i’m in a different buffer).
  4. I go back to step 2 for a new value and repeat. Each time i’m capturing some large percentage of my data and then excluding that value from further consideration, getting a narrower and narrower view.
  5. At some point the view is narrowed down to a dozen or two lines, at which point i capture any remaining cases (all now in plain view), and i’m done.

This is completely interactive, the possibilities are always in plain sight so i can make decisions about where to go next, and i don’t have to go hunting around. If i make a mistake, i can undo, or just back up and start over. And the values are right there for easy cut-and-paste into another buffer where i’m writing my code. (Caveats: this approach really only works with line-oriented data, multiple matches per line make it more complicated, and of course you need to figure out suitable regexps.) Most of the time, i find about 10 or so cycles is enough for me to find all the values i care about, out of an original set of thousands.
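
For readers without Emacs at hand, the same narrowing loop can be sketched non-interactively in Python. This is only an approximation under stated assumptions: the `ENTITY` pattern, the helper names, and the sample lines are all invented here, and `flush_lines` matches literal text where the Emacs command takes a regexp.

```python
import re
from collections import Counter

# Assumed pattern for numeric character entities such as &#8217; or &#x2019;.
ENTITY = re.compile(r"&#[xX]?[0-9a-fA-F]+;")

def occur(lines, pattern=ENTITY):
    """Like M-x occur: keep only the lines that match the pattern."""
    return [line for line in lines if pattern.search(line)]

def flush_lines(lines, literal):
    """Like M-x flush-lines, but for literal text: drop lines containing it."""
    return [line for line in lines if literal not in line]

def most_common_entity(lines, pattern=ENTITY):
    """Return the most frequent entity in the current view, or None."""
    counts = Counter(m for line in lines for m in pattern.findall(line))
    return counts.most_common(1)[0][0] if counts else None

def refine(lines):
    """Repeat steps 2 to 4 until the view is empty; return entities as found."""
    view = occur(lines)
    found = []
    while view:
        entity = most_common_entity(view)
        found.append(entity)               # step 2: add it to the map
        view = flush_lines(view, entity)   # step 3: narrow the view
    return found

# Hypothetical sample data, one entity per line.
sample = [
    "It&#8217;s here",
    "plain ASCII line",
    "don&#8217;t",
    "caf&#233;",
    "na&#239;ve",
]
entities = refine(sample)
```

The same caveat applies as in the editor: a line holding several different entities disappears at the first flush, so a later entity on that line can be missed. What the script cannot reproduce is the eyeballing between steps, which is the real point of doing this in a buffer.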

Can your editor do that?

Vista and File Search for XML

I don’t intend to make a career out of grumbling about Vista, i just keep bumping my head. Today it was the search box. To their credit, Microsoft has attempted to make search ubiquitous (or cut off Google Desktop at the knees) by putting this feature right in Windows Explorer, and when it works (like in the Start Menu), it’s great. But here’s a place it doesn’t work, at least not as you’d expect.

At work, we have lots of XML data. So in a directory with 2000 XML files, i wanted to find the few files with some common content. But it appears that Vista only indexes the character data, not the XML markup, in the files. That means you can’t search on elements, attribute names, or even attribute values like priority="critical": even if you search in file contents for “critical”, Vista won’t find it. If you manually change the file type to something like txt, now that content becomes searchable (but of course i’m not about to do that for 2000 files). So Vista gets points for trying to be smart, but it’s not smart enough. Ironically, better support for XML (at least as an application format) is supposed to be one of the improvements of Vista.
I can already hear people telling me it’s a feature …
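
One workaround that sidesteps the indexer entirely is a short script that parses each file and checks attribute values directly. The following is a minimal sketch, not a replacement for real search: the function name and the throwaway demo files are invented for illustration, and it assumes the files are well-formed XML with a .xml extension.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def files_with_attribute(folder, name, value):
    """Return names of XML files where some element has name="value"."""
    hits = []
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        if any(elem.get(name) == value for elem in root.iter()):
            hits.append(path.name)
    return hits

# Demo on throwaway files (names and contents are invented).
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.xml").write_text(
        '<doc><item priority="critical">x</item></doc>', encoding="utf-8")
    Path(folder, "b.xml").write_text(
        '<doc><item priority="low">critical</item></doc>', encoding="utf-8")
    Path(folder, "c.xml").write_text("<doc><broken>", encoding="utf-8")
    matches = files_with_attribute(folder, "priority", "critical")
```

In the demo, only the first file carries the value in an attribute; the second has it as character data and the third does not parse, so both are left out.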

More about Vista and File Virtualization

There’s quite a bit of additional discussion on my Windows Vista post on this programming reddit thread. A sampling of opinions:

  • there are virtues to this feature (i can appreciate this): some would temper it with a warning or other notification (given how regularly Vista nags me about innocuous things like creating folders, it’s puzzling that creating a virtual file doesn’t merit a peep)
  • it’s bizarre to want to edit things in Program Files (in a perfect world, perhaps, but the reality is, developers sometimes need to change things)
  • Scott Hanselman has a post on the virtualization feature with a lot less whining 🙂 and a lot more technical detail that you may find helpful

One commenter pointed out that McAfee’s SiteAdvisor, which is supposed to warn you about bad sites, thinks SemanticBible is one of them! The basis for that assessment is a link i have somewhere to daml.org, a repository of OWL ontologies and data, which was originally set up as part of the government-funded DAML research program by my former employer, BBN Technologies (DAML is one of the progenitors of OWL, the semantic web language). SiteAdvisor’s claim is that daml.org was “found to be a distributor of downloads some people consider adware, spyware or other unwanted programs.” If you go to their assessment of daml.org, that’s based on a couple of downloadable zip files that are apparently infected with viruses. Given that the DAML site exists to share resources, i guess some of this is inevitable. Not that this excuses the problem: I remember pointing this out to one of the daml.org guys in the past, but apparently they haven’t fixed the problem (and SiteAdvisor doesn’t make it too easy). Anyway, i’m in the process of trying to get my black mark with McAfee removed, but let me assure you that SemanticBible does not knowingly distribute or encourage the distribution of malware.

Windows Vista: a Cautionary Tale

Since i’m the New Guy at work, along with my new box, i’ve had Windows Vista for several weeks now, ahead of all the eager crowds. So far, with the exception of a few nice features like searching from the start menu, i haven’t been too impressed.
My latest adventure started innocently enough, adding a little content to a file with my favorite XML editor so i could see how it got rendered in our app. Curiously, when i looked at the files in the folder, it didn’t look like the time stamp had changed. I figured i’d just modified the wrong one of several different versions, so i retraced my steps: nope, that was the right file. This is weird, i thought, so i re-opened the file from the folder, and all my content was there. I saved it again … no changes to the directory listing.

I fiddled with a few file and folder permissions (initially it had been read-only), but couldn’t get the directory listing to tell me the truth. Puzzled, i tried opening the file with Notepad, and fell headlong into some kind of parallel universe: the changed content was gone, and the file had reverted to its original content. What?!? I went back and opened it again with the original editor: my changes were there, just as i made them. I opened it with Notepad, then Wordpad: my changes were gone.

I’m chagrined to admit that i did this for at least 20 minutes, trying to prove to myself that i wasn’t seeing what i was seeing, and was just making some dumb mistake: what more basic service can an OS provide than showing the attributes of files and delivering their content to applications? I went to our sys admin and said “i think i’m losing my mind: do you want to watch?”. He did, and we tried several more tests. Looking at the file properties, it had a modification time of yesterday, but a creation time of … 10 minutes ago?!? Copying the file (which lived in Program Files with its application) to the desktop made it normal: every editor then told the truth. We opened the file with half a dozen different editors, which lined up neatly into two camps: MS apps (Notepad, Wordpad, MSWord, Visual Studio, even the DOS type command) that lied about the content, and every other editor that told the truth and showed me my changes. We pulled another colleague in to watch the spectacle.

At this point my readers are probably in two camps: some are utterly flummoxed (like i was), and others are chuckling at my naivete or knowingly tsk-tsk’ing, wondering how i could have missed hearing about this wonderful new thing in Vista. It took another half an hour and one of our savvy developers to explain that this was a feature, not a bug. You’re not supposed to edit things in Program Files, and Vista helps you with that by hiding such edits in a parallel universe called Compatibility Files. He showed me the Compatibility Files button that had unobtrusively appeared on the window to alert me to the fact that i’d entered this weird place. Vista-aware apps were playing by the new rules, and non-aware ones (all the normal ones that i actually use every day) were blissfully opening the one i wasn’t supposed to have created in the first place. And by using a cool new Vista feature called symlinks (old Unix hands, it’s your turn to chuckle now), you just create a link to a folder outside Program Files, and then things behave like they’re supposed to.

Well, maybe: but it felt a little like taking a walk in the woods behind your new high-tech house, stepping off the path to pick up a walking stick, and getting attacked by a mechanical bear. After you wrestle the bear to the ground, you find out this bear is designed to keep you and your sticks secure (after all, you could poke your eye out), so the bear was for your own good, and if you use the Stick Supply House instead of just picking up random sticks, everything will be okay. But i’ve gotten used to picking up sticks in the woods, and they’re my woods, after all.

Anyway, you’ve been warned: there are bears in the woods.

Better Programming Through … Less Programming?

Jeff Atwood has an interesting post, starting from some quotes from Bill Gates, on how to become a better programmer by not programming (yes, you read that right). It’s worth reading the whole thing, but here’s my spin on his bottom line: even if you have the skills and aptitude (of course, many don’t), what you need to get to the next level is not just more of the same, but an understanding of users and their problems, and a passion for solutions.
To me, this is just another application of one of Stephen Covey’s Seven Habits, Sharpen the Saw. At some point, more sawing doesn’t make you more productive: you’ve got to step back and think more broadly about what you’re doing and how to make it more productive (which might include going down to the hardware store for one of those new-fangled gas chain saws!).
(Hat tip to Jeremy Zawodny for the link)