I had a head-slapper this morning when i realized i’d been using custom code for a long time to do something that’s in a standard Python module. Here’s the sorry tale, in hopes of saving others from a similar fate.
I regularly use tab-delimited files for data wrangling: it’s a nice, lightweight format for table-structured data, and Excel makes a good enough editor for non-programmers to change things without messing up the format. Here’s a simple example, with a set of identifiers in the first column: a typical use case would be that somebody is editing the second column so you can map old identifiers to new ones.
Old	New
Aphek1	AphekOfAsher
Aphek2	AphekOfSharon
Aphek3	AphekOfAram
It’s also very easy to read and write this kind of data in Python:
for row in open('somefile.txt', 'rb'):
    old, new = row.rstrip('\n').split('\t')
    # do something useful here
So i have a little utility reader module doing only a little more than this, stripping out comment lines, returning a list or a dict, etc., and i use this code all over the place. Then i recently needed to read some CSV (comma-separated values) files, and stopped to ask The Question, which every programmer should ask before writing new code:
Hasn’t somebody else solved this problem already?
In the case of reading and writing CSV files, the answer was a quick and clear “yes”: there’s a standard Python module called csv that does just that, and nicely. So, reformatting the earlier data example as CSV would look like this:

Old,New
Aphek1,AphekOfAsher
Aphek2,AphekOfSharon
Aphek3,AphekOfAram
and there’s a nice DictReader class that (assuming your column names are unique and your first row identifies them) makes working with this data even easier.
import csv
reader = csv.DictReader(open('somefile.csv', 'rb'))
for row in reader:
    # do something more useful here
    print row.get('New')
If the first row doesn’t contain column headers, you can supply them to DictReader. This looks like overkill for such a simple problem, but once you have multiple columns, need to check values or map them onto something else, or add other logic and processing, life is just much easier with a dictionary structure (for one thing, you get rid of meaningless mystery indexes and stop asking “what the heck is in row?”).
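For instance, a headerless version of the earlier data could be read by passing the column names explicitly via the fieldnames parameter (a sketch; the in-memory buffer here just stands in for a real file):

```python
import csv
import io

# Headerless data: the same rows as before, but with no "Old,New" line.
data = io.StringIO("Aphek1,AphekOfAsher\nAphek2,AphekOfSharon\nAphek3,AphekOfAram\n")

# Since the file has no header row, supply the column names ourselves.
reader = csv.DictReader(data, fieldnames=['Old', 'New'])
for row in reader:
    print(row['Old'], '->', row['New'])
```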
Now comes the embarrassing part: i quickly breezed through the documentation, accomplished my immediate task, and moved on, missing one important detail that i just now (a month later!) figured out: tab-delimited files are just a special case of CSV files. My original tab-delimited file works just the same way, once i construct the reader with a tab (rather than the default comma) as the delimiter.
import csv
reader = csv.DictReader(open('somefile.txt', 'rb'), delimiter='\t')
for row in reader:
    # do something more useful here
    print row.get('New')
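The module writes as nicely as it reads, too. For completeness, here’s a sketch of going the other direction with csv.DictWriter, again with tabs as the delimiter (writing to an in-memory buffer here rather than a real file):

```python
import csv
import io

mapping = [
    {'Old': 'Aphek1', 'New': 'AphekOfAsher'},
    {'Old': 'Aphek2', 'New': 'AphekOfSharon'},
]

out = io.StringIO()  # stands in for open('somefile.txt', 'w')
writer = csv.DictWriter(out, fieldnames=['Old', 'New'], delimiter='\t')
writer.writeheader()   # emits the "Old<tab>New" header row
writer.writerows(mapping)
print(out.getvalue())
```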
There are a few other gotchas, the most important of which for me is that csv doesn’t handle Unicode. So if you have to read Unicode data, you’re back to reading the data directly, splitting lines on tabs, etc.
The best code is usually the code you didn’t write and don’t have to maintain. No matter how many times i stop and ask The Question, i still don’t do it enough.