Don’t get your hopes up: i’m not about to reveal some secret wisdom that will let you painlessly get nice sermon transcripts from your recordings. But i got an email from someone using Dragon Naturally Speaking 10 to automatically transcribe sermons from an audio recording. He asked some reasonable questions, and others might be interested in my answers.
Here’s a sample transcription he provided:
Children this morning I had intended on staying all the things that Bruce about five minutes ago regarding last week, which tells you I was not stepped in Wednesday morning. I but I do want to add just one thing as one share with you one comment from one guest last week as person came a little late and as result of that had to sit somewhere in the back of no exactly where she sat but it was in the back. And during our time of worship, she told me afterwards that at one point she looked up at the guys serving in the sound and much to her amazement expecting to see three or four heads this kind of buried in buttons.
Though i haven’t heard the original audio for comparison, this transcription is not too edifying, not to mention barely comprehensible! Unfortunately, while commercial speech-to-text systems have made a lot of progress, the current state of the art is too often a source of amusement rather than of usable transcriptions.
He comments:
As you can see, much work needs to be done to massage this transcript into final form. Some of the initial work is:
- Correct incorrectly transcribed words/phrases.
- Correct punctuation/sentence breaks.
- Define paragraph breaks.
My question is: could the resources available in the NLTK be used to automate some of the editing? Perhaps you have suggestions as to how one could most efficiently arrive at a final product, given the attached input.
Though i haven’t used Dragon’s system, i’ve spent a fair amount of my professional career working with speech-to-text systems, and unfortunately, i don’t know of any easy solution to this problem.
As commercial systems go, Dragon’s is probably about as good as you can get. While there are better-performing systems in the research labs (my former colleagues at BBN Technologies have one of the best), they’re focused on customers with different requirements and much larger budgets than pastors. You should definitely spend the time to provide training samples of your speech (under the same acoustic conditions): that should pay off in better results. You might also get slightly better results with careful microphone placement: though our ears are very forgiving (and our interpreting brains very good at guessing), that’s not true of speech-to-text systems. In general, a close-talking mike at a constant distance will work better than one fixed to a podium.
There is another approach: CastingWords uses Amazon’s Mechanical Turk system to have human transcribers do the work. Their advertised budget transcription rate (if you’re not in a rush) is $0.75 per minute: so for $15, you could get the transcript for a typical 20-minute sermon (i’m sure you don’t go longer!). That’s likely to be more cost-effective than having somebody clean up transcripts as poor as the one above. You can even provide them with the URL for an audio file and they’ll take it from there. Disclaimer: i’ve never used their service, but i’ve heard others say they’re happy with the results.
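The arithmetic behind that estimate is simple enough to sketch for other sermon lengths. The function name and the extra durations below are illustrative assumptions; only the $0.75 per-minute budget rate comes from the figure quoted above, and actual pricing may differ.

```python
def transcription_cost(minutes, rate_per_minute=0.75):
    """Estimated cost in dollars at a flat per-minute rate.

    The default rate is the advertised budget rate quoted above;
    treat it as an assumption and check current pricing.
    """
    return minutes * rate_per_minute


# Hypothetical sermon lengths, in minutes.
for minutes in (20, 30, 45):
    print(f"{minutes}-minute sermon: ${transcription_cost(minutes):.2f}")
```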
While there is a lot of capability in the Python-based Natural Language Toolkit (NLTK), which i highly recommend for programmers interested in natural language processing, it doesn’t provide any silver bullets that i’m aware of.
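To make the limits concrete, here is a minimal sketch of the kind of mechanical cleanup that can be automated. It uses only Python's standard library rather than NLTK, and the function names, the fixed paragraph size, and the sample string (shortened fragments of the transcript above) are all assumptions for illustration.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def group_paragraphs(sentences, per_paragraph=3):
    """Group sentences into paragraphs of a fixed size (a crude heuristic)."""
    return [
        " ".join(sentences[i:i + per_paragraph])
        for i in range(0, len(sentences), per_paragraph)
    ]


# Shortened fragments of the sample transcript, for illustration only.
sample = (
    "I was not stepped in Wednesday morning. "
    "I but I do want to add just one thing. "
    "It was in the back. "
    "And during our time of worship, she told me afterwards."
)

for paragraph in group_paragraphs(split_sentences(sample), per_paragraph=2):
    print(paragraph)
    print()
```

This only rearranges the punctuation the recognizer already produced: it can't tell that "stepped" or "I but I" are wrong, and its paragraph breaks ignore topic entirely. The hard part, correcting misrecognized words, still needs a human (or the original audio).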

