LinuxFest Northwest 2008, April 26
Goals
- Introduce the field of Natural Language Processing (NLP)
- Introduce NLTK
- Illustrate some simple applications for processing language
- Convince you it's simple enough to download and try it out!
- Non-Goals:
- Teaching Python
- Comprehensive discussion of NLP
- Theoretical justification
- Depth
Intended Audience
- Pythonistas (or anyone able to follow along) who:
- want to go beyond string processing and regexps
- deal with large quantities of prose text
- are looking for new tools and techniques
- People who want to explore NLP technology
Outline
- Overviews: me, you, Python, NLP
- Introducing the Natural Language Toolkit (NLTK)
- Some sample tasks
- Process a corpus
- Word frequency analysis
- Stemming
- Part-of-speech analysis
- Collocation analysis
Overview: Why Python?
- Clean, elegant, object-oriented language
- Dynamic typing and interactive evaluation are well suited to rapid prototype development
- Rich set of library modules
- Active open-source community
- Broadly used in scientific research
Overview: What is Natural Language Processing?
- A variety of technologies for analyzing, manipulating, and generating human language data
- Related fields of study:
- Linguistics and Cognitive Science
- Artificial Intelligence
- Computer Science
- Probability and Statistics, Machine Learning
Overview: What is Natural Language Processing? (2)
- Some areas of interest:
- Words (=tokens): their internal structure (=morphology), syntactic function (part of speech), distribution, and semantics
- Larger syntactic structures: phrases, clauses, sentences (= grammar)
- Even larger structures: discourse, dialogue, genre, style, corpus
- Meaning structures
- Over the last 2 decades, approaches in the field have shifted toward:
- Computational
- Statistical
- Corpus-driven
Why we need NLP
- Words are ambiguous: "stock" can mean
- share of a company
- broth
- goods stored in a warehouse
- unmodified
- Much information in language is implicit:
- "Edward Nance, 34, doesn't think ....
Nance, who is also a paid consultant to ABC News, said ..."
- My wife loves pickles, but I don't.
- Syntactic structure is ambiguous too:
"time flies like an arrow"
"fruit flies like a banana"
Annoying Questions (Computational) Linguists Hear
- How many languages do you speak?
- How many words are in the English language?
- How many Eskimo words are there for snow?
Overview: The Natural Language Toolkit
- Available from http://nltk.org: "A suite of open source Python modules, data sets, and tutorials supporting research and development in NLP", including:
- Code: corpus readers, tokenizers, stemmers, taggers, parsers, WordNet, and more (50k lines of code)
- Corpora: >40 annotated data sets, including some widely used in natural language processing (>300 MB of data)
- Documentation: a 400-page book, articles, reviews, API documentation, licensed under Creative Commons (see individual license details)
- Used for instruction in more than 3 dozen universities
- Project leaders: Steven Bird, Ewan Klein, Edward Loper, along with dozens of other developers and contributors
Overview: The Natural Language Toolkit (2)
- All open source and freely available
- Under active development by a responsive community
- Most modules have a demo() function that shows what they do
- Installation:
- Prerequisite: Python 2.4 or later
- Download and install the code (tarball, zip, Windows installer)
- Download the data (corpora)
- Optional modules: numpy, matplotlib
- Optional: download the documentation
So What Can You Do With NLTK?
- Work with a corpus
- Tokenize, stem, and count
- Part-of-speech analysis ("tagging")
So What Else Can You Do With NLTK?
(We won't have time to cover these today)
- More sophisticated kinds of part-of-speech tagging
- Simplified phrase analysis ("chunking") and sophisticated syntactic parsing
- Run a chatbot
- Feature-based statistical classification
- Draw trees, directed graphs, graphs, and other objects using Tkinter
- Work with other container objects
- Theorem proving and model building
- Create Word Finder style puzzles (and their solutions)
- Represent and evaluate semantic structures for language as first-order logic formulas
NLTK Corpora
- NLTK includes more than 40 corpora and corpus samples (300 MB), along with readers and processing methods
- Various types and stages of analysis
- Text: News, presidential addresses, selected Project Gutenberg books
- Tagged: Brown Corpus
- Parsed: Penn Treebank Sample, other languages
- Categorized: Reuters-21578
- Lexicons: CMU Pronouncing Dictionary, stopwords in multiple languages
- Shakespeare XML corpus
- ... and many others
Task #1: Process a Corpus
>>> from nltk.corpus import gutenberg
>>> print gutenberg.files()
('austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'blake-songs.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt')
>>> len(gutenberg.words('chesterton-brown.txt'))
89090
- Also useful:
- corpus.sents(): list of sentences
- corpus.tagged_words(): list of (word, tag) tuples (for tagged corpora)
Task #2: Word Frequency Analysis
>>> from nltk.probability import FreqDist
>>> fd = FreqDist()
>>> fd
<FreqDist with 0 samples>
>>> for word in gutenberg.words('chesterton-brown.txt'):
...     fd.inc(word)
...
>>> fd.N() # number of samples
89090
>>> fd.B() # number of bins
8839
Task #2: Word Frequency Analysis (2)
>>> fds = fd.sorted()
>>> for word in fds[:10]:
...     print word, fd[word]
...
the 4399
, 4251
. 2889
of 2151
and 2119
a 2103
" 1484
to 1439
in 1233
was 1144
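For readers without NLTK installed, the counting steps above can be approximated with the standard library. This is a minimal sketch in Python 3 (the slides use Python 2), assuming a made-up sample sentence and naive whitespace tokenization; `collections.Counter` stands in for `FreqDist` and is not the NLTK API.

```python
from collections import Counter

# Hypothetical sample text; the slides count words in a Gutenberg corpus file.
text = "the cat sat on the mat . the dog sat on the log ."

# Naive whitespace tokenization (an assumption; corpus readers do more work).
tokens = text.split()

counts = Counter(tokens)

total_samples = sum(counts.values())   # analogous to fd.N() on the slide
distinct_samples = len(counts)         # analogous to fd.B() on the slide

# Most frequent items first, similar in spirit to fd.sorted()[:10].
top_three = counts.most_common(3)
for word, frequency in top_three:
    print(word, frequency)
```

As in the corpus output, punctuation marks are counted as tokens unless they are filtered out first.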
Digression: Zipf's Law
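- Word frequency illustrates Zipf's Law (a word's frequency is roughly inversely proportional to its frequency rank). The top-ranked word is very common, while about half of the vocabulary occurs only once: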
>>> fd['the'] # rank = 1
4399
>>> len(filter(lambda w: fd[w] == 1, fd.keys())) # frequency = 1
4373
- You can also use NLTK to plot this relationship
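One way to inspect the relationship without plotting is to tabulate rank times frequency, which Zipf's Law predicts to be roughly constant. The sketch below is Python 3 with only the standard library; the helper names `rank_frequency` and `hapax_count` are hypothetical, and the toy data is constructed to fit the law exactly, which real text will not do.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

def hapax_count(tokens):
    """Count the words that occur exactly once (the 'frequency = 1' figure)."""
    counts = Counter(tokens)
    return sum(1 for freq in counts.values() if freq == 1)

# Hypothetical toy data built so that frequency equals 12 / rank.
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
for row in rank_frequency(tokens):
    print(row)
```

On a real corpus the product in the last column drifts rather than staying fixed, but it typically varies far less than the raw frequencies do.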
Task #3: Stemming
>>> import nltk
>>> stemmer = nltk.PorterStemmer()
>>> stemmer.stem('appearance')
'appear'
>>> verbs = ['appears', 'appear', 'appeared', 'appearing', 'appearance']
>>> map(stemmer.stem, verbs)
['appear', 'appear', 'appear', 'appear', 'appear']
- Other stemmers:
- regexp
- Lancaster
- WordNet
- Build your own!
- Also other tokenizers
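To give a feel for the "build your own" option, here is a deliberately crude suffix-stripping stemmer in Python 3. It is a sketch only: the suffix list and the `toy_stem` name are assumptions made for this illustration, and it implements neither the Porter algorithm nor NLTK's regexp stemmer.

```python
import re

# Hypothetical suffix list, chosen only to cover the example words.
SUFFIX_PATTERN = re.compile(r"(ance|ing|ed|s)$")

def toy_stem(word, min_stem_length=3):
    """Strip one listed suffix, provided enough of the word remains."""
    match = SUFFIX_PATTERN.search(word)
    if match and match.start() >= min_stem_length:
        return word[:match.start()]
    return word

verbs = ['appears', 'appear', 'appeared', 'appearing', 'appearance']
stems = [toy_stem(verb) for verb in verbs]
print(stems)
```

The minimum stem length keeps short words such as "sing" intact, but a rule this simple still over-stems many words (for example, "during" becomes "dur"), which is why real stemmers use many more rules.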
Task #4: Part-of-speech Frequency Analysis
>>> from nltk.corpus import brown
>>> from nltk.probability import FreqDist, ConditionalFreqDist
>>> fd = FreqDist()
>>> cfd = ConditionalFreqDist()
>>> for text in brown.files():
...     for sent in brown.tagged_sents(text):
...         for (token, tag) in sent:
...             fd.inc(tag)
...             cfd[token].inc(tag)
...
>>> fd['NN']
152470
>>> for pos in cfd['light']:
... print pos, cfd['light'][pos]
...
VB 9
JJ 60
NN 251
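The idea behind a conditional frequency distribution is a separate counter for each condition, here one tag counter per word. A minimal Python 3 sketch using the standard library follows; the tagged sentences are invented for illustration, and `defaultdict(Counter)` is a stand-in for `ConditionalFreqDist`, not the NLTK class itself.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sentences with Brown-style tags.
tagged_sents = [
    [("the", "AT"), ("light", "NN"), ("was", "BEDZ"), ("dim", "JJ")],
    [("light", "VB"), ("the", "AT"), ("fire", "NN")],
    [("a", "AT"), ("light", "JJ"), ("meal", "NN")],
    [("the", "AT"), ("light", "NN"), ("faded", "VBD")],
]

tag_counts = Counter()                 # plays the role of fd
tags_by_word = defaultdict(Counter)    # plays the role of cfd

for sent in tagged_sents:
    for token, tag in sent:
        tag_counts[tag] += 1           # how often each tag occurs overall
        tags_by_word[token][tag] += 1  # how often each tag occurs for this word

for tag, count in tags_by_word["light"].items():
    print(tag, count)
```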
Task #5: Identifying Potential Collocations
- Collocation: a group of words that occurs together more frequently than chance would predict (e.g. 'pencil sharpener')
- We can identify candidates by comparing the probability of two-word sequences to the probabilities of the component words
Collocation code
import nltk
from operator import itemgetter

def collocations(words):
    # Count the individual words and the adjacent word pairs (bigrams)
    wfd = nltk.FreqDist(words)
    pfd = nltk.FreqDist(tuple(words[i:i+2]) for i in range(len(words)-1))
    # Score each bigram, then rank from highest to lowest score
    scored = [((w1,w2), score(w1, w2, wfd, pfd)) for w1, w2 in pfd]
    scored.sort(key=itemgetter(1), reverse=True)
    return map(itemgetter(0), scored)

def score(word1, word2, wfd, pfd, power=3):
    # Bigram frequency raised to `power`, divided by the product of the word frequencies
    freq1 = wfd[word1]
    freq2 = wfd[word2]
    freq12 = pfd[(word1, word2)]
    return freq12 ** power / float(freq1 * freq2)
Collocation Example
>>> file = 'chesterton-brown.txt'
>>> words = [word.lower() for word in gutenberg.words(file) if len(word) > 2]
>>> [w1+' '+w2 for w1, w2 in collocations(words)[:15]]
['father brown', 'project gutenberg', 'pilgrim pond', 'nigger ned', 'martin ward', 'sir claude', 'drugger davis', 'michael hart', 'sir wilson', 'calhoun kidd', 'http ://', 'literary archive', 'archive foundation', 'fund raising', 'thousand pounds']
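The scoring formula can be checked by hand on a tiny input. The following is a Python 3, standard-library sketch of the same score (bigram frequency cubed, divided by the product of the two word frequencies); the function name `collocation_candidates` and the sample sentence are assumptions for this illustration, and `Counter` replaces `nltk.FreqDist`.

```python
from collections import Counter

def score(word1, word2, word_counts, pair_counts, power=3):
    """Bigram frequency to the given power over the product of word frequencies."""
    pair_frequency = pair_counts[(word1, word2)]
    return pair_frequency ** power / (word_counts[word1] * word_counts[word2])

def collocation_candidates(words):
    """Return the bigrams of `words`, highest-scoring first."""
    word_counts = Counter(words)
    pair_counts = Counter(zip(words, words[1:]))  # adjacent word pairs
    return sorted(
        pair_counts,
        key=lambda pair: score(pair[0], pair[1], word_counts, pair_counts),
        reverse=True,
    )

# Hypothetical sample text in which "father brown" recurs as a unit.
words = ("father brown saw the man and father brown saw the dog "
         "and then father brown left the house").split()

ranked = collocation_candidates(words)
print(ranked[0])
```

Here "father" and "brown" each occur 3 times and always together, so the pair scores 27 / 9 = 3.0. A frequent word such as "the" appears in many different pairs, so each of its pairs scores lower.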
Useful Utilities: nltk.probability
- frequency distributions for counting things
- probability distributions (multiple varieties)
- conditional distributions
Useful Utilities: nltk.evaluate
- Confusion matrices
- Levenshtein distance
- Evaluation metrics
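Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a standard dynamic-programming version in Python 3, written for illustration; it is not the NLTK implementation, and the function name is an assumption.

```python
def levenshtein(source, target):
    """Return the edit distance between two strings."""
    # previous[j] holds the distance from the processed prefix of source to target[:j].
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (source_char != target_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory use grows with the length of the target string rather than with the product of both lengths.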
Useful Utilities: nltk.evaluate (2)
c:\Python24\Lib\site-packages\nltk>evaluate.py
---------------------------------------------------------------------------
Reference = ['DET', 'NN', 'VB', 'DET', 'JJ', 'NN', 'NN', 'IN', 'DET', 'NN']
Test = ['DET', 'VB', 'VB', 'DET', 'NN', 'NN', 'NN', 'IN', 'DET', 'NN']
Confusion matrix:
    | D         |
    | E I J N V |
    | T N J N B |
----+-----------+
DET | 3 0 0 0 0 |
 IN | 0 1 0 0 0 |
 JJ | 0 0 0 1 0 |
 NN | 0 0 0 3 1 |
 VB | 0 0 0 0 1 |
----+-----------+
(row = reference; col = test)
Accuracy: 0.8
---------------------------------------------------------------------------
Reference = set(['VB', 'DET', 'JJ', 'NN', 'IN'])
Test = set(['VB', 'DET', 'NN', 'IN'])
Precision: 1.0
Recall: 0.8
F-Measure: 0.888888888889
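The figures in this output can be reproduced from their definitions. The Python 3 sketch below uses the same reference and test data; the function names are chosen for illustration and are not claimed to match the nltk.evaluate API.

```python
from collections import Counter

reference_tags = ['DET', 'NN', 'VB', 'DET', 'JJ', 'NN', 'NN', 'IN', 'DET', 'NN']
test_tags = ['DET', 'VB', 'VB', 'DET', 'NN', 'NN', 'NN', 'IN', 'DET', 'NN']

def accuracy(reference, test):
    """Fraction of positions where the test tag equals the reference tag."""
    matches = sum(1 for ref, tst in zip(reference, test) if ref == tst)
    return matches / len(reference)

def precision(reference_set, test_set):
    """Fraction of test items that also appear in the reference."""
    return len(reference_set & test_set) / len(test_set)

def recall(reference_set, test_set):
    """Fraction of reference items that also appear in the test."""
    return len(reference_set & test_set) / len(reference_set)

def f_measure(reference_set, test_set):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    p = precision(reference_set, test_set)
    r = recall(reference_set, test_set)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Confusion counts keyed by (reference tag, test tag).
confusion = Counter(zip(reference_tags, test_tags))

reference_set = {'VB', 'DET', 'JJ', 'NN', 'IN'}
test_set = {'VB', 'DET', 'NN', 'IN'}

print(accuracy(reference_tags, test_tags))
print(precision(reference_set, test_set), recall(reference_set, test_set))
print(f_measure(reference_set, test_set))
```

Eight of the ten tags agree, which gives the accuracy of 0.8; the two disagreements are the off-diagonal cells in the confusion matrix (JJ tagged as NN, and NN tagged as VB).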
Summary
- NLTK provides a wealth of NLP resources for Python programmers
- Basic tasks are very accessible
- More sophisticated techniques are available (and require more sophisticated understanding)