Suppose you want to get the most frequent words from a text. The task quickly reveals its caveats: there are swarms of words, each with dozens of forms, all those n’t and ‘s, and also commas and periods… All of these should be accounted for.
Luckily for us, very powerful tools exist for word processing and text mining - libraries that handle these tasks in the best way possible.
Get familiar with nltk - a powerful library for NLP (natural language processing)!
Ok, so the aim is to get word frequencies. The workflow would be:
- Imports - get the libraries
- Normalize - prepare the words
- Tokenize - smart word splitting
- Lemmatize - smart processing into meaningful ‘universal’ word forms
- Get word frequencies!
Follow these simple, yet powerful text mining steps.
Imports
First, import nltk. You will also probably need to run nltk.download()
to get all the resources. That might take some time!
import nltk
nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
True
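If you prefer not to go through the interactive downloader, you can fetch just the resources this walkthrough uses. This is a minimal alternative sketch - resource names may differ slightly between nltk versions:
# Download only what this walkthrough needs (instead of everything)
nltk.download('punkt')      # models used by word_tokenize
nltk.download('stopwords')  # stop word lists
nltk.download('wordnet')    # lexicon behind WordNetLemmatizer
nltk.download('gutenberg')  # corpus containing carroll-alice.txt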
Load the raw text of Alice in Wonderland as an example for our analysis.
alice = nltk.corpus.gutenberg.raw('carroll-alice.txt')
len(alice)
144395
Normalization
Normalization here simply means lowercasing all the words.
alice_lower = alice.lower()
print(alice_lower[:393])
[alice's adventures in wonderland by lewis carroll 1865]
chapter i. down the rabbit-hole
alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought alice 'without pictures or
conversation?'
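A quick sanity check (not part of the original run) confirms that the capitalized forms are gone after lowercasing:
# 'Alice' and 'alice' are now counted as the same word
print('Alice' in alice_lower)  # expected: False
print('alice' in alice_lower)  # expected: True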
Tokenize
Tokenization is a smart splitting of the text into its constituent parts: words, symbols, endings. nltk
has a powerful tool for this, word_tokenize
, which takes many different cases into consideration.
alice_tokenized = nltk.word_tokenize(alice_lower)
print(alice_tokenized[:80])
['[', 'alice', "'s", 'adventures', 'in', 'wonderland', 'by', 'lewis', 'carroll', '1865', ']', 'chapter', 'i.', 'down', 'the', 'rabbit-hole', 'alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', "'and", 'what', 'is', 'the', 'use', 'of', 'a', 'book', ',', "'", 'thought', 'alice', "'without", 'pictures', 'or', 'conversation', '?']
Filter Stop words
There are a lot of words such as ‘a’, ‘or’, ‘the’. By the way, ‘the’ is the most frequent word in English. Since we are interested in meaningful words, we shall filter out these stop words.
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))
alice_tokenized = [ token for token in alice_tokenized if token not in stopWords ]
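If you are curious what exactly gets filtered, you can peek at the stop word list itself. This is just an optional inspection step, not part of the pipeline:
# Inspect nltk's English stop word list
print(len(stopWords))           # how many stop words nltk ships with
print(sorted(stopWords)[:10])   # a small alphabetical sample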
Filter punctuation
Punctuation symbols are among the most frequent tokens in a text as well. For our purposes, we should filter them out too.
alice_tokenized = [token for token in alice_tokenized if token.isalpha()]
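Note that isalpha() keeps only purely alphabetic tokens, so tokens containing digits or hyphens, such as ‘1865’ and ‘rabbit-hole’ from the sample above, are dropped along with the punctuation:
# isalpha() is True only for purely alphabetic tokens
print('alice'.isalpha())        # expected: True
print('1865'.isalpha())         # expected: False
print('rabbit-hole'.isalpha())  # expected: False - the hyphen disqualifies it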
Lemmatize
Lemmatization is a smarter kind of stemming. Stemming finds and transforms a word to its stem. For example, the stem of universal
is univers
- which is not very appealing for a human. Lemmatization, on the other hand, produces meaningful words, so universal
or universally
would be transformed to, actually, universal
!
See examples of stemming and lemmatization:
# Stemming
porter = nltk.PorterStemmer()
alice_stemmed = [porter.stem(t) for t in alice_tokenized] # Stemming, not lemmatization
print(alice_stemmed[:80])
['alic', 'adventur', 'wonderland', 'lewi', 'carrol', 'chapter', 'alic', 'begin', 'get', 'tire', 'sit', 'sister', 'bank', 'noth', 'twice', 'peep', 'book', 'sister', 'read', 'pictur', 'convers', 'use', 'book', 'thought', 'alic', 'pictur', 'convers', 'consid', 'mind', 'well', 'could', 'hot', 'day', 'made', 'feel', 'sleepi', 'stupid', 'whether', 'pleasur', 'make', 'would', 'worth', 'troubl', 'get', 'pick', 'daisi', 'suddenli', 'white', 'rabbit', 'pink', 'eye', 'ran', 'close', 'noth', 'remark', 'alic', 'think', 'much', 'way', 'hear', 'rabbit', 'say', 'dear', 'oh', 'dear', 'shall', 'late', 'thought', 'afterward', 'occur', 'ought', 'wonder', 'time', 'seem', 'quit', 'natur', 'rabbit', 'actual', 'took', 'watch']
# Lemmatization. Notice the difference.
WNlemma = nltk.WordNetLemmatizer()
alice_lemmatized = [WNlemma.lemmatize(t) for t in alice_tokenized]
print(alice_lemmatized[:80])
['alice', 'adventure', 'wonderland', 'lewis', 'carroll', 'chapter', 'alice', 'beginning', 'get', 'tired', 'sitting', 'sister', 'bank', 'nothing', 'twice', 'peeped', 'book', 'sister', 'reading', 'picture', 'conversation', 'use', 'book', 'thought', 'alice', 'picture', 'conversation', 'considering', 'mind', 'well', 'could', 'hot', 'day', 'made', 'feel', 'sleepy', 'stupid', 'whether', 'pleasure', 'making', 'would', 'worth', 'trouble', 'getting', 'picking', 'daisy', 'suddenly', 'white', 'rabbit', 'pink', 'eye', 'ran', 'close', 'nothing', 'remarkable', 'alice', 'think', 'much', 'way', 'hear', 'rabbit', 'say', 'dear', 'oh', 'dear', 'shall', 'late', 'thought', 'afterwards', 'occurred', 'ought', 'wondered', 'time', 'seemed', 'quite', 'natural', 'rabbit', 'actually', 'took', 'watch']
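Notice that forms like ‘beginning’ and ‘sitting’ are left unchanged: by default WordNetLemmatizer treats every token as a noun. Passing a part-of-speech hint (an optional refinement, not used in the walkthrough above) changes the result:
# By default the lemmatizer assumes nouns; a POS hint helps with verb forms
print(WNlemma.lemmatize('sitting'))           # expected: 'sitting'
print(WNlemma.lemmatize('sitting', pos='v'))  # expected: 'sit'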
Get Distribution
FreqDist
creates a dictionary mapping each word to its frequency.
word_tokens_freqdist = nltk.FreqDist(alice_lemmatized)
list(word_tokens_freqdist.keys())[:10]
['alice',
'adventure',
'wonderland',
'lewis',
'carroll',
'chapter',
'beginning',
'get',
'tired',
'sitting']
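FreqDist also behaves like a dictionary, so you can look up a single word’s count directly - for instance ‘alice’, which appears 396 times according to the most_common output further down:
# Dictionary-style lookup of a single word's count
print(word_tokens_freqdist['alice'])  # expected: 396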
# Get words with frequency > 60 and length > 5.
freqwords = ['{}:{}'.format(w, word_tokens_freqdist[w]) for w in word_tokens_freqdist.keys() if len(w) > 5 and word_tokens_freqdist[w] > 60]
freqwords
['thought:76', 'little:128']
The FreqDist class also provides handy methods to get the top frequent words from the frequency dictionary in a nice way.
Let's get the top 10 frequent words from the text!
# Get top 10 frequent words from a text
top_freq_words = word_tokens_freqdist.most_common(10)
print(top_freq_words)
[('said', 462), ('alice', 396), ('little', 128), ('one', 100), ('would', 90), ('know', 90), ('could', 86), ('like', 86), ('went', 83), ('thing', 79)]
We can see the top frequent words with their occurrence counts in the text, presented as a list of tuples.
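If you have matplotlib installed, FreqDist can also draw a quick frequency plot of the same top words - an optional extra, not shown in the output above:
# Plot the 10 most frequent words (requires matplotlib)
word_tokens_freqdist.plot(10)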
Cool!
These are some great basics of NLP. We managed to get the most frequent meaningful words from the beloved tale Alice in Wonderland. Said alice little one…
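Putting it all together, here is a compact sketch of the whole pipeline, using the same steps and names as above (the helper function top_words is just an illustration, not part of the original code):
def top_words(raw_text, n=10):
    """Normalize, tokenize, filter, lemmatize and count the most common words."""
    stop = set(stopwords.words('english'))
    lemmatizer = nltk.WordNetLemmatizer()
    tokens = nltk.word_tokenize(raw_text.lower())                  # normalize + tokenize
    tokens = [t for t in tokens if t.isalpha() and t not in stop]  # drop punctuation + stop words
    tokens = [lemmatizer.lemmatize(t) for t in tokens]             # lemmatize
    return nltk.FreqDist(tokens).most_common(n)                    # count

print(top_words(alice))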
Hope it was helpful. Your comments and insights are very welcome.