Source: Text Mining in Python: Steps and Examples | KDnuggets
Definition
Deriving meaningful information from natural language text
Before you begin
Install all the word libraries (corpora and models) that `nltk` requires
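A minimal sketch of that setup step. The resource list is my assumption about what the functions named in these notes need, not something the article specifies; newer NLTK releases use `_tab` / `_eng` variants of some names, so both spellings are listed and any name the installed version does not know simply ends up in the returned list.

```python
import nltk

# Assumed resource names; which ones exist depends on the NLTK version.
RESOURCES = [
    "punkt", "punkt_tab",                                             # word_tokenize
    "stopwords",                                                      # nltk.corpus.stopwords
    "wordnet",                                                        # WordNetLemmatizer
    "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng",   # pos_tag
    "maxent_ne_chunker", "maxent_ne_chunker_tab", "words",            # ne_chunk
]


def download_resources(names=RESOURCES):
    """Try to fetch each resource; return the names that could not be fetched."""
    return [name for name in names if not nltk.download(name, quiet=True)]
```

Calling `download_resources()` once (with network access) should be enough; `download_resources` is a hypothetical helper name, the only NLTK call involved is `nltk.download`.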
Terminology
Tokenization
- Breaking strings into tokens
- 3 steps
  - break the sentence into words
  - understand the importance of each word (with respect to the sentence)
  - produce a structural description of the input sentence
- Find the frequency of distinct tokens in the text
- using `nltk.tokenize.word_tokenize`
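A small sketch of the first step (breaking a sentence into words). `word_tokenize` needs the Punkt data to be installed, so the hypothetical `tokenize` wrapper below falls back to `wordpunct_tokenize`, a regex-based tokenizer that needs no downloads; the fallback is my substitution, not part of the article.

```python
from nltk.tokenize import word_tokenize, wordpunct_tokenize


def tokenize(text):
    """Split text into word and punctuation tokens."""
    try:
        return word_tokenize(text)
    except LookupError:
        # Punkt data is not installed: use a tokenizer that needs no downloads.
        return wordpunct_tokenize(text)


tokens = tokenize("Text mining is fun.")
print(tokens)  # ['Text', 'mining', 'is', 'fun', '.']
```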
Finding Frequencies
- using `nltk.probability.FreqDist`
  - `fdist.most_common(N)` gives the `N` most common words and their frequencies
  - `fdist.plot(N)` plots them
Note
This doesn’t give us the most important words: it only returned words like `or`, `for` and `not`, which don’t carry much significance.
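To make the frequency step concrete, here is a tiny sketch on a made-up token list (the sentence is mine, not from the article). It also shows the problem from the note: the top entries are function words.

```python
from nltk.probability import FreqDist

tokens = "the cat and the dog and the bird".split()
fdist = FreqDist(tokens)

top_two = fdist.most_common(2)
print(top_two)  # [('the', 3), ('and', 2)]

# fdist.plot(2) would draw the same counts; it needs matplotlib installed.
```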
Stemming
- normalizing a word into its base or root form
- using `nltk.stem.PorterStemmer`
- using `nltk.stem.LancasterStemmer`
Comparison
The Lancaster stemmer is more aggressive than the Porter stemmer
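A quick way to see the difference is to run both stemmers over the same words. This is only a sketch; the word list is my own choice, and I have left out expected output because the exact stems are best checked by running it.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

words = ["waiting", "giving", "given", "maximum"]
for word in words:
    # Print both stems side by side so the more aggressive cuts stand out.
    print(f"{word:10} porter={porter.stem(word):10} lancaster={lancaster.stem(word)}")
```

Neither stemmer needs any downloaded data, so this runs with a bare NLTK install.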
Lemmatization
- groups the different inflected forms of a word into one common root, called the lemma
- Outputs a proper word
- Lemmatization should convert `gone`, `going` and `went` to `go`
- using `nltk.stem.WordNetLemmatizer`
- using `?.SpacyLemmatizer`
- using `?.TextBlob`
- using `?.StanfordCoreNLP`
Stemming vs Lemmatization
Stemming only reduces a word to its base by removing the last few letters, whereas lemmatization considers the meaning of the word.
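The `gone` / `going` / `went` example can be sketched with the Porter stemmer next to `WordNetLemmatizer`. This assumes the WordNet data is installed (otherwise the list stays empty), and `compare` is a hypothetical helper name of mine.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer


def compare(words):
    """Return (word, stem, lemma) triples, treating every word as a verb."""
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    # pos="v" makes the lemmatizer look each word up as a verb; the default
    # is noun, which would leave these three forms unchanged.
    return [(w, stemmer.stem(w), lemmatizer.lemmatize(w, pos="v")) for w in words]


try:
    rows = compare(["gone", "going", "went"])
except LookupError:
    rows = []  # WordNet data missing; run nltk.download("wordnet") first

for word, stem, lemma in rows:
    print(word, stem, lemma)
```

With WordNet available, the lemma column should read `go` for all three words, while the stem column only strips suffixes.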
Stop Words
- most common words like `the`, `a`, `at`, `for`, `above` etc. (prepositions and the like). They almost never provide any meaning, but are just used for framing sentences
- Can be removed using `nltk.corpus.stopwords`
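A minimal sketch of the removal, assuming the `stopwords` corpus has been downloaded (if not, `kept` stays `None`). `remove_stop_words` is a hypothetical helper; the lower-casing is my addition so that `This` matches the list entry `this`.

```python
from nltk.corpus import stopwords


def remove_stop_words(tokens, language="english"):
    """Drop tokens that appear in NLTK's stop-word list (case-insensitive)."""
    stop_set = set(stopwords.words(language))
    return [t for t in tokens if t.lower() not in stop_set]


tokens = ["This", "is", "a", "simple", "example"]
try:
    kept = remove_stop_words(tokens)
except LookupError:
    kept = None  # corpus missing; run nltk.download("stopwords") first
print(kept)
```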
Part-of-speech tagging (POS)
Transclude of Text-mining-in-Python-2024-10-29-09.12.43.excalidraw
- assigns a part of speech to each word
- using `nltk.pos_tag`
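As a sketch, `nltk.pos_tag` takes a list of tokens and returns `(token, tag)` pairs. The tokens below are typed in by hand so the example does not depend on the tokenizer data; it still assumes the default tagger model is installed, and leaves the list empty if it is not.

```python
import nltk


def tag_tokens(tokens):
    """Return (token, tag) pairs using NLTK's default English tagger."""
    return nltk.pos_tag(tokens)


tokens = ["The", "dog", "barked", "loudly"]
try:
    tagged = tag_tokens(tokens)
except LookupError:
    tagged = []  # tagger model missing; it has to be fetched with nltk.download

for token, tag in tagged:
    print(token, tag)
```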
Named entity recognition
Transclude of Text-mining-in-Python-2024-10-29-09.26.43.excalidraw
- detects named entities such as `person`, `location`, `company`, `quantities` and `monetary value`
- using `nltk.ne_chunk`
- The text has to be tokenized and POS-tagged before chunking
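The tokenize, tag, chunk order can be sketched as below. The sentence and the `named_entities` helper are my own illustration; the example assumes the tagger and NE chunker models are installed and returns an empty list otherwise.

```python
import nltk
from nltk.tree import Tree


def named_entities(tokens):
    """Return (entity text, label) pairs found by nltk.ne_chunk."""
    tagged = nltk.pos_tag(tokens)   # ne_chunk expects POS-tagged tokens
    tree = nltk.ne_chunk(tagged)
    entities = []
    for node in tree:
        if isinstance(node, Tree):  # subtrees are the recognised entities
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities


tokens = ["Alice", "works", "at", "Google", "in", "London"]
try:
    found = named_entities(tokens)
except (LookupError, ImportError):
    found = []  # models (or a dependency of the chunker) are missing
print(found)
```

Tokens that are not part of an entity stay as plain `(word, tag)` tuples in the tree, which is why the loop only keeps `Tree` nodes.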
Chunking
- picking up individual pieces of information and grouping them into bigger pieces. In NLP and text mining, this means grouping words or tokens into chunks
Transclude of Text-mining-in-Python-2024-10-29-09.40.50.excalidraw
Though I didn’t quite understand why we used the `RegexpParser` here.
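One possible reading of why `RegexpParser` shows up: POS tagging only labels individual words, so something still has to say which tag sequences count as a chunk, and `RegexpParser` takes those rules as regular expressions over tags. A minimal sketch, with a grammar and a hand-tagged sentence of my own rather than the article's:

```python
from nltk import RegexpParser
from nltk.tree import Tree

# Hand-tagged tokens so the example needs no downloaded models.
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# One rule: a noun phrase is an optional determiner, any adjectives, then a noun.
parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Chunks come back as subtrees; everything else stays a (word, tag) tuple.
noun_phrases = [" ".join(word for word, _tag in subtree.leaves())
                for subtree in tree if isinstance(subtree, Tree)]
print(noun_phrases)  # ['the little dog', 'the cat']
```

Swapping the grammar string changes what gets grouped, which seems to be the point of doing chunking with a regex-based parser.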