Source: Text Mining in Python: Steps and Examples | KDnuggets
Definition
Deriving meaningful information from natural language text
Before you begin
Install the libraries, then download the data packages that nltk needs
# nltk, the prime library
# svgling for NE Recognition (renders the parse tree)
!pip install nltk svgling
import nltk
nltk.download('punkt_tab') # for tokenization
nltk.download('wordnet') # for lemmatization
nltk.download('stopwords') # for stopwords
nltk.download('averaged_perceptron_tagger_eng') # for POS
nltk.download('words') # for NE Recognition
nltk.download('maxent_ne_chunker_tab') # for NE Recognition
Terminology
Tokenization
- Breaking strings into tokens
- 3 steps
    - break the sentence into words
    - understand the importance of each word (with respect to the sentence)
    - produce a structural description of the input sentence
- Find the frequency of distinct tokens in the text
- using
nltk.tokenize.word_tokenize
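A minimal sketch (the sample sentence is mine, not from the article):
from nltk.tokenize import word_tokenize

text = "Natural language processing derives meaning from text."
tokens = word_tokenize(text)
print(tokens)
# ['Natural', 'language', 'processing', 'derives', 'meaning', 'from', 'text', '.']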
Finding Frequencies
- using
nltk.probability.FreqDist
- fdist.most_common(N) gives the N most common words and their frequencies
- fdist.plot(N) plots them
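For example, continuing from the tokens above:
from nltk.probability import FreqDist

fdist = FreqDist(tokens)
print(fdist.most_common(3))  # the 3 most frequent tokens with their counts
fdist.plot(10)               # plots the 10 most frequent tokens (needs matplotlib)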
Note
This doesn’t give us the most important words: it mostly surfaces words like
or, for, not etc., which don’t carry much significance.
Stemming
- normalizing words into their base or root form
- using
nltk.stem.PorterStemmer
- using
nltk.stem.LancasterStemmer
Comparison
The Lancaster stemmer is more aggressive than the Porter stemmer, as the sketch below shows.
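A quick comparison (the word list is mine, not the article's):
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
for word in ["waiting", "studies", "maximum"]:
    print(word, porter.stem(word), lancaster.stem(word))
# Lancaster cuts harder, e.g. it reduces "maximum" to "maxim" while Porter leaves it intact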
Lemmatization
- groups different inflected forms of a word into one common root, called the lemma
- outputs a proper word
- lemmatization should map gone, going and went to go (see the sketch below)
- using
nltk.stem.WordNetLemmatizer
- using
?.SpacyLemmatizer
- using
?.TextBlob
- using
?.StanfordCoreNLP
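A minimal sketch with WordNetLemmatizer; note that it treats words as nouns by default, so the verb part of speech has to be passed explicitly:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["gone", "going", "went"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))  # pos="v" = lemmatize as a verb
# all three come out as "go"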
Stemming vs Lemmatization
Stemming only converts the word to its base (by removing the last few letters), whereas lemmatization considers the meaning of the word:
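A side-by-side illustration (the example word is mine):
from nltk.stem import PorterStemmer, WordNetLemmatizer

print(PorterStemmer().stem("studies"))           # 'studi' (suffix chopped, not a word)
print(WordNetLemmatizer().lemmatize("studies"))  # 'study' (a real dictionary word)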
Stop Words
- most common words like
the, a, at, for, above etc. (prepositions and the like). They almost never carry meaning on their own, but are just used for framing sentences
- can be removed using
nltk.corpus.stopwords
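Filtering them out (the sample sentence is mine):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("This is an example showing off stop word filtration")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)
# ['example', 'showing', 'stop', 'word', 'filtration']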
Part-of-speech tagging (POS)
Transclude of Text-mining-in-Python-2024-10-29-09.12.43.excalidraw
- assigns the part of speech to each word
- using
nltk.pos_tag
pos = nltk.pos_tag(list_of_tokens)
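For example (the sentence is mine, and the exact tags depend on the tagger model):
import nltk
from nltk.tokenize import word_tokenize

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]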
Named entity recognition
Transclude of Text-mining-in-Python-2024-10-29-09.26.43.excalidraw
- named entities such as
person, location, company, quantities and monetary value
- using
nltk.ne_chunk
- have to tokenize and POS-tag before chunking
import nltk
import svgling  # makes nltk trees render as SVG in notebooks
from nltk.tokenize import word_tokenize

text = "Mark works at Google in California"  # sample sentence (mine, not the article's)
token = word_tokenize(text)
tags = nltk.pos_tag(token)
chunk = nltk.ne_chunk(tags)
chunk  # in a notebook, this displays the chunk tree
Chunking
- picking up individual pieces of information and grouping them into bigger pieces; in NLP and text mining, grouping words or tokens into chunks
Transclude of Text-mining-in-Python-2024-10-29-09.40.50.excalidraw
Though I didn’t quite understand why we used the RegexpParser here at first: it supplies the chunk grammar, i.e. the pattern of POS tags (an optional determiner, any number of adjectives, then a noun) that defines what counts as an NP chunk.
text = "We saw the yellow dog"
token = word_tokenize(text)
tags = nltk.pos_tag(token)
reg = "NP: {<DT>?<JJ>*<NN>}"
a = nltk.RegexpParser(reg)
result = a.parse(tags)
print(result)
# (S We/PRP saw/VBD (NP the/DT yellow/JJ dog/NN))