Source: Text Mining in Python: Steps and Examples | KDnuggets
Definition
Deriving meaningful information from natural language text
Before you begin
Install the libraries, then download the data packages that nltk needs
# nltk, the prime library
# svgling for NE Recognition (renders the parse tree)
!pip install nltk svgling
import nltk
nltk.download('punkt_tab') # for tokenization
nltk.download('wordnet') # for lemmatization
nltk.download('stopwords') # for stopwords
nltk.download('averaged_perceptron_tagger_eng') # for POS
nltk.download('words') # for NE Recognition
nltk.download('maxent_ne_chunker_tab') # for NE Recognition
Terminology
Tokenization
- Breaking strings into tokens
- 3 steps
    - break the sentence into words
    - understand the importance of each word (with respect to the sentence)
    - produce a structural description of the input sentence
- Find the frequency of distinct tokens in the text
- using
nltk.tokenize.word_tokenize
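A minimal sketch (the sample sentence is mine, not from the article):
from nltk.tokenize import word_tokenize

text = "Natural language processing derives meaning from text."
tokens = word_tokenize(text)
print(tokens)
# ['Natural', 'language', 'processing', 'derives', 'meaning', 'from', 'text', '.']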
Finding Frequencies
- using
nltk.probability.FreqDist
- fdist.most_common(N) gives the N most common words and their frequencies
- fdist.plot(N) plots them
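For example, continuing from the tokens above:
from nltk.probability import FreqDist

fdist = FreqDist(tokens)
print(fdist.most_common(3))  # the 3 most frequent tokens with their counts
fdist.plot(10)               # plots the 10 most frequent tokens (needs matplotlib)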
Note
This doesn’t give us the most important words: it mostly surfaces words like
or, for, not etc., which don’t carry much significance.
Stemming
- normalizing words into their base or root form
- using
nltk.stem.PorterStemmer
- using
nltk.stem.LancasterStemmer
Comparison
The Lancaster stemmer is more aggressive than the Porter stemmer, as the sketch below shows.
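A quick comparison (the word list is mine, not the article's):
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
for word in ["waiting", "studies", "maximum"]:
    print(word, porter.stem(word), lancaster.stem(word))
# Lancaster cuts harder, e.g. it reduces "maximum" to "maxim" while Porter leaves it intact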
Lemmatization
- groups different inflected forms of a word into one common root, called the lemma
- outputs a proper word
- lemmatization should map gone, going and went to go (see the sketch below)
- using
nltk.stem.WordNetLemmatizer
- using
?.SpacyLemmatizer
- using
?.TextBlob
- using
?.StanfordCoreNLP
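A minimal sketch with WordNetLemmatizer; note that it treats words as nouns by default, so the verb part of speech has to be passed explicitly:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["gone", "going", "went"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))  # pos="v" = lemmatize as a verb
# all three come out as "go"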
Stemming vs Lemmatization
Stemming only converts the word to its base (by removing the last few letters), whereas lemmatization considers the meaning of the word:
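A side-by-side illustration (the example word is mine):
from nltk.stem import PorterStemmer, WordNetLemmatizer

print(PorterStemmer().stem("studies"))           # 'studi' (suffix chopped, not a word)
print(WordNetLemmatizer().lemmatize("studies"))  # 'study' (a real dictionary word)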
Stop Words
- most common words like
the, a, at, for, above etc. (prepositions and the like). They almost never carry meaning on their own, but are just used for framing sentences
- can be removed using
nltk.corpus.stopwords
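Filtering them out (the sample sentence is mine):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("This is an example showing off stop word filtration")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)
# ['example', 'showing', 'stop', 'word', 'filtration']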
Part-of-speech tagging (POS)
Transclude of Text-mining-in-Python-2024-10-29-09.12.43.excalidraw
- assigns the part of speech to each word
- using
nltk.pos_tag
pos = nltk.pos_tag(list_of_tokens)
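For example (the sentence is mine, and the exact tags depend on the tagger model):
import nltk
from nltk.tokenize import word_tokenize

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]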
Named entity recognition
Transclude of Text-mining-in-Python-2024-10-29-09.26.43.excalidraw
- named entities such as
person, location, company, quantities and monetary value
- using
nltk.ne_chunk
- have to tokenize and POS-tag before chunking
import nltk
import svgling  # makes nltk trees render as SVG in notebooks
from nltk.tokenize import word_tokenize

text = "Mark works at Google in California"  # sample sentence (mine, not the article's)
token = word_tokenize(text)
tags = nltk.pos_tag(token)
chunk = nltk.ne_chunk(tags)
chunk  # in a notebook, this displays the chunk tree
Chunking
- picking up individual pieces of information and grouping them into bigger pieces; in NLP and text mining, grouping words or tokens into chunks
Transclude of Text-mining-in-Python-2024-10-29-09.40.50.excalidraw
Though I didn’t quite understand why we used the RegexpParser here at first: it supplies the chunk grammar, i.e. the pattern of POS tags (an optional determiner, any number of adjectives, then a noun) that defines what counts as an NP chunk.
text = "We saw the yellow dog"
token = word_tokenize(text)
tags = nltk.pos_tag(token)
reg = "NP: {<DT>?<JJ>*<NN>}"
a = nltk.RegexpParser(reg)
result = a.parse(tags)
print(result)
# (S We/PRP saw/VBD (NP the/DT yellow/JJ dog/NN))