Source: Text Mining in Python: Steps and Examples | KDnuggets

Definition

Deriving meaningful information from natural language text

Before you begin

Install nltk and download all the data packages it requires

# nltk, the prime library
# svgling for NE Recognition (gives us a tree)
!pip install nltk svgling

import nltk
nltk.download('punkt_tab')                       # for tokenization
nltk.download('wordnet')                         # for lemmatization
nltk.download('stopwords')                       # for stopwords
nltk.download('averaged_perceptron_tagger_eng')  # for POS tagging
nltk.download('words')                           # for NE Recognition
nltk.download('maxent_ne_chunker_tab')           # for NE Recognition

Terminology

Tokenization

  • Breaking strings into tokens¹
  • 3 steps
    1. break the sentence into words
    2. understand the importance of each word (w.r.t. the sentence)
    3. produce a structural description of the input sentence
  • Find the frequency of distinct words in the text
  • using nltk.tokenize.word_tokenize (see the sketch below)
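
A minimal sketch of word tokenization (the sample sentence is my own):

from nltk.tokenize import word_tokenize

text = "Text mining derives meaningful information from natural language text."
tokens = word_tokenize(text)  # splits on words and punctuation
print(tokens)
# ['Text', 'mining', 'derives', 'meaningful', 'information', 'from',
#  'natural', 'language', 'text', '.']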

Finding Frequencies

  • using nltk.probability.FreqDist
  • fdist.most_common(N) gives the N most common words and their frequencies
  • fdist.plot(N) plots them (see the sketch below)
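
A minimal sketch of a frequency distribution (hypothetical sample text; the ordering of equally frequent words may vary):

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

text = "the cat sat on the mat and the dog sat too"
fdist = FreqDist(word_tokenize(text))
print(fdist.most_common(3))  # [('the', 3), ('sat', 2), ('cat', 1)]
fdist.plot(3)                # line plot of the 3 most common tokens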

Note

This doesn’t give us the most important words: the top of the frequency list is dominated by words like or, for and not, which don’t carry much significance.

Stemming

  • normalizing words into their base or root form
  • using nltk.stem.PorterStemmer
  • using nltk.stem.LancasterStemmer

Comparison

The Lancaster stemmer is more aggressive than the Porter stemmer, as the sketch below shows.
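
A minimal sketch of the comparison (the word list is my own; outputs are indicative):

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ["waited", "maximum", "friendship"]:
    print(word, "->", porter.stem(word), "/", lancaster.stem(word))
# Lancaster typically truncates harder, e.g. it cuts "maximum" down to
# "maxim" and "friendship" down to "friend", where Porter leaves both intact.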

Lemmatization

  • groups the different inflected forms of a word into one common root, called the lemma
  • Outputs a proper word
  • Lemmatization should convert gone, going and went to go (see the sketch below)
  • using nltk.stem.WordNetLemmatizer
  • using ?.SpacyLemmatizer
  • using ?.TextBlob
  • using ?.StanfordCoreNLP²
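
A minimal sketch with WordNetLemmatizer. Note the pos="v" argument: the default part of speech is noun, so verb forms like went would otherwise come back unchanged:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["gone", "going", "went"]:
    print(lemmatizer.lemmatize(word, pos="v"))  # go, go, go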

Stemming vs Lemmatization

Stemming only converts the word to its base form (by chopping off the last few letters), whereas lemmatization considers the meaning of the word. The contrast below makes this concrete.
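
A one-word contrast (my own example): the stemmer just strips the suffix and leaves a non-word, while the lemmatizer returns a real dictionary word:

from nltk.stem import PorterStemmer, WordNetLemmatizer

print(PorterStemmer().stem("studies"))           # studi (not a real word)
print(WordNetLemmatizer().lemmatize("studies"))  # study (a proper lemma)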

Stop Words

  • most common words like the, a, at, for, above, etc. (prepositions, articles and the like). They almost never carry meaning on their own, but are just used for framing sentences
  • Can be removed using nltk.corpus.stopwords (see the sketch below)
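
A minimal sketch of stopword removal (hypothetical sample sentence):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
text = "We saw the yellow dog at the park"
filtered = [w for w in word_tokenize(text) if w.lower() not in stop_words]
print(filtered)  # ['saw', 'yellow', 'dog', 'park']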

Part-of-speech tagging (POS)


  • assigns a part of speech to each word
  • using nltk.pos_tag on a list of tokens (worked example below)
pos = nltk.pos_tag(list_of_tokens)
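
A minimal sketch, reusing the sentence from the chunking example further down; the tags match the output shown there:

import nltk
from nltk.tokenize import word_tokenize

tokens = word_tokenize("We saw the yellow dog")
print(nltk.pos_tag(tokens))
# [('We', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('yellow', 'JJ'), ('dog', 'NN')]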

Named entity recognition


  • named entities such as persons, locations, companies, quantities and monetary values
  • using nltk.ne_chunk
  • Have to tokenize and POS-tag before chunking
import nltk
from nltk.tokenize import word_tokenize

text = "Mark works at Google in London"  # hypothetical sample with named entities
token = word_tokenize(text)
tags = nltk.pos_tag(token)
chunk = nltk.ne_chunk(tags)
chunk  # displays as a tree (labels like PERSON, ORGANIZATION, GPE)

Chunking

  • picking up individual pieces of information and grouping them into bigger pieces; in NLP and text mining, this means grouping words or tokens into chunks


Though I didn’t quite understand why we used the RegexpParser here at first, it supplies the chunk grammar: NP: {<DT>?<JJ>*<NN>} reads as “a noun phrase is an optional determiner, followed by any number of adjectives, followed by a noun”.

text = "We saw the yellow dog"
token = word_tokenize(text)
tags = nltk.pos_tag(token)
reg = "NP: {<DT>?<JJ>*<NN>}"
a = nltk.RegexpParser(reg)
result = a.parse(tags)
print(result)
 
# (S We/PRP saw/VBD (NP the/DT yellow/JJ dog/NN))

Footnotes

  1. small structures or units

  2. marked with ? because we still have to check which library provides these