Natural Language Processing
NLP can be divided into supervised and unsupervised techniques
- Supervised:
  - Text Classification (assigning predefined labels to text):
    - Spam Detection
    - Sentiment Analysis
    - Intent Classification
    - Multi-Label, Multi-Class Text Classification
- Unsupervised: Topic Modeling
Applications
- Sentiment Analysis
- Speech Recognition
- Chatbot
- Machine Translation (Google Translate)
- Spell Checking
- Keyword Search
- Information Extraction
- Advertisement Matching
NLU - Natural Language Understanding
- Mapping input to useful representations
- Analyzing different aspects of the language
NLG - Natural Language Generation
- Text planning
- Sentence planning
- Text realization
Ambiguities
- Lexical Ambiguity - Two or more possible meanings in a word.
- She is looking for a match (matchstick vs partner)
- The fisherman went to the bank (riverbank vs bank)
- Syntactic Ambiguity (aka structural/grammatical ambiguity) - Two or more possible meanings in a sentence or sequence of words.
- The chicken is ready to eat (the chicken will eat vs the chicken will be eaten)
- Referential Ambiguity - A pronoun or referring expression can point to more than one antecedent.
- The boy told his father about the theft. He was very upset. (he = the boy or the father?)
(Processing) Terminologies
Step1: Tokenization
Process of breaking a string into tokens, i.e. small units such as words and special characters.
- Break a complex sentence into words
- Understand the importance of each word
- Produce a structural description of an input sentence
Bigrams, Trigrams, and N-grams - sequences of 2, 3, or n consecutive tokens taken together
from nltk.tokenize import word_tokenize   # requires: nltk.download('punkt')
from nltk.probability import FreqDist
from nltk.util import bigrams, trigrams, ngrams

text = "She is looking for a match"       # any input string
tokens = word_tokenize(text)

# check word frequency distribution
fdist = FreqDist()
for word in tokens:
    fdist[word.lower()] += 1
fdist

# n-grams are built from the token list, not the raw string
doubles = list(bigrams(tokens))
triples = list(trigrams(tokens))
quads = list(ngrams(tokens, 4))
Step2: Stemming
Normalizes a word into its base or root form by cutting off prefixes or suffixes. E.g. affects, affected, affecting, affection all reduce to the stem affect.
Common Stemmers: Porter Stemmer (lenient) ; Lancaster Stemmer (aggressive) ; Snowball Stemmer (requires language input)
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()   # aggressive stemmer
porter.stem("affecting")         # 'affect'
lancaster.stem("affecting")
Lemmatization
Considers the morphological analysis of the word; needs a detailed dictionary (e.g. WordNet) to link the word back to its lemma.
- Groups together the different inflected forms of a word under a common base form, called the lemma.
- Somewhat similar to stemming, as it maps several words to a common root.
- Output is a proper dictionary word.
E.g. gone, going, and went all map to go.
from nltk.stem import WordNetLemmatizer   # requires: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('went', pos='v')     # 'go'
Stop Words
Common sentence-forming words that carry little meaning and are usually removed before processing (e.g. I, me, we, our, he, him).
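A minimal sketch of stop-word removal with NLTK's English stop-word list (the sample sentence is just an illustration):
from nltk.corpus import stopwords          # requires: nltk.download('stopwords')
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("He told his father about the theft")
filtered = [w for w in tokens if w.lower() not in stop_words]
filtered   # ['told', 'father', 'theft']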
Parts of Speech (POS)
Labeling each word in a sentence with its part of speech (e.g. noun, verb, adverb, adjective).
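A small sketch of POS tagging with NLTK's default tagger (assumes the averaged_perceptron_tagger model has been downloaded):
import nltk   # requires: nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("The fisherman went to the bank")
nltk.pos_tag(tokens)   # list of (word, tag) pairs, e.g. ('went', 'VBD')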
Named Entity Recognition (NER)
Detects and classifies words that refer to named entities (e.g. Movie, Organization, Person, Location). Knowledge Graphs are often used to store and link the recognized entities.
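A rough sketch of NER with NLTK's built-in chunker (assumes the maxent_ne_chunker and words resources are downloaded; the entity labels it produces are approximate):
import nltk   # requires: nltk.download('maxent_ne_chunker') and nltk.download('words')

tagged = nltk.pos_tag(nltk.word_tokenize("Google opened a new office in California"))
tree = nltk.ne_chunk(tagged)   # Tree containing ORGANIZATION / PERSON / GPE subtrees
print(tree)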
Syntax
Set of rules, principles, and processes that govern sentence formation
Syntax Tree: a representation of the syntactic structure of a sentence or string
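A toy sketch of building a syntax tree with an NLTK context-free grammar (the tiny grammar below is invented purely for illustration):
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT N
VP -> V NP
DT -> 'the'
N -> 'fisherman' | 'bank'
V -> 'saw'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the fisherman saw the bank".split()):
    tree.pretty_print()   # prints the syntax tree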
Chunking
Picking up individual pieces of information (words, tokens) and grouping them into bigger units (chunks), such as noun phrases.
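A minimal sketch of noun-phrase chunking with NLTK's RegexpParser (the chunk pattern is a common textbook one, not from these notes):
import nltk

tagged = nltk.pos_tag(nltk.word_tokenize("The little boy told his father"))
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")   # optional determiner, adjectives, then nouns
print(chunker.parse(tagged))   # groups tokens into NP chunks, e.g. (NP The/DT little/JJ boy/NN)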
Topic Modeling
Topic modeling is a machine learning technique that automatically analyzes text data to find clusters of words (topics) across a set of documents. It is unsupervised, so the model is not trained on labeled data beforehand.
Advantages: quick and easy to get started.
Disadvantages: being unsupervised, it is typically less accurate.
Topic Classification:
The supervised counterpart, where documents are assigned to predefined topics; tags must be created beforehand to define those classes.
Methods:
- Latent Semantic Analysis (LSA)
- Latent Dirichlet Allocation (LDA)
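As a concrete illustration of LDA, a minimal sketch using scikit-learn's LatentDirichletAllocation (the library choice and the toy documents are assumptions, not from these notes):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the fisherman went to the river bank to fish",
    "the bank approved the loan and the savings account",
    "google translate performs machine translation of text",
]

# bag-of-words counts, then fit a 2-topic LDA model
vectorizer = CountVectorizer(stop_words='english')
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# show the top words for each discovered topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-4:]]
    print(f"Topic {i}: {top}")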