A Look at the Algorithms Behind Natural Language Processing (NLP)

Milan Thapa
5 min read · Oct 21, 2020

Natural language processing (NLP) describes the interaction between human language and computers. Human language is different from what computers understand. Computers understand machine language, or in other words binary code. Computers don’t speak or understand human language unless they are programmed to do so, and that’s where NLP comes into the picture.

How does natural language processing work?

There are two main techniques used in NLP: the first is syntax analysis and the second is semantic analysis. Syntax is the structure or form of expressions, statements, and program units. Syntax analysis assesses how a language’s grammatical rules give rise to meaning. Some of the techniques used in syntax analysis include (see the sketch after this list):

I.) parsing: the grammatical analysis of a sentence.

II.) word segmentation: dividing a large piece of text into word-level units.

III.) sentence breaking: placing sentence boundaries in large texts.

IV.) morphological segmentation: dividing words into their smallest meaningful units (morphemes).

V.) stemming: reducing inflected words to their root forms.
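
To make these steps concrete, here is a minimal sketch using NLTK (one common Python library for this; other toolkits such as spaCy would work just as well). The sample text is invented for illustration, and the punkt tokenizer models are assumed to be available.

```python
# A minimal sketch of sentence breaking, word segmentation, and stemming
# with NLTK. Assumes: pip install nltk.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)  # tokenizer models (newer NLTK versions may also need "punkt_tab")

text = "Computers don't speak human language. NLP helps them process it."

# Sentence breaking: place sentence boundaries in the text
sentences = sent_tokenize(text)

# Word segmentation: divide each sentence into word-level units
tokens = [word_tokenize(s) for s in sentences]

# Stemming: reduce inflected words to their root forms
stemmer = PorterStemmer()
stems = [[stemmer.stem(w) for w in sent] for sent in tokens]

print(sentences)
print(stems)
```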

Semantics is the meaning of those expressions, statements, and program units. NLP applies algorithms to understand the meaning and structure of sentences. Some of the techniques used in semantic analysis include (see the sketch after this list):

I.) word sense disambiguation: deriving the meaning of a word based on its context.

II.) named entity recognition: identifying words that can be categorized into groups such as people, organizations, and places.

III.) natural language generation: using a database or structured data to work out the semantics behind words and produce text.
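
As one example, here is a minimal sketch of named entity recognition with spaCy. The sentence is invented for illustration, and it assumes the small English model has been installed with `python -m spacy download en_core_web_sm`.

```python
# A minimal sketch of named entity recognition (one semantic-analysis technique)
# using spaCy. Assumes: pip install spacy, plus the en_core_web_sm model.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alan Turing worked at the University of Manchester in England.")

# Each recognized entity is assigned a category such as PERSON, ORG, or GPE.
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```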

Also, we can divide the NLP field into two camps:

  1. Linguistics camp
  2. Statistics camp.

The idea of NLP started in the early era of AI. In fact, it came into existence during the time of Alan Turing, who is considered to be the founder of both AI and computing in general.

The challenge was to create a machine that could converse in a way indistinguishable from a human, which is also known as the Turing test.

“ELIZA” was one of the earliest famous AI programs and can be considered an attempt to beat the Turing test. As we know, there were no algorithms at that time that could really understand human language. ELIZA and other chatbot programs of that era were built by manually crafting lots and lots of rules to respond to human conversation. So it can be said that those programs never had the capacity to actually understand natural language; rather, they relied on psychological tricks to fool humans.

So the linguistics camp arose, which can be viewed as the science of how language is constructed. Linguists search for patterns in a language and formulate rules for constructing and interpreting natural language utterances, and models or grammars are then generalized from those rules. (Linguistic rules are also used to parse and recognize artificial languages when building a compiler.) Parsing natural language works in much the same way, except that Context-Free Grammars are too limited for it, so Context-Sensitive Grammars are used instead.
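
To give a feel for the rule-based idea, here is a minimal sketch of grammar-driven parsing with NLTK and a toy context-free grammar. The grammar and sentence are invented for this example; real natural language needs far richer grammars, as noted above.

```python
# A minimal sketch of linguistics-camp parsing: hand-written grammar rules
# are used to recover the structure of a sentence.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | 'I'
VP -> V NP | V PP
PP -> P NP
Det -> 'a' | 'the'
N -> 'walk' | 'dog'
V -> 'go' | 'take'
P -> 'for'
""")

parser = nltk.ChartParser(grammar)
sentence = ["I", "go", "for", "a", "walk"]

# Print every parse tree the toy grammar licenses for the sentence
for tree in parser.parse(sentence):
    tree.pretty_print()
```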

Then, in the 90s, statisticians approached the NLP problem from a different perspective. Essentially all of the linguistic theories were thrown out, and a simple model of language called the “Bag of Words” model was introduced. This model is very simple: it assumes that a sentence is nothing but a bag of words and ignores word order. For example, “I go for walk” and “walk I go for” are indistinguishable under this model, even though one of the two sentences has a much higher probability. When using this model, there is no need for meanings; it assumes that whenever it sees these four words, the text likely has a similar meaning.
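
A tiny illustration of the idea in plain Python (the sentences are just the example above):

```python
# A minimal sketch of the Bag of Words model: word order is discarded,
# only counts remain, so reordered sentences map to the same representation.
from collections import Counter

bag_a = Counter("I go for walk".lower().split())
bag_b = Counter("walk I go for".lower().split())

print(bag_a)            # Counter({'i': 1, 'go': 1, 'for': 1, 'walk': 1})
print(bag_a == bag_b)   # True — the model cannot tell the two apart
```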

Why would anyone want to use the “Bag of Words” model when there are sophisticated linguistic models? What advantages does the statistics camp provide?

The statistics camp wants to avoid the manual programming of rules and instead interpret language automatically, in a supervised fashion, by feeding in large amounts of labelled data and learning patterns from it.
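
Here is a minimal sketch of that supervised, statistical style using scikit-learn: bag-of-words features feeding a simple probabilistic classifier. The tiny labelled dataset is invented purely for illustration; in practice you would need far more data.

```python
# Learn patterns from labelled examples instead of hand-written rules.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["great movie, loved it", "terrible plot, boring",
         "what a wonderful film", "awful acting, waste of time"]
labels = ["positive", "negative", "positive", "negative"]

# Bag-of-words features -> Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["boring and awful", "wonderful and great"]))
```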

Let’s talk about some of the existing algorithms:

  • Algorithms can be as simple as the Vector Space Model, where text is represented as vectors and information is obtained through vector operations. Embeddings are one such use case.
  • Inference-driven algorithms, such as frequent itemset mining, where you look into a text corpus and try to infer what would come next.
  • Relevance ranking algorithms used in search engines, such as TF-IDF, BM25, PageRank, etc. (see the sketch after this list).
  • Algorithms used to extract meaning from texts, such as Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA).
  • Algorithms that try to derive the sentiment, context, and subject of written text. Sentiment analysis, for example, is very popular; it tries to associate a sentiment value even with previously unseen words.
  • Also, in recent times there are deep learning models/algorithms that use statistical methods to process tokens with multilayer artificial neural networks (ANNs).
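
As an example of relevance ranking, here is a minimal TF-IDF sketch with scikit-learn. The documents and query are invented purely for illustration.

```python
# Rank documents against a query using TF-IDF vectors and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["natural language processing with statistics",
        "deep learning for image recognition",
        "statistical methods for language modelling"]
query = ["statistical language processing"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # documents -> TF-IDF vectors
query_vector = vectorizer.transform(query)     # query in the same vector space

# Higher cosine similarity = more relevant to the query
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```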

As we can see, there is no single type of algorithm for NLP; various approaches to NLP and information retrieval can be combined depending on the task.

Coreference resolution:

Adam stabbed Bob, and he bled to death!

It’s a huge problem in NLP to determine whether “he” in the above sentence refers to Adam or Bob.

It is a very well-studied problem in NLP and also has a fancy name: “Coreference Resolution”. In linguistics, coreference, sometimes written co-reference, occurs when two or more expressions in a text refer to the same person or thing; they have the same referent, as in the sentence above.

Back in 2001, a machine learning approach was proposed (paper). The proposed classifier was a decision tree, which classifies a given candidate pair of words as either “Coreferential” (meaning they refer to the same thing) or “Not Coreferential”. The following features were used for each candidate pair (a sketch of this setup follows the list):

  • Distance: computed as the number of sentences between the two words (the greater the distance, the less likely the words are coreferential).
  • Pronoun: whether both candidate words are pronouns, only one of them is, or neither.
  • String Match: the overlap between the two words (“Prime Minister XXX” and “The Prime Minister” can be considered coreferential).
  • Number Agreement: whether both candidate words are singular, both plural, or neither.
  • Semantic Class Agreement: whether the candidate pair of words belong to the same semantic class, if any (“Person”, “Organization”, etc.).
  • Gender Agreement: whether the candidate pair of words are of the same gender, if any (“Male”, “Female”, “Neither”).
  • Appositive: whether the candidate pair of words are appositives (say, if a sentence starts with “The Nepali President, XXX said…”, then “President” and “XXX” are appositives and are probably coreferential).
  • …and a few more similar features.
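
To make the setup concrete, here is a minimal, illustrative sketch of training such a pair classifier with scikit-learn. The feature vectors and labels below are invented for this example and are not from the paper, which extracted these features from annotated corpora.

```python
# A toy version of the decision-tree coreference classifier described above.
from sklearn.tree import DecisionTreeClassifier

# Each row: [distance, both_pronouns, string_match, number_agreement,
#            semantic_class_agreement, gender_agreement, appositive]
X = [
    [0, 0, 1, 1, 1, 1, 1],   # same sentence, strong string overlap
    [1, 1, 0, 1, 1, 1, 0],   # nearby pronoun, agrees in number and gender
    [5, 0, 0, 0, 0, 0, 0],   # far apart, no agreement
    [3, 0, 0, 1, 0, 0, 0],   # fairly far, weak agreement
]
y = [1, 1, 0, 0]             # 1 = Coreferential, 0 = Not Coreferential

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Classify a new candidate pair, e.g. ("Adam", "he") from the example sentence
candidate_pair = [[0, 0, 0, 1, 1, 1, 0]]
print(clf.predict(candidate_pair))   # [1] -> predicted coreferential
```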
