A 5-Minute Read of Me Trying to Explain Natural Language Processing

Tami
5 min read · Jun 6, 2021
Source: https://aliz.ai/natural-language-processing-a-short-introduction-to-get-you-started/

Though it seems simple to the human brain, natural language isn't as easy for a computer to process as a mathematical function or a data analysis task. Unlike mathematical functions and databases, which are structured, natural language is unstructured: there is no simple formula behind it.

Because of how useful it would be for computers to understand natural language, scientists have been developing technologies to make that possible. This is where Natural Language Processing comes in.

Natural Language Processing is a branch of artificial intelligence that deals with the interaction between computers and humans through natural language. It can translate one language into another, check whether the articles we write (like this one!) have grammatical errors, help people with disabilities by reading text out loud or detecting text, and much more. Recently, natural language processing has even been used to help detect illnesses based on health records and a person's speech. Imagine how amazing the world will be as natural language processing gets more advanced!

However, natural language processing is one of the most difficult problems in computer science. The words and the tone we use, no matter how small the difference, can completely change the meaning of a sentence. And unlike humans, computers don't have feelings to pick up on those subtle differences in tone. They can't understand sarcastic remarks and many other unwritten "rules" of natural language the way most humans can.

Now we know that natural language processing is useful yet difficult, but how does it work? How can an emotionless computer understand something as full of emotion as human language?

First we need to understand the two techniques that natural language processing primarily relies on: Syntactic Analysis and Semantic Analysis. Syntax is the grammatical structure of a text or sentence, while semantics is the meaning it conveys. A sentence is only correct when both its syntax and its semantics are correct: a sentence can be syntactically well-formed yet semantically wrong (the classic example is "Colorless green ideas sleep furiously"), and such a sentence is still considered incorrect.

Syntactic analysis is the process of analyzing natural language against the rules of a formal grammar. The program tries to apply grammatical rules to groups of words and to assign a grammatical structure to the text.

Semantic analysis is the process of understanding the meaning and interpretation of words, signs, and sentence structure, and it lets computers partly understand natural language the way humans do. However, the technology we have today hasn't fully solved the semantic side of natural language processing yet.

Here are several techniques that are often used in natural language processing to make sense of natural language:

  • Bag of Words: counting the number of times each word occurs in a sentence or a text.

The downside of this technique is that an important word such as 'universe' gets the same weight as a stop word such as 'the' or 'a'. One way to solve this problem is to analyze the occurrences of words across all texts to see whether a word is commonly used everywhere or not; this is the idea behind TF-IDF weighting.
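
Here's a minimal sketch of both ideas in Python, assuming scikit-learn is installed: CountVectorizer builds the plain bag of words, while TfidfVectorizer down-weights words that show up in almost every text.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    "the universe is expanding",
    "the cat sat on the mat",
    "the universe is full of stars",
]

# Plain bag of words: every word is just a raw count, so 'the' can easily
# outweigh an important word like 'universe'.
bow = CountVectorizer()
counts = bow.fit_transform(texts)
print(bow.get_feature_names_out())
print(counts.toarray())

# TF-IDF: words that occur in most of the texts (like 'the') get a lower
# weight, while rarer words get a higher one.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(texts)
print(weights.toarray().round(2))
```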

  • Tokenization: breaking down a text into sentences and words.

This process usually also removes characters like punctuation marks, commas, and question marks, and removing them can sometimes change the meaning. Another problem is that, because tokens are usually separated by blank spaces, a name that consists of two or more words (e.g. New York or San Francisco) may be split into separate tokens and lose its meaning.
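
A deliberately naive tokenizer in plain Python shows both problems at once (the regex and the example sentence are just for illustration):

```python
import re

text = "I love New York. Do you?"

# Split into "word" tokens and drop everything else.
tokens = re.findall(r"[A-Za-z']+", text)
print(tokens)   # ['I', 'love', 'New', 'York', 'Do', 'you']

# Two side effects mentioned above:
#  - the '.' and '?' are gone, which can change the meaning,
#  - 'New York' has been split into two unrelated tokens.
```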

  • Stop Words Removal: removing common words from the text.

Words like articles, pronouns, and prepositions add little to no meaning to a sentence because of how common they are, so removing them shouldn't change the meaning much and helps save space in the database.
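
A tiny sketch of the idea; the stop word set here is hand-written for illustration, while real pipelines usually rely on curated lists such as the ones shipped with NLTK or spaCy.

```python
# A toy stop word list; curated lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "is", "are", "to", "it"}

tokens = ["the", "universe", "is", "expanding", "in", "every", "direction"]

content_words = [t for t in tokens if t.lower() not in STOP_WORDS]
print(content_words)   # ['universe', 'expanding', 'every', 'direction']
```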

  • Lemmatization: replacing derived words with their base form.

For example, in English, 'go', 'went', 'gone', and 'going' all come from the same base word 'go'. So whenever one of those four words is found in a sentence, the program will transform it into the base form 'go'.
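
A short sketch with NLTK's WordNet lemmatizer, assuming nltk is installed and its WordNet data has been downloaded; the pos='v' hint tells it to treat each word as a verb.

```python
import nltk
nltk.download("wordnet", quiet=True)   # WordNet data, only needed once
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

for word in ["go", "went", "gone", "going"]:
    # pos="v" = treat the word as a verb when looking up its base form
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
# All four forms map back to the base word 'go'.
```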

  • Stemming: replacing words with their root form.

The difference between stemming and lemmatization is that stemming simply removes the last few characters of a word without considering the context in which it's used, while lemmatization takes the context into account.

For example, lemmatization can convert the word 'stripes' to either 'stripe' or 'strip' depending on the context in which the word is used, while stemming will always convert it to 'strip', regardless of where it is being used.
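
A small side-by-side sketch with NLTK (same assumption as above about the WordNet data): the stemmer chops suffixes by rule with no dictionary, while the lemmatizer returns a real base word.

```python
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running"]:
    print(word,
          "| stem:", stemmer.stem(word),                   # rule-based suffix chopping
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))  # dictionary-backed base form
# 'studies' typically stems to 'studi' but lemmatizes to the real word 'study'.
```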

  • Morphological segmentation: dividing words into units called morphemes.

For example, the word 'unbreakable' can be broken down into three morphemes: 'un', 'break', and 'able'. The meanings of the individual morphemes can then be combined to get the meaning of the word 'unbreakable' itself.
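
A toy morpheme splitter in plain Python; the prefix and suffix lists are hand-written just to illustrate the idea, while real morphological analyzers use much richer rules or learned models.

```python
# Hand-written affix lists, for illustration only.
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["able", "ness", "ing"]

def segment(word):
    morphemes = []
    for prefix in PREFIXES:
        if word.startswith(prefix):
            morphemes.append(prefix)
            word = word[len(prefix):]
            break
    suffix = None
    for candidate in SUFFIXES:
        if word.endswith(candidate):
            suffix = candidate
            word = word[:-len(candidate)]
            break
    morphemes.append(word)        # whatever is left is treated as the root
    if suffix:
        morphemes.append(suffix)
    return morphemes

print(segment("unbreakable"))     # ['un', 'break', 'able']
```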

  • Word segmentation: dividing a text into smaller units, in this case words.
  • Part-of-speech tagging: giving each word a tag that identifies which part of speech it is (e.g. noun, verb, adjective, etc.).
  • Parsing: analyzing the text according to the grammatical rules of the language.
  • Sentence breaking: deciding where a sentence starts and ends. (These last four steps appear together in the short sketch right after this list.)
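
Here is that sketch, using spaCy and assuming it is installed and the small English model has been downloaded with `python -m spacy download en_core_web_sm`.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The quick brown fox jumps over the lazy dog. It never looks back.")

# Sentence breaking: decide where each sentence starts and ends.
for sent in doc.sents:
    print(sent.text)

# Word segmentation, part-of-speech tagging, and parsing: each token gets a
# part-of-speech tag and a grammatical relation to its head word in the
# dependency tree.
for token in doc:
    print(f"{token.text:>6}  pos={token.pos_:<5}  dep={token.dep_:<6}  head={token.head.text}")
```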
