
Alternative distributional semantics approach

Resolving ambiguity is not ambiguous anymore

Nesrine Sfar
Towards Data Science
11 min read · Apr 21, 2021


If you landed here, it means that you’re curious enough to learn more about the different ways to resolve ambiguity in NLP/NLU.

For machines, ambiguity arises from the human natural language used in communication. The process of “translating” this language into an artificial language that machines can fully process can itself produce ambiguity, because human language is inherently informal and ambiguous.

Traditional distributional semantics approaches rely on word vectorization to address semantics. The alternative shown here is based on a knowledge graph queried with straightforward requests to resolve lexical ambiguity.

This tutorial will walk through some ambiguity resolution tasks that we can use to solve this problem, via an easy and convenient tool: a Natural Language API (NL API).

Photo by Paweł Czerwiński on Unsplash

Natively, machines can neither interpret nor understand text; to do so, and to resolve language ambiguity, they need the text to be annotated through multi-level linguistic analysis. The process that handles ambiguity is called “disambiguation”. It helps the machine detect meaning (semantics) in a text, where the meaning is determined by considering context, syntax and the relations between words.

The following article will present different approaches that help machines reduce ambiguity and improve text comprehension, such as lemmatization, PoS tagging and so on.

The work is based on a Natural Language API: the expert.ai NL API.

Expert.ai NL API?

The expert.ai Natural Language API is an application capable of providing multiple levels of information about a text through a few lines of code. The API provides deep language understanding for building NLP modules. It offers a set of features that perform deep linguistic analysis (tokenization, lemmatization, PoS tagging, morphological/syntactic/semantic analysis). On top of that, the library makes it possible to solve problems such as Named Entity Recognition (NER), extraction of semantic relationships between these entities, and sentiment analysis. Document classification is also available through a ready-to-use taxonomy.

1/ How to use the expert.ai NL API for Python?

Installation of the library

First things first, you need to install the client library using this command:

pip install expertai-nlapi

The API is available once you have created your credentials on the developer.expert.ai portal. The Python client code expects your developer account credentials to be specified as environment variables:

  • Linux:

export EAI_USERNAME=YOUR_USER
export EAI_PASSWORD=YOUR_PASSWORD

  • Windows:

SET EAI_USERNAME=YOUR_USER
SET EAI_PASSWORD=YOUR_PASSWORD

YOUR_USER is the email address you specified during registration.
You can also define credentials inside your code:
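For instance, a minimal sketch (assuming, as in the expert.ai quick-start, that the client reads the same two environment variables):

import os

# Set the developer-portal credentials before the client is instantiated
os.environ["EAI_USERNAME"] = "YOUR_USER"
os.environ["EAI_PASSWORD"] = "YOUR_PASSWORD"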

2/ Deep linguistic analysis

Linguistics separates the analysis of language into different branches; all of them are interdependent, since everything is linked in language.
The document is processed with multi-level text analysis: each text is split into sentences, which are parsed into tokens, lemmas and parts of speech; the relations between syntactic constituents and predicates are identified; and the syntax is interpreted to build a full dependency tree.

To retrieve these pieces of information, you start by importing the client section of the library:
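A minimal sketch, assuming the module path used in the expert.ai documentation:

from expertai.nlapi.cloud.client import ExpertAiClient

# Instantiate the client; it picks up the credentials defined above
client = ExpertAiClient()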

Let’s take an example to illustrate these operations:

“Sophia is a social humanoid robot developed by Hong Kong-based company Hanson Robotics. Sophia was activated on February 14, 2016.”

a/ Text Subdivision

This operation divides the text from the longest form to the smallest: in this case, starting from the paragraph level, going through sentences and phrases, down to the token level. When a token is a collocation (compound word), text subdivision can go deeper in the analysis, down to the atom level, which cannot be further divided.

Once you have imported the library and instantiated the client, you should set the language of the text and the parameters of the API:
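A sketch of the request, assuming the client’s specific_resource_analysis method as shown in the expert.ai examples:

text = "Sophia is a social humanoid robot developed by Hong Kong-based company Hanson Robotics. Sophia was activated on February 14, 2016."
language = 'en'

# Request a full disambiguation analysis of the text
output = client.specific_resource_analysis(
    body={"document": {"text": text}},
    params={'language': language, 'resource': 'disambiguation'})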

Inside the API request, you pass the text to analyze inside the body and the language inside the params. The resource parameter corresponds to the operation you need to perform on your text: here it is disambiguation, which is based on the multi-level text analysis provided by the expert.ai NL API.

This multi-level text analysis is generally broken down into three stages:
1. Lexical analysis: a text subdivision phase that breaks the text down into elementary entities (tokens).
2. Syntactic analysis: the recognition of combinations of lexemes forming syntactic entities (including PoS tagging).
3. Semantic analysis: disambiguation occurs at this level; it detects the meanings of these entities according to the communicative context and the possible relationships between them.

The lexical analysis starts with this first subdivision:
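For instance (assuming paragraph objects expose start/end character offsets into the original text):

for paragraph in output.paragraphs:
    # Print the slice of the text covered by each paragraph
    print("Paragraphs:", text[paragraph.start:paragraph.end])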

Paragraphs: Sophia is a social humanoid robot developed by Hong Kong-based company Hanson Robotics. Sophia was activated on February 14, 2016.

Since our text is already a paragraph (two sentences here), the output of the subdivision is the same as the input. Let’s break the paragraph down to the sentence level; we just need to change the element .paragraphs to .sentences. The most common way of delimiting a sentence is the dot (.):

Sentences: Sophia is a social humanoid robot developed by Hong Kong-based company Hanson Robotics.
Sentences: Sophia was activated on February 14, 2016.

We have indeed two sentences as a result. Let’s go deeper into the subdivision to retrieve the phrase level, following the same procedure as above and replacing the element .sentences with .phrases:

Phrases: Sophia
Phrases: is a social humanoid robot
Phrases: developed
Phrases: by Hong Kong-based company Hanson Robotics
Phrases: .
Phrases: Sophia
Phrases: was activated
Phrases: on February 14, 2016
Phrases: .

We notice that the deeper we go into the subdivision, the more elements we get in the result. We can also get the number of phrases inside our text:
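For instance (assuming output.phrases behaves like a Python list):

print("phrases array size: ", len(output.phrases))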

phrases array size:  10

b/ Tokenization

Furthermore, we can break the phrase level down into smaller units: the tokens. This is the “tokenization” task, which is very common in NLP and helps the machine understand the text. A basic tokenization can be performed in plain Python with the .split() function, as shown below:

For example, consider this sentence:

“CNBC has commented on the robot’s lifelike skin and her ability to emulate more than 60 facial expressions.”
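A plain-Python sketch, no API involved:

sentence = "CNBC has commented on the robot's lifelike skin and her ability to emulate more than 60 facial expressions."

# split() with no argument separates on whitespace
print("These are the tokens of the sentence", sentence.split())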

These are the tokens of the sentence ['CNBC', 'has', 'commented', 'on', 'the', "robot's", 'lifelike', 'skin', 'and', 'her', 'ability', 'to', 'emulate', 'more', 'than', '60', 'facial', 'expressions.']

Without specifying a delimiter inside split(), the text is separated on whitespace.
With the expert.ai NL API, we can perform the tokenization as well, with additional features. In other words, the API provides different word-level token analyses; the tokens returned by the API can be words, characters (like contractions) and even punctuation marks.
Let’s see how to perform this task with the API, following the same procedure as above and replacing the element .phrases with .tokens:
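A sketch of the call and the loop (same request pattern and start/end offsets as assumed above):

output = client.specific_resource_analysis(
    body={"document": {"text": sentence}},
    params={'language': 'en', 'resource': 'disambiguation'})

print('TOKEN')
print('----')
for token in output.tokens:
    print(sentence[token.start:token.end])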

TOKEN                
----
CNBC
has
commented
on
the
robot
's
lifelike
skin
and
her
ability
to
emulate
more
than
60
facial expressions
.

We notice that the tokens are either words like skin, ability and emulate, contractions such as ’s, numbers like 60, or even punctuation like the dot (.).

The tokenization yields collocations as well, like facial expressions, which is impossible with the split() function. The API is capable of detecting compound words inside the sentence, according to the positions of the words and the context. Such a collocation can be further divided, down to the atom level, the smallest lexical unit we can get:
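A sketch that drills down into the atoms (assuming each token exposes an atoms list, non-empty for compound tokens, with the same start/end offsets):

for token in output.tokens:
    print(sentence[token.start:token.end])
    # Collocations are split into their atoms
    for atom in token.atoms:
        print("atom:", sentence[atom.start:atom.end])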

CNBC                
has
commented
on
the
robot
's
lifelike
skin
and
her
ability
to
emulate
more
than
60
facial expressions
atom: facial
atom: expressions
.

c/ PoS Tagging

Tokenization leads to the second process in NLP: PoS tagging (part-of-speech tagging). The two work together to allow the machine to detect the meaning of the text. At this stage we introduce the syntactic analysis, which includes the PoS tagging task. The latter consists of assigning a PoS, or grammatical class, to each token. The PoS characterizes the morpho-syntactic nature of each token, and these labels attributed to the textual elements can reflect part of the meaning of the text. A few parts of speech commonly used in the English language: DETERMINER, NOUN, ADVERB, VERB, ADJECTIVE, PREPOSITION, CONJUNCTION, PRONOUN, INTERJECTION.

A word that shares its form with other words (a homograph) can have different meanings (polysemy). Such a word can also have a different PoS even though its form is the same; the grammatical class depends on the position of the word in the text and on its context. Let’s consider these two sentences:

The object of this exercise is to raise money for the charity.
A lot of people will object to the book.

From a linguistic point of view, in the first sentence object is a noun, whilst in the second one object is a verb. PoS tagging is a crucial step towards disambiguation. Based on this tagging, the meaning of the words is inferred from the context, from the form of the word (for instance, a capital letter at the beginning of a proper noun), from the position (SVO word order), etc. Consequently, semantic relationships are produced between the words, linking each concept to the others depending on the type of relationship and building together a knowledge graph.
Let’s try to use our API to generate the PoS tagging for the previous two sentences:

We start by importing the library and creating the client as below:
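The same client setup as before (module path assumed from the expert.ai documentation):

from expertai.nlapi.cloud.client import ExpertAiClient

client = ExpertAiClient()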

We have to declare the variables related to each sentence: object_noun for the sentence where the word object is a noun, and object_verb for the sentence with the verb:
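For instance:

object_noun = "The object of this exercise is to raise money for the charity."
object_verb = "A lot of people will object to the book."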

The word object has the same form in both sentences but a different PoS. In order to demonstrate it with the expert.ai NL API, we need to call the API.

At the beginning, we specify the text on which we want to perform the PoS tagging: for the first sentence it’s object_noun, for the second it’s object_verb. Then come the language of the examples and, at the end, the resource, which corresponds to the analysis performed, in this case disambiguation, like the following:
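A sketch of the two requests, reusing the specific_resource_analysis pattern assumed above:

output_noun = client.specific_resource_analysis(
    body={"document": {"text": object_noun}},
    params={'language': 'en', 'resource': 'disambiguation'})

output_verb = client.specific_resource_analysis(
    body={"document": {"text": object_verb}},
    params={'language': 'en', 'resource': 'disambiguation'})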

Once we have set these parameters, we iterate over the tokens to print the PoS assigned to each one, respectively for each example:
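A sketch of the iteration (assuming each token exposes a pos attribute with its Universal Dependencies label):

for text, output in ((object_noun, output_noun), (object_verb, output_verb)):
    print('TOKEN', 'POS')
    for token in output.tokens:
        print(text[token.start:token.end], token.pos)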

Output of the first sentence:

TOKEN POS
The DET
object NOUN
of ADP
this DET
exercise NOUN
is VERB
to PART
raise VERB
money NOUN
for ADP
the DET
charity NOUN
. PUNCT
Output of the second sentence:

TOKEN POS
A lot of ADJ
people NOUN
will AUX
object VERB
to ADP
the DET
book NOUN
. PUNCT

In the first sentence, object is indeed a NOUN, preceded by the article/determiner (DET) The. In the second sentence, the word object is in fact the VERB, linking the subject “a lot of people” to its object “the book”.

Traditional PoS tagging tools used in NLP usually work with the same types of information to label a word in a text: its context and its morphology. The distinctive feature of PoS tagging within the expert.ai NL API is that it not only identifies a grammatical label for each token, but also introduces the meaning.

In other words, one word can share the same form with other words while carrying several meanings (polysemy). Each meaning is conveyed in a concept, linked to other concepts, together forming a knowledge graph. The word object seen above has more than one meaning; hence it belongs to different semantic concepts, which are called “syncons” within the knowledge graph of the expert.ai NL API. PoS tagging can reveal different labels for the same word, and thus different meanings. That is what we can examine with the API:
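A sketch of the same loop extended with the concept ID (assuming each token exposes a syncon attribute, set to -1 when no concept applies):

for text, output in ((object_noun, output_noun), (object_verb, output_verb)):
    print('TOKEN', 'POS', 'ID')
    for token in output.tokens:
        print(text[token.start:token.end], token.pos, token.syncon)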

Concept_ID for object when NOUN

TOKEN POS ID
The DET -1
object NOUN 26946
of ADP -1
this DET -1
exercise NOUN 32738
is VERB 64155
to PART -1
raise VERB 63426
money NOUN 54994
for ADP -1
the DET -1
charity NOUN 4217
. PUNCT -1
Concept_ID for object when VERB

TOKEN POS ID
A lot of ADJ 83474
people NOUN 35459
will AUX -1
object VERB 65789
to ADP -1
the DET -1
book NOUN 13210
. PUNCT -1

As can be noted, the NOUN object belongs to the concept with the ID 26946. This concept includes other words with the same meaning (synonyms). By contrast, its homograph in the second sentence is related to the ID 65789. These IDs identify each concept inside the knowledge graph.

Therefore, a different POS leads to a different meaning, even though we have the same morphology of the word.

Please notice that the words having -1 as an ID, such as ADP (adposition, covering prepositions and postpositions), PUNCT (punctuation), DET (determiner) and so on, are not available in the knowledge graph because they do not carry inherent semantics.

d/ Lemmatization

Here is another core task in Natural Language Processing: lemmatization. It is an important step, along with tokenization and PoS tagging, for information extraction and text normalization. Particularly useful for opinion mining and emotion detection, lemmas let the major semantic trends in a document emerge.

Lemmatization is a linguistic process that groups certain tokens together. In a nutshell, it associates each token with the canonical form that represents it in a dictionary:

  • The infinitive for VERBS: wore, worn -> wear / ran, running, runs -> run
  • the singular form for NOUNS: mice -> mouse / dice -> die
  • etc.

A concept (or syncon) can contain many lemmas (lexemes). During the disambiguation process, each token identified in the text is reduced to its base form by removing inflectional affixes. Each lemma is associated with a concept in the knowledge graph. Therefore, lemmatization reduces the set of distinct tokens to the set of distinct lemmas. This can be explained through the following example.

A hearer of the lexeme “living” can discern, almost unconsciously, what the word means; humans achieve this by making inferences based on their knowledge of the world. This is nearly impossible for machines if the context is not available.

For the machine to distinguish the several meanings that a word with the same spelling and the same sound can carry, lemmatization is a key step in handling this lexical ambiguity.

We can perform this task with the expert.ai NL API. Let’s consider these two examples:

She’s living her best life.
What do you do for a living?
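A sketch of the calls (assuming each token also exposes a lemma attribute):

for text in ("She's living her best life.", "What do you do for a living?"):
    output = client.specific_resource_analysis(
        body={"document": {"text": text}},
        params={'language': 'en', 'resource': 'disambiguation'})
    print('TOKEN', 'LEMMA', 'POS')
    for token in output.tokens:
        print(text[token.start:token.end], token.lemma, token.pos)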

Output of the first sentence:

TOKEN LEMMA POS
She she PRON
's 's AUX
living live VERB
her her PRON
best good ADJ
life life NOUN
Output of the second sentence:

TOKEN LEMMA POS
What what PRON
do do AUX
you you PRON
do do VERB
for for ADP
a a DET
living living NOUN
? ? PUNCT

As stated above, “living” corresponds to two different lemmas, depending on the context and its position within the sentence. In the first example, living corresponds to the lemma “live”, which is the VERB of the sentence. By contrast, living in the second sentence is a NOUN and has the lemma “living”. The meaning differs as well: the first lemma describes the concept of “remaining alive”, whereas living as a noun belongs to the concept of “an income or the means of earning it”.

Consequently, lemmatization helps the machine deduce the meaning of a homographic word.

Conclusion

One expression or word can have more than one meaning, which creates a language comprehension problem for machines. Thanks to very basic NLP tasks like lemmatization and PoS tagging, and a few lines of code, we can resolve this ambiguity, and that is what I aimed to share in this article.

Hoping that resolving ambiguity is less ambiguous now…


With a double degree in Linguistics and NLP, I’m very keen on languages and passionate about NLP/NLU.