New algorithms and technologies for understanding written text emerge at a rapid pace. We live in a world of big data and text analytics with many business and consumer applications. Most of the data being processed is unstructured. Its sources are email, written reports, legal documents, and social media and messenger posts.
This established industry offers a huge range of products. Big enterprises like Microsoft and IBM offer their own solutions, and there are also specialized text analytics vendors. Voyant Tools offers text analysis tools for websites, SAS Text Miner automatically generates topics and rules for online content, and WordStat for QDA Miner provides content analysis, text analysis, and sentiment analysis.
But all of them still rely on many traditional text mining algorithms to get meaning from unstructured text documents. Sadly, these solutions are far from ideal.
To better understand the shortcomings of the algorithms used for natural language understanding (NLU), we take a detailed look at the problems of old-school approaches at each step of natural language processing. This shows why an upper ontology is needed to grasp the true meaning of a text. Upper ontologies offer universal, general categories for semantic interoperability. They define the concepts that are essential for understanding meaning.
When we look at the steps a traditional algorithm follows, we can identify seven different tasks. Together, these tasks form a toolchain for extracting meaning. The text is parsed sentence by sentence. The first step is language identification, followed by tokenization and sentence breaking. The fourth step is part-of-speech tagging, followed by chunking and syntax parsing, and the toolchain concludes with sentence chaining (lexical chaining).
The main reason these algorithms struggle so much to get the meaning of human language right is ambiguity. The five hundred most common words of the English language have, on average, 23 different meanings per word. For example, according to the Merriam-Webster dictionary, the word “round” has 55 clearly different meanings.
This lexical ambiguity is only one part of the total ambiguity faced. The second main kind is syntactic ambiguity. Examples of this are “Mary ate a salad with spinach from California for lunch on Tuesday.” or “The chicken is ready to eat.” We take one classic example to show the problems of the traditional NLU algorithms: the wonderful joke by Groucho Marx, “One morning I shot an elephant in my pajamas. How he got into my pajamas I’ll never know.”
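The syntactic ambiguity in the joke can be made concrete by counting parse trees. The following is a minimal sketch, not taken from any of the tools discussed here: a CKY chart parser over a tiny hand-written grammar. The grammar rules and the names BINARY_RULES, LEXICON, and count_parses are our own assumptions for illustration, and “One morning” is left out to keep the grammar small.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (hand-written, for illustration only).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # "an elephant in my pajamas"
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # "shot ... in my pajamas"
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": ["NP"],
    "shot": ["V"],
    "an": ["Det"],
    "my": ["Det"],
    "elephant": ["N"],
    "pajamas": ["N"],
    "in": ["P"],
}


def count_parses(words, start="S"):
    """Count the distinct parse trees for `words` with the CKY algorithm."""
    n = len(words)
    # chart[i][j][symbol] = number of trees for words[i:j] rooted in symbol
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for symbol in LEXICON.get(word, []):
            chart[i][i + 1][symbol] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # every split point inside the span
                for parent, left, right in BINARY_RULES:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][start]


print(count_parses("I shot an elephant in my pajamas".split()))
```

Under this toy grammar the sentence has two parses: one where “in my pajamas” attaches to the verb phrase (the shooter wears the pajamas) and one where it attaches to “an elephant”. A purely syntactic parser has no basis for preferring either.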
To understand a text, we first have to identify which language it is written in. Suppose we get a document with the sentence “One morning I shot an elephant in my pajamas.” and want to know its language. We need an algorithm that knows all languages, or at least all languages in scope. The traditional language identification algorithm needs to be trained with a lot of language data. As most documents or texts from the internet don’t have an indicative filename or metadata that helps to identify them, the algorithm has only the given sentence as input to make a decision.
Here, n-gram models are used to identify the language. For performance reasons, these models don’t use word-based features but work on character sequences. As a result, even the best algorithms deliver an overall classification accuracy of only 65 to 90 percent. Further, the performance on very short texts is quite poor, and this problem can be regarded as unsolved.
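To illustrate the idea of character n-gram language identification, here is a minimal sketch. It is an assumption-laden toy: the two one-sentence “training” samples, the use of trigrams, and cosine similarity as the score are our own choices, and the names SAMPLES, PROFILES, and identify_language are hypothetical. Real systems are trained on far more data.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, padding with spaces at the edges."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


# Tiny made-up training samples; real profiles need large corpora.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and the cat sleeps in the morning sun",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und die katze schläft in der sonne",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def identify_language(text):
    """Return the best-matching language and all similarity scores."""
    query = char_ngrams(text)
    scores = {lang: cosine(query, profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get), scores


language, scores = identify_language("One morning I shot an elephant in my pajamas.")
print(language, scores)
```

With so little evidence per language, the scores of the two candidates stay close together, which hints at why very short texts are hard for this kind of model.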
After our text’s language is classified as English (but maybe not differentiated between British and American English), a preprocessing step is needed. Tokenization bundles the characters of the text into tokens. These tokens are words, sentences, and other language-specific symbols like punctuation. Our text “One morning I shot an elephant in my pajamas.” is transformed into a sequence of tokenized words: [‘One’, ‘morning’, ‘I’, ‘shot’, ‘an’, ‘elephant’, ‘in’, ‘my’, ‘pajamas’, ‘.’]
The problems that arise with this algorithmic approach are manifold. Take punctuation as an example: the ‘.’ also marks abbreviations, so it’s hard to decide whether the correct tokens have been created. We can see this in the nice example of a “T.V. show”. It could be tokenized as [‘T.V.’, ‘show’] or as [‘T’, ‘.’, ‘V’, ‘.’, ‘show’].
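The two tokenizations of “T.V. show” can be reproduced with regular expressions. This is a minimal sketch using Python’s standard re module; the two patterns and the function names naive_tokenize and abbreviation_aware_tokenize are our own illustrative assumptions, not the rules of any particular tokenizer.

```python
import re

# Words, or any single non-space symbol such as punctuation.
NAIVE = re.compile(r"\w+|[^\w\s]")
# Same, but first try to keep dotted abbreviations like "T.V." together.
ABBREVIATION_AWARE = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")


def naive_tokenize(text):
    return NAIVE.findall(text)


def abbreviation_aware_tokenize(text):
    return ABBREVIATION_AWARE.findall(text)


print(naive_tokenize("One morning I shot an elephant in my pajamas."))
print(naive_tokenize("T.V. show"))               # splits the abbreviation apart
print(abbreviation_aware_tokenize("T.V. show"))  # keeps "T.V." as one token
# The extra rule has its own cost: in "I saw it on T.V." the final period
# is absorbed into the abbreviation, so the sentence end is no longer a token.
print(abbreviation_aware_tokenize("I saw it on T.V."))
```

Each added rule fixes one case and opens another, because the pattern has no knowledge of what the period means in context.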
The next step is closely related to tokenization. In longer texts, the algorithm has to decide where to break the sentences. In our example of the joke, there are two clearly identifiable periods, which seems to make it easy for an algorithm to split the text. But be aware that the algorithm at this stage knows nothing about meaning. In an unstructured text, there may be headlines without punctuation, and the separator could be a line break.
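A naive sentence breaker shows both failure modes. The sketch below is an assumption for illustration, not the algorithm of a specific tool: it splits after ‘.’, ‘!’, or ‘?’ when whitespace follows, and the function name naive_sentences and the example texts beyond the joke are made up.

```python
import re


def naive_sentences(text):
    """Split after sentence-final punctuation that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


joke = ("One morning I shot an elephant in my pajamas. "
        "How he got into my pajamas I'll never know.")
print(naive_sentences(joke))  # two sentences, as expected

# Abbreviations produce false sentence breaks.
print(naive_sentences("Dr. Smith watched a T.V. show. He laughed."))

# A headline without punctuation is merged into the following sentence.
print(naive_sentences("Elephant news\nOne morning I shot an elephant."))
```

The joke splits cleanly, the abbreviation example falls apart into four fragments, and the headline is not separated at all because the line break carries no punctuation.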
Each word of the resulting sentence “One morning I shot an elephant in my pajamas.” is now tagged with its part of speech (POS), such as noun, verb, adjective, or adverb. This is done with rule-based and stochastic methods. An English POS tagger would create a marked-up text like the following:
<W TAG="CD">One</W> <W TAG="NN">morning</W> <W TAG="PRP">I</W>
<W TAG="VBD">shot</W> <W TAG="DT">an</W> <W TAG="NN">elephant</W>
<W TAG="IN">in</W> <W TAG="PRP$">my</W> <W TAG="NNS">pajamas</W>
In this case, the tagging follows the Penn Treebank Project tag set, using the following tags:
CD Cardinal number
NN Noun, singular or mass
PRP Personal pronoun
VBD Verb, past tense
DT Determiner
NNS Noun, plural
IN Preposition or subordinating conjunction
PRP$ Possessive pronoun
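To show how a rule-based tagger resolves an ambiguous word, here is a minimal sketch. Everything in it is a hand-written assumption for this one sentence: the lexicon with its candidate tags, the single context rule, and the function name tag. Real taggers learn such preferences from annotated corpora.

```python
# Candidate Penn Treebank tags per word; the first entry is the default guess.
LEXICON = {
    "one": ["CD", "NN"],
    "morning": ["NN"],
    "i": ["PRP"],
    "shot": ["NN", "VBD"],  # "the shot" (noun) vs. "I shot" (verb)
    "the": ["DT"],
    "an": ["DT"],
    "elephant": ["NN"],
    "in": ["IN"],
    "my": ["PRP$"],
    "pajamas": ["NNS"],
    ".": ["."],
}


def tag(tokens):
    """Tag tokens by lexicon lookup plus one contextual rule."""
    tagged = []
    for token in tokens:
        candidates = LEXICON.get(token.lower(), ["NN"])  # unknown word: guess noun
        choice = candidates[0]
        previous_tag = tagged[-1][1] if tagged else None
        # Context rule: directly after a personal pronoun, prefer a verb reading.
        if previous_tag == "PRP":
            verb_tags = [t for t in candidates if t.startswith("VB")]
            if verb_tags:
                choice = verb_tags[0]
        tagged.append((token, choice))
    return tagged


tokens = ["One", "morning", "I", "shot", "an", "elephant", "in", "my", "pajamas", "."]
print(tag(tokens))
print(tag(["the", "shot"]))  # without a preceding pronoun, "shot" stays a noun
```

The rule works for this sentence, but it encodes a local pattern rather than any understanding of the text, which is why such taggers depend so heavily on matching training data.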
In terms of performance, the literature reports a 97 percent accuracy per token. That seems quite good, but when you look at the details, it only holds when the text matches the training data in domain and epoch. When you look at the accuracy per sentence, the reliability drops to an unsatisfactory 57 percent. This results from the lack of a strong linguistic basis.
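The gap between token accuracy and sentence accuracy follows from simple arithmetic. The sketch below assumes, as a simplification that is not in the cited literature, that tagging errors are independent from token to token; the function name sentence_accuracy and the sentence lengths are our own choices.

```python
def sentence_accuracy(token_accuracy, sentence_length):
    """Chance that every token of a sentence is tagged correctly,
    assuming independent errors (a simplifying assumption)."""
    return token_accuracy ** sentence_length


for length in (10, 20, 30):
    print(f"{length} tokens: {sentence_accuracy(0.97, length):.1%}")
```

For sentences of about 20 tokens, this estimate gives roughly 54 percent, which is in the same range as the sentence accuracy quoted above. Small per-token error rates compound quickly over a whole sentence.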
In the next step, chunking, phrases are extracted from the tagged parts of speech. A syntax tree of relationships is created to show noun phrases (NP), verb phrases (VP), and other phrases. In this way, you can identify locations, person names, and other entities.
When we analyze our sentence with the online version of NLTK (the Natural Language Toolkit for Python), we see that two of five algorithms cannot extract any phrases or named entities. Even worse, the word ‘One’ was designated as a location, and ‘pajamas’ was tagged as a time entity.
The treebank algorithm identified four noun phrases, [‘One morning’, ‘I’, ‘an elephant’, ‘my pajamas’], without getting the verb phrase ‘shot’. This approach works purely on probabilities to find the most likely sentence structure with fast parsing, but it does not follow a linguistic point of view. Grammar rules are evaluated purely on the basis of likelihood.
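The four noun phrases can also be obtained with a very simple rule-based chunker over the POS tags. This is a minimal sketch under our own assumptions: the noun-phrase pattern (optional determiner, possessive, or number, then adjectives, then nouns) and the names chunk_noun_phrases, NP_START, NP_MODIFIER, and NP_HEAD are illustrative and not how the NLTK chunkers work internally.

```python
NP_START = {"DT", "PRP$", "CD"}          # determiner, possessive, or number
NP_MODIFIER = {"JJ"}                     # adjectives
NP_HEAD = {"NN", "NNS", "NNP", "NNPS"}   # nouns


def chunk_noun_phrases(tagged):
    """Group (word, tag) pairs into noun phrases with a fixed tag pattern."""
    chunks, i = [], 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag == "PRP":  # a personal pronoun is a noun phrase on its own
            chunks.append(word)
            i += 1
            continue
        j = i
        if tagged[j][1] in NP_START:
            j += 1
        while j < len(tagged) and tagged[j][1] in NP_MODIFIER:
            j += 1
        head_start = j
        while j < len(tagged) and tagged[j][1] in NP_HEAD:
            j += 1
        if j > head_start:  # at least one noun was found: emit the chunk
            chunks.append(" ".join(w for w, _ in tagged[i:j]))
            i = j
        else:
            i += 1
    return chunks


TAGGED = [("One", "CD"), ("morning", "NN"), ("I", "PRP"), ("shot", "VBD"),
          ("an", "DT"), ("elephant", "NN"), ("in", "IN"),
          ("my", "PRP$"), ("pajamas", "NNS")]
print(chunk_noun_phrases(TAGGED))
```

Like the treebank result above, this pattern-based approach finds the noun phrases but says nothing about how they relate to the verb ‘shot’.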
The compute-intensive step of syntax parsing tries to find the correct meaning of the given phrases and entities. The goal is to get the syntax right as a human would. For this step, there are a lot of different algorithms and models available. As there is no general understanding of language, these algorithms are focused on their specific use cases. When you want to analyze sentiment, there is a specific solution. When you want to let a customer find the right product, there is a specific solution for that.
When we submit our example, “One morning I shot an elephant in my pajamas.”, to the online IBM Watson NLU, it gets some things right, but not everything. Right at the beginning, it wrongly identifies ‘one’ as a noun and a number rather than as an adjective. Watson got the sentiment right as neutral. It made the right analysis for the keywords ‘morning’, ‘elephant’, and ‘pajamas’. But it put the sentence in the category /style and fashion/clothing/pajamas, which shows Watson’s focus on commerce. This focus on selling things is also reflected in the concept identification for ‘shot’, where ‘Mixed drink shooters and drink shots’ was identified with a confidence of 0.86. And this is the result produced with IBM’s massive cloud computing resources.
Lexical chaining is the step where individual sentences of a longer unstructured text are connected for a more in-depth analysis. Concepts found in individual sentences are treated as fragments of a more general concept, and a confidence score is calculated to link these fragments. This approach is often used to create summaries of texts or to see whether two given texts have the same topic.
When we look at our joke, “One morning I shot an elephant in my pajamas. How he got into my pajamas I’ll never know,” a lexical chain processor would easily identify the two occurrences of the word ‘pajamas’ as belonging together. But finding the relationship between ‘elephant’ and ‘he’ is much more difficult. Whether animals are referred to with ‘he, she, it, his, her’ often depends on the emotional bond of the speaker. So, the capability of getting the meaning right depends on a higher understanding of linguistics and concepts.
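A chain built on word repetition makes the difference between the two cases visible. The following is a deliberately simple sketch and an assumption on our part: real lexical chain processors also use synonyms and related concepts, while this one links only identical surface forms. The stopword list and the name repetition_chains are made up for the example.

```python
import re
from collections import defaultdict

# Small hand-picked stopword list (an assumption for this example).
STOPWORDS = {"i", "my", "an", "in", "into", "how"}


def repetition_chains(sentences):
    """Link sentences that share an identical non-stopword."""
    positions = defaultdict(list)
    for index, sentence in enumerate(sentences):
        for word in set(re.findall(r"[a-z']+", sentence.lower())):
            if word not in STOPWORDS:
                positions[word].append(index)
    # Keep only words that occur in more than one sentence.
    return {word: found for word, found in positions.items() if len(found) > 1}


JOKE = [
    "One morning I shot an elephant in my pajamas.",
    "How he got into my pajamas I'll never know.",
]
print(repetition_chains(JOKE))
```

Only ‘pajamas’ forms a chain across both sentences. Nothing connects ‘he’ to ‘elephant’, because resolving that reference requires knowing that an elephant is something ‘he’ can refer to.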
When we look at today’s state-of-the-art lectures on Natural Language Processing with Deep Learning, we see that a good foundation in linguistics is needed to make progress in extracting meaning from text. Classification with neural networks has made great progress in recent years, but as interpreters of meaning, neural networks don’t perform well. This is also reflected in industry solutions. Traditional text analytics tools need to be configured to match a narrowly defined scope when they are set up. And each solution offers different features and capabilities.
Further, all approaches, from stochastic models to machine learning, require huge amounts of computing power and memory. The recent GPT-3 has 175 billion parameters. Training models of this kind takes massive GPU power and millions of U.S. dollars. And still, we have deficits in extracting meaning from a linguistic and upper ontology viewpoint.
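A quick back-of-the-envelope calculation shows what 175 billion parameters mean for memory alone. This sketch covers only the storage of the weights; the byte sizes per parameter (4 for 32-bit and 2 for 16-bit floating point) are standard, but which precision is used in practice is an assumption here, and the function name weights_size_gb is our own.

```python
PARAMETERS = 175_000_000_000  # parameter count stated above


def weights_size_gb(parameters, bytes_per_parameter):
    """Size of the raw model weights in gigabytes (1 GB = 10**9 bytes)."""
    return parameters * bytes_per_parameter / 1e9


print(f"32-bit floats: {weights_size_gb(PARAMETERS, 4):.0f} GB")
print(f"16-bit floats: {weights_size_gb(PARAMETERS, 2):.0f} GB")
```

Even at 16-bit precision, the weights come to 350 GB, more than fits in the memory of any single GPU, and that is before counting the additional memory needed during training.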
Although new capacities from quantum computing are slowly coming to life, we’re hitting a glass ceiling. Amazon is offering quantum computing instances, but this new way of solving problems requires new ideas. Understanding language requires an understanding of basic concepts and general categories, and that is what an upper-level ontology offers.