With the arrival of OpenAI’s GPT-3, natural language understanding (NLU) has made another big leap in producing human-like writing. Suppose you look into the weaknesses of GPT-3 and read that

“Limited memory, repetition/divergence, BPE encoding. GPT-3 is, of course, not perfect”.

You think the main arguments on Gwern Branwen’s website are quite good. But a few seconds later you realize that GPT-3 itself had fooled you. You may think: that’s impressive, and we have come a long way. And indeed, the pace of development in NLU is breathtaking. The fierce competition started in 2018. Early that year, ELMo (Embeddings from Language Models) showed excellent performance through the use of forward and backward LSTMs. After the impressive steps forward taken by BERT (Bidirectional Encoder Representations from Transformers) in late 2018, almost all researchers adopted this algorithm and created their own optimizations and derivatives of it. It was not until GPT-2 in 2019 that a competitor caught up and was on par.

BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional.
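
One way to picture the difference between the two transformer models is the attention mask, which decides which positions a token may look at. The following is a minimal sketch in plain Python, not code from BERT or GPT; the function names are made up for illustration. ELMo does not use attention at all, it combines a forward and a backward LSTM, so it is not covered by this sketch.

```python
from typing import List

Mask = List[List[int]]  # mask[i][j] == 1 means token i may attend to token j


def causal_mask(n: int) -> Mask:
    """Unidirectional (GPT-style): token i sees only positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> Mask:
    """Bidirectional (BERT-style): every token sees every position."""
    return [[1] * n for _ in range(n)]


if __name__ == "__main__":
    for row in causal_mask(4):
        print(row)
```

In the unidirectional case the upper triangle is zero, so a token never sees the words that follow it; in the bidirectional case the context on both sides is available at every layer.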

However, despite all this excitement, we need to focus on the still widespread shortcomings and failures of NLU. Here we take a look at the most pressing pain points of today’s NLU developments and the challenges to address. A model that extracts meaning from human utterances always consists of an algorithm, the selected training data, and a corpus. After training, an NLU model creates a hypothesis, with or without a premise/context, that is processed further down the chain to answer questions, create summaries, or handle the extracted information. Throughout this process, a lot of challenges emerge for research and commercial applications such as chatbots, robotics, or other ranking and dialog efforts.
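
To make the terms premise and hypothesis concrete, here is a minimal sketch of how such a pair might be represented. The class name, the helper function, and the `[SEP]` separator are assumptions chosen for illustration; the three labels are the ones used by natural language inference corpora such as SNLI.

```python
from dataclasses import dataclass
from typing import Optional

# Label set used by natural language inference corpora such as SNLI.
LABELS = ("entailment", "neutral", "contradiction")


@dataclass(frozen=True)
class NLIExample:
    premise: Optional[str]  # the context; None if the hypothesis stands alone
    hypothesis: str
    label: str

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def to_model_input(example: NLIExample, sep: str = " [SEP] ") -> str:
    """Join premise and hypothesis into one string (separator is an assumption)."""
    if example.premise is None:
        return example.hypothesis
    return example.premise + sep + example.hypothesis


if __name__ == "__main__":
    ex = NLIExample("A man plays guitar.", "A person makes music.", "entailment")
    print(to_model_input(ex))
```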

Major challenges in understanding meaning

The first problem is adequate benchmarking. With the scale and complexity of today’s models, a good benchmark system is needed that keeps up with current developments and compares the quality of NLU algorithms. Second, there is an immense need for human checking and interaction. The ground truth consists of the utterances of human beings, and currently humans are also the most creative at breaking algorithms and data. A third big challenge is the liveliness of languages. To adapt to changes in language, NLU models must learn continuously. On Twitter, new, never-before-seen hashtags emerge every day. Fourth, we have the inherent problem of heuristics. Algorithms are vulnerable to edge cases and targeted attacks. If you need dependable models for healthcare or robotics systems, you need to guarantee their safety. Fifth, a general deep learning problem also hits NLU hard: the bias in datasets. In particular, the need for authors who annotate data and write hypotheses can create a bias that inhibits quality models. The last point we cover is the tendency of algorithms to learn statistical regularities instead of linguistic priors. Algorithms like BERT are likely to look for certain linguistic patterns in the training data and then read patterns into random events because of the regularities they found.


Diagram of Adversarial human-in-the-loop data collection. Source: Adversarial NLI: A New Benchmark for Natural Language Understanding

With large and complex models, you can no longer find the problems through testing and anecdotal guessing. A more structured approach like a benchmark is needed to compare the different NLU algorithms and models. But how do you create a benchmark that can keep up with the fast pace of NLU progress? A 2019 paper introduced “Adversarial NLI: A New Benchmark for Natural Language Understanding.” In it, researchers from Facebook AI Research and UNC-Chapel Hill describe the current problems of benchmarking. They found that it is easy to fool state-of-the-art models; even untrained annotators can do it.
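
The core idea of adversarial human-in-the-loop collection can be sketched in a few lines: annotators propose examples, and only the ones that fool the current model and pass human verification are kept as new data. This is a simplified illustration, not the paper’s implementation; `keyword_model`, `collect_adversarial_round`, and the `verify` callback are hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str, str]  # (premise, hypothesis, gold label)


def keyword_model(premise: str, hypothesis: str) -> str:
    """Toy stand-in for a trained model: negation means contradiction."""
    words = set(hypothesis.lower().split())
    return "contradiction" if words & {"not", "no", "never"} else "entailment"


def collect_adversarial_round(
    model: Callable[[str, str], str],
    candidates: List[Example],
    verify: Callable[[str, str, str], bool],
) -> List[Example]:
    """Keep the examples the model gets wrong and a human verifier confirms."""
    fooled = []
    for premise, hypothesis, gold in candidates:
        if model(premise, hypothesis) != gold and verify(premise, hypothesis, gold):
            fooled.append((premise, hypothesis, gold))
    return fooled


if __name__ == "__main__":
    written_by_annotators = [
        ("A dog runs.", "An animal moves.", "entailment"),
        ("A dog runs.", "The dog is not sitting still.", "entailment"),
        ("A dog runs.", "The dog is asleep.", "contradiction"),
    ]
    # The verifier here accepts everything; in practice this is a human check.
    hard = collect_adversarial_round(
        keyword_model, written_by_annotators, lambda p, h, g: True
    )
    print(len(hard), "examples fooled the model")
```

The kept examples would then be added to the training data for the next round, so the benchmark gets harder as the models improve.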

Further, they stated the importance of the training data and corpora. As a benchmark accompanies technology development, it hints at the current challenges and research goals. Benchmarks are especially good at comparing emerging zoos of derivatives. For example, the success of BERT resulted in a whole family of specialized variants and optimizations.

Human in the loop

Human labor is expensive and often has processing limitations due to logical fallacies and cognitive biases. So, the technical failures and the quality of new models also depend on human performance and budget.

Moreover, well-trained and high-quality NLU systems require a lot of annotated data, which is mainly created by manual labor. The evaluation of training efforts and the creation of test cases also need human workers. Many existing text corpora are available; however, these may not be enough for the individual application. Wikipedia offers a huge collection of fact-based data. The Stanford Natural Language Inference (SNLI) corpus and the Multi-Genre Natural Language Inference (MultiNLI) corpus are well-balanced, manually labeled corpora with roughly 570 thousand and 433 thousand sentence pairs, respectively, and are good datasets to start with. But current projects require a lot of resources to create transcripts, annotations, and counterexamples for domain-specific datasets.

Liveliness of languages

From a linguistic perspective, change in language is unpredictable. Changes happen incrementally or randomly. The lifespan of words can be extremely short, or they can endure for hundreds of years. With the fast-paced global interchange of information and messages, important new topics like COVID-19 or 5G must be integrated quickly. So, the challenge is to create a system that can change and adopt new meanings quickly. This work may be doable for large enterprises like Amazon, Google, Facebook, or Microsoft, but it is a big challenge for smaller-scale companies. Benchmarks have shown that modifying pre-trained models can lead to losses in robustness and general performance. The dynamics of state-of-the-art models and their behavior are still unpredictable, so live changes can lead to catastrophic results. For example, when a crowd-sourced chatbot is trained while in service, vandalism can destroy it. In 2016, Microsoft’s Twitter bot Tay turned antisemitic after attacks from racists and trolls.
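
A simple way to notice that language has moved away from a model’s training data is to track how many incoming tokens the model has never seen. The sketch below computes a word-level out-of-vocabulary rate; the tokenizer, the function names, and the tiny vocabulary are assumptions for illustration. Models with subword tokenization such as BPE rarely hit truly unknown tokens, so for them this number is only a rough proxy for drift.

```python
import re
from typing import Iterable, List, Set

# Words and hashtags; a deliberately simple tokenizer for this sketch.
TOKEN_RE = re.compile(r"#?\w+")


def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text.lower())


def oov_rate(texts: Iterable[str], vocabulary: Set[str]) -> float:
    """Share of tokens in `texts` that do not occur in `vocabulary`."""
    tokens = [tok for text in texts for tok in tokenize(text)]
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in vocabulary)
    return unknown / len(tokens)


if __name__ == "__main__":
    known = {"the", "new", "network", "is", "fast"}
    todays_posts = ["The new #5G network is fast"]
    print(f"OOV rate: {oov_rate(todays_posts, known):.2f}")
```

If this rate climbs over time, it is a hint that the vocabulary or the model needs updating, with the robustness risks described above.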

Heuristics don’t cover the edge

The problem of heuristics does not concern NLU alone. It is relevant for most existing artificial intelligence algorithms. So, you have to keep the general shortcomings of AI in mind, particularly when you test NLU. Depending on your industry or the specific problem you would like to solve, this can lead to even bigger problems. The famous “Time flies like an arrow. Fruit flies like a banana.” is a humorous saying used by linguists to show syntactic ambiguity, which may seem highly academic. But there may be cases in your project where you run into problems when you get it wrong. It gets worse when you dip into sentiments. Think of using the latest GoEmotions dataset to do sentiment analysis and judging a client’s emotion wrong. These cases may be rare, but when you need a judgment in a high-stakes situation, it’s very problematic.

Bias in datasets

Benchmarks and datasets are the main tools in NLU. So high-quality datasets are required to create high-quality NLU systems. But how do you create neutral datasets? That is a very difficult task. When we look at GPT-3 with its roughly 500 billion tokens, we see that the majority of them come from a common web crawl. This web crawl is filtered by a certain “quality.” And here, bias comes into effect. Who decides what defines “quality,” and on what premises? Even worse is the annotators’ bias. Here a small group of workers creates a high-quality dataset, and these workers also write the accompanying examples. Because their writing style is optimized to get the work done quickly, they produce artifacts. These artifacts are problematic, for example when workers routinely produce a hypothesis by simply negating the original meaning.
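
Such annotation artifacts can often be spotted without looking at the premise at all. The sketch below measures, per label, how many hypotheses contain a negation word; the word list, the function name, and the toy examples are made up for illustration and are not taken from a real corpus.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

NEGATION_WORDS = {"no", "not", "nobody", "never", "nothing"}


def negation_share_by_label(
    examples: Iterable[Tuple[str, str]]
) -> Dict[str, float]:
    """For each label, the fraction of hypotheses containing a negation word."""
    totals: Counter = Counter()
    negated: Counter = Counter()
    for hypothesis, label in examples:
        totals[label] += 1
        words = set(hypothesis.lower().replace(".", "").split())
        if words & NEGATION_WORDS:
            negated[label] += 1
    return {label: negated[label] / totals[label] for label in totals}


if __name__ == "__main__":
    # Toy (hypothesis, label) pairs, invented for this sketch.
    toy = [
        ("Nobody is outside.", "contradiction"),
        ("The man is not eating.", "contradiction"),
        ("A woman is sleeping.", "contradiction"),
        ("A person is outdoors.", "entailment"),
        ("Someone is eating.", "entailment"),
        ("The man is tall.", "neutral"),
    ]
    for label, share in negation_share_by_label(toy).items():
        print(f"{label}: {share:.2f}")
```

If one label shows a much higher share than the others, a model can score well by spotting the negation alone, which is exactly the kind of shortcut a biased dataset rewards.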

Statistical regularities versus linguistic priors

One of the main deficits of state-of-the-art algorithms like BERT is that their performance is very unpredictable. Good performance on one benchmark does not mean good performance on another benchmark or in a real-world scenario. Recent studies showed that models do not generalize consistently. This effect is attributed to the fact that state-of-the-art algorithms do not learn linguistic priors, which are important for infrequent linguistic events; they just learn statistical regularities. Linguistic priors are powerful tools that are much closer to human processing of language, as they copy the inherent fundamental principles of the language in use. The statistical regularities learned by neural networks are more superficial: they detect surface patterns in language. This tendency is common among all deep neural networks. It gives them great abilities to generalize, but it also makes them sensitive to adversarial perturbations.

Real-world implications

Even with larger models arriving every day and algorithms being optimized, the challenges for NLU are still not solved, as these advances don’t yield conceptual understanding. The tasks of summarization, question answering, and information extraction are still not adequately handled. NLU still has big problems with coreference resolution and polysemy. Designating all expressions that refer to the same entity and finding all senses of a given word are still tricky. And the fundamental ambiguity of language will not go away. Further, the representation of larger contexts is still inefficient. And the link between a world model and the language model is not learned uniformly.

Despite these problems, the business world is praising the success of the last two years in NLP and NLU. But the problems need to be addressed. As former IBM researcher David Ferrucci of the Watson team stated:

“Humans don’t even agree on most concepts, that is why you actually need dialogues, to establish a common interpretation.”

And so, building larger and larger models like GPT-3 is a marvelous effort. But whether this approach can solve the problem of understanding is doubtful.