AI Talk: Super sizing language models

April 29, 2022 / By V. “Juggy” Jagannathan, PhD



There is competition brewing in the world of language models. In the summer of 2020, I blogged about GPT-3, a mega-language model from OpenAI. Last year, in another blog, I explored the research that links the brain to language models. A language model essentially predicts the next word given the left context (the words that come before it), much like the word suggestions your phone offers while you are texting or writing anything. Pretrained language models are quite versatile across a variety of natural language understanding and generation tasks.
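To make "predict the next word given the left context" concrete, here is a toy sketch. It is not how GPT-3 or any large model works internally; it is a simple bigram counter that only looks one word back, and the corpus and function names are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus; a real model is trained on hundreds of billions of tokens.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" only once
```

Large language models replace the lookup table with a neural network and condition on a much longer left context, but the task they are trained on is the same.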

So, what is happening in this corner of the deep learning world? Absolute pandemonium! In the past two years the usual suspects (Google, OpenAI, Microsoft, Nvidia, DeepMind, Meta) have splurged on new large language models – each one bigger than the last and claiming more capabilities. What do I mean by splurged? These models don’t come cheap – they typically cost millions of dollars to train, not to mention the carbon footprint the training entails.

Let’s talk about the latest 540 billion parameter model from Google, dubbed PaLM, which stands for Pathways Language Model. For reference, the GPT-3 model only had 175 billion parameters, so PaLM is more than three times larger. Pathways refers to a parallelization infrastructure that allows Google to efficiently scale and create this massive model.

What do we get from such a huge model? Well, we get across-the-board improvements on a variety of natural language understanding tasks, such as reading comprehension and common sense reasoning. We also get improvements in natural language generation tasks such as summarization and dialog generation (chat-bot capabilities).

But what is interesting is a whole new set of evaluation benchmarks that have now been crafted to figure out the limits of these models. The Beyond the Imitation Game Benchmark (BIG-bench) is a take on Alan Turing’s famous Turing test – the longstanding test for intelligence he proposed in 1950. The classic Turing test defines intelligence using a communication game: If a user is unable to tell whether their conversation partner is human or machine, then the machine is intelligent.

BIG-bench is an evolving collection of benchmarks, currently numbering over 200. PaLM exhibits impressive results in zero-shot and few-shot settings. In a zero-shot setting, the model is given only an instruction and no worked examples: for instance, it is handed an English text along with a prompt asking for a Spanish translation, and it translates the text. In a few-shot setting, a handful of examples are included in the prompt to show the model what is needed. Some of the really impressive results relate to arithmetic reasoning, where the model spells out its chain of thought in arriving at a correct math result. Here is an example taken from an article:

“Q. Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?”

Example output:

“Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.”

Admittedly elementary math, but still impressive for a language model to be able to do. Another impressive area of performance is in the explanation of jokes. Despite these impressive results across a wide range of tasks, on any specific task PaLM does not necessarily beat a fine-tuned model custom-built to solve that problem – the classic tension between generalist models and specialist models.
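For readers curious what a few-shot, chain-of-thought prompt looks like in practice, here is a minimal sketch. It only assembles the text that would be sent to a model; no model is called. The "Q:/A:" layout, the function name and the second question are my own illustrative assumptions, and the worked example assumes each can holds 3 balls, as the sample output above states.

```python
def build_few_shot_prompt(examples, question):
    """Join worked (question, answer) pairs, then append the new question.

    With an empty `examples` list this degenerates to a zero-shot prompt.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# One worked example whose answer spells out the reasoning steps.
examples = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11. The answer is 11.",
    )
]

# Hypothetical new question the model would be asked to answer in the same style.
new_question = (
    "A baker has 4 trays with 6 rolls each and sells 5 rolls. "
    "How many rolls are left?"
)

prompt = build_few_shot_prompt(examples, new_question)
print(prompt)
```

The prompt ends with an open "A:" so that the model's next-word predictions continue from there, ideally imitating the step-by-step style of the worked example.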

Now, what is the prognosis for the future of these models given the current state? Are we looking at larger and larger models? One of the points the PaLM creators allude to is a “discontinuous improvement in performance” from a smaller model to the largest model. Meaning, the largest model performs significantly better than you would predict by extrapolating from similar but smaller models. They compare performance between 8B, 62B and 540B parameter models. This line of reasoning implies there will be efforts to create even larger models.

This “bigger is better” line of thinking has been fairly prevalent, but “not so fast” is the verdict from DeepMind researchers. These researchers have determined that you don’t necessarily have to increase the size of the model to get better performance – instead, you can feed the training process more data. To prove their point, they trained a new 70 billion parameter model called Chinchilla (no idea why they chose to name it after a rodent) on 1.4 trillion training tokens. In contrast, the vastly larger GPT-3 was trained on only 300 billion tokens. With more than four times the training data, Chinchilla outperforms GPT-3 and other larger models like the 530 billion parameter Megatron-Turing NLG from Microsoft and Nvidia.
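The contrast is easier to see as training tokens per parameter. The back-of-the-envelope sketch below uses only the parameter and token counts quoted above; the variable and function names are mine.

```python
# (parameters, training tokens) as quoted in the text above
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}


def tokens_per_parameter(params, tokens):
    """How many training tokens the model saw for each of its parameters."""
    return tokens / params


for name, (params, tokens) in models.items():
    ratio = tokens_per_parameter(params, tokens)
    print(f"{name}: {ratio:.1f} tokens per parameter")

# How much more training data Chinchilla saw than GPT-3
data_ratio = models["Chinchilla"][1] / models["GPT-3"][1]
print(f"Chinchilla saw {data_ratio:.1f}x the training tokens of GPT-3")
```

By this measure Chinchilla saw roughly 20 tokens for every parameter, while GPT-3 saw fewer than 2, even though GPT-3 has two and a half times as many parameters.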

But the latest model, PaLM, outperforms Chinchilla. The DeepMind authors opine that a smaller model can reach the performance of PaLM when trained with more data. I found this blog comparing and contrasting the differing approaches to language model training quite instructive. Essentially, for a given model size, you can train with a lot more data and get the best performance that size can deliver.

Certainly, the large language models’ capabilities are getting more and more impressive. The PaLM authors demonstrate that, in a range of tasks, the model is performing at an average human level – whatever that means. But they also point out there is still a big gap between the model’s performance and the best human performance on language understanding and generation tasks. Models of the brain have indicated that we also use something akin to a language model to help with communication, but not necessarily reasoning.

There is another problem with the way language models are trained. GPT-3’s grasp of current events is frozen in 2020, PaLM’s in 2021. Unless researchers figure out how to continually update a model with new data, it remains frozen in time. Perhaps they will keep training these models ad infinitum! That thought is a bit scary.

We have not by any means reached the end of exploration of language models. They remain a super active area of research and practical solutions are not far behind.

I am always looking for feedback and if you would like me to cover a story, please let me know! Leave me a comment below or ask a question on my blogger profile page.

“Juggy” Jagannathan, PhD, is Director of Research for 3M M*Modal and is an AI Evangelist with four decades of experience in AI and Computer Science research.