AI talk: Models, models everywhere but what do they all do?

May 20, 2022 / By V. “Juggy” Jagannathan, PhD

This week I look at a slew of new models released in the past month that have garnered significant interest. I was inspired to write this blog after seeing an excellent technical discussion hosted by Yannic Kilcher covering a variety of recent models. This blog covers a few of the models discussed there.

Meta’s OPT 175B

The competition in large language models is heating up. In my last blog, I talked about the 540 billion parameter model from Google called PaLM. This new model is from Facebook – now Meta. OPT stands for Open Pre-trained Transformer language model. This release (May 3) is a 175 billion parameter model.

So, what’s new here? For one, the models (in a range of sizes) are open for use. In fact, they are already available on the popular transformer model sharing site HuggingFace. This is a big help to all the other companies that do not have the wherewithal to create such massive and expensive models. Meta basically set out to recreate the GPT-3 model that came out a few years ago – but with a focus on evaluating bias, toxicity and hate speech – something Meta is constantly dealing with on its platform.

By releasing the model in the open, Meta believes it can foster best practices and promote responsible use and deployment of such large language models. One fascinating aspect of the open release is the inclusion of the logbook notes from training, which give an accurate picture of the trials and tribulations involved in creating such huge models.
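If you are curious to try one of the smaller open checkpoints yourself, here is a minimal sketch using the HuggingFace transformers library. The particular checkpoint name and generation settings are illustrative choices on my part, not a recommendation.

```python
# A minimal sketch of loading one of the smaller open OPT checkpoints from HuggingFace.
# The checkpoint name and generation settings here are illustrative, not prescriptive.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # smaller sizes such as 125m and 350m are also available
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Large language models are"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```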

DALL-E 2

OpenAI’s DALL-E 2, released last month, is a fascinating second iteration of the model. The moniker DALL-E is a clever juxtaposition of the animated film WALL-E and the surrealist artist Salvador Dali, known for his bizarre imagery. So, in a nutshell, what does it do? Given a textual description of something, the model creates stunning images from that description. Take a look at the glowing praise in the world of art here. A phrase such as “an astronaut riding a horse on the moon” literally generates a high-resolution version of that image. You can actually play with it yourself, using a mini version of the model (which is still good, but lower resolution) here. How does it do this? Under the hood, it couples text and image representations learned from images and their captions with a decoder that renders the final image.

Flamingo

DeepMind released Flamingo late last month – a visual language model. It can, with a few examples (“few-shot” in deep learning parlance), provide captions for images. So, if you provide a description such as “This is a chinchilla. They are mainly found in Chile,” along with an image of a chinchilla, that constitutes one example.

Now you can input a picture of a flamingo and it puts out the following description: “This is a flamingo. They are found in the Caribbean.” The model also exhibits multimodal dialog capabilities: you can engage it in a conversation about images, what they mean and how they are put together. The model is meant to help with image and video understanding tasks, and it is already demonstrating significant capabilities on this front. Check out this vlog on the model and Yannic Kilcher’s take.
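Flamingo itself is not publicly released, but the few-shot setup is easy to picture: interleave a handful of (image, caption) examples with a query image and let the model continue the text. The sketch below is purely illustrative and calls no real model; it only shows how such a prompt might be assembled.

```python
# Purely illustrative sketch of a Flamingo-style few-shot prompt. No real model is called;
# this only shows how interleaved (image, caption) examples and a query image line up.
from typing import List, Tuple

def build_few_shot_prompt(examples: List[Tuple[str, str]], query_image: str) -> List[dict]:
    """Interleave support examples (image + caption) and end with the query image."""
    prompt = [{"image": image, "text": caption} for image, caption in examples]
    prompt.append({"image": query_image, "text": ""})  # the model would fill in this caption
    return prompt

examples = [
    ("chinchilla.jpg", "This is a chinchilla. They are mainly found in Chile."),
    ("shiba.jpg", "This is a shiba. They are very popular in Japan."),
]
prompt = build_few_shot_prompt(examples, "flamingo.jpg")
for turn in prompt:
    print(turn)
# A Flamingo-style model consuming this sequence would continue with something like:
# "This is a flamingo. They are found in the Caribbean."
```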

TorToise – Text-to-speech

It’s not often we talk about an individual engineer’s model, but this text-to-speech model by James Betker, a Google engineer, was built in his spare time. You can see what he built and how he did it here. Why the name TorToise? The model is slow, both to train and to generate speech. What does the model do? You give it a few samples of a reference voice and some text, and it converts the text to speech in the same voice as the reference samples. Think about it: give it a few samples of your voice and it can read any text in your voice! He built this model with parts that are accessible to everyone in the open source community. Remarkable, albeit a bit scary, to democratize a deepfake system for speech.
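If you want to try it, using the open-source repo looks roughly like the sketch below. The voice name, preset and output handling are assumptions on my part, so check the repo’s README for the current interface.

```python
# Rough sketch of using the open-source tortoise-tts repo; the voice name, preset and
# output handling below are assumptions on my part. Consult the repo's README for details.
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()
# Load a few short reference clips of the target voice (the repo ships some sample voices).
voice_samples, conditioning_latents = load_voice("tom")
# Generate speech for the given text in that voice.
speech = tts.tts_with_preset(
    "Give it a few samples of your voice and it can read text in your voice.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",
)
torchaudio.save("generated.wav", speech.squeeze(0).cpu(), 24000)
```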

Jurassic-X

An Israeli startup, AI21 Labs, has developed a newer version of its language model, dubbed Jurassic-X. It’s an implementation of a system with a clever moniker: the Modular Reasoning, Knowledge and Language system (MRKL, pronounced “miracle”).

What does Jurassic-X do? The claim: it bridges the neuro-symbolic chasm. Essentially, they have developed a mechanism that can inject expert knowledge into a neural model’s response. If you ask GPT-3, or any language model trained before 2020, who the current president of the U.S. is, the answer is “Donald Trump.” The knowledge in these systems is frozen in time, which is obviously not desirable. How does the MRKL system address this issue? It has a curated set of experts that can be triggered by the language model. In the above example, an expert that can query Wikidata ensures the correct response is “Joe Biden.” Of course, Google always gets this right. Perhaps they are trying to divine Google’s secret sauce.

A similar ask-the-expert approach can be used to address math problems, which language models have difficulty with. So, their vision is a language interface to everything, supported by all kinds of experts. One immediate practical application is to provide a language interface to a company’s proprietary databases. Does this kind of approach put a stop to the never-ending stream of new language model versions being released? The jury is out on that.
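AI21’s implementation isn’t public, but the routing idea is easy to sketch: something decides whether a query should go to an external expert or fall back to the neural model. Everything in the toy below (the rules, the fact table, the expert functions) is my own illustrative stand-in, not Jurassic-X’s API.

```python
# Toy illustration of MRKL-style routing, not AI21's actual implementation: a query is
# dispatched either to an external "expert" (a calculator, or a tiny fact table standing
# in for a live Wikidata lookup) or falls back to the language model.
import re
from typing import Optional

FACTS = {"current president of the u.s.": "Joe Biden"}  # stand-in for a Wikidata query

def calculator_expert(query: str) -> str:
    expression = re.sub(r"[^0-9+\-*/(). ]", "", query)
    return str(eval(expression))  # fine for a toy; never eval untrusted input in real code

def knowledge_expert(query: str) -> Optional[str]:
    q = query.lower()
    for key, value in FACTS.items():
        if key in q:
            return value
    return None

def route(query: str) -> str:
    if re.search(r"\d\s*[+\-*/]\s*\d", query):
        return calculator_expert(query)   # arithmetic goes to the calculator expert
    answer = knowledge_expert(query)
    if answer is not None:
        return answer                     # time-sensitive facts go to the lookup expert
    return "<fall back to the neural language model>"

print(route("What is 123 * 45?"))                           # -> 5535
print(route("Who is the current president of the U.S.?"))   # -> Joe Biden
```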

I am always looking for feedback and if you would like me to cover a story, please let me know! Leave me a comment below or ask a question on my blogger profile page.

“Juggy” Jagannathan, PhD, is Director of Research for 3M M*Modal and is an AI Evangelist with four decades of experience in AI and Computer Science research.