AI Talk: Transformer architecture

March 25, 2022 / By V. “Juggy” Jagannathan, PhD

In this week’s blog, I trace the history and potential of the transformer architecture.


Transformers is all you need

Mention “transformers” to my grandkids and they get excited. But I am not talking about the mechanical marvel that transforms from a superhero robot into a super car, nor am I talking about the device used in power grids. I am talking about a deep learning architecture that is overtaking the machine learning world.

In the summer of 2017, a landmark paper was published by Google researchers (Vaswani et al.) titled “Attention is all you need.” The machine learning architecture they proposed was dubbed the Transformer, perhaps based on the initial application they focused on (machine translation of text from one language to another). The “attention” the authors refer to is based loosely on the human ability to attend to specific parts of an input. They also showed that, for machine translation, this architecture performed significantly better than the existing state-of-the-art (SOTA) models.
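The mechanism at the heart of that paper, scaled dot-product attention, is a surprisingly short computation. Here is a minimal NumPy sketch with toy dimensions (the learned projection matrices and multi-head machinery of the full architecture are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row sums to 1: how much to "attend"
    return weights @ V, weights

# Toy example: 3 tokens, each represented by a 4-dimensional vector.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))

# Self-attention: queries, keys and values all come from the same input.
out, w = scaled_dot_product_attention(x, x, x)
```

Each output row is a weighted mix of all the input vectors, with the weights telling you which parts of the input each token attended to.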

The transformer architecture seemed particularly suited to processing text. This fact was proven convincingly in 2018 when Google researchers published another pivotal paper, dubbed “BERT” (Devlin et al.), following a trend of naming machine learning architectures after Sesame Street characters. The BERT transformer was used to pre-train a language representation model – also called contextual embeddings. Essentially, by training the model on large quantities of text, they came up with a representation for every word informed by its unique context.

The researchers went further than just creating these representations. They showed that when the representations were used for a range of downstream language understanding tasks (detailed in the General Language Understanding Evaluation (GLUE) benchmark), they moved the SOTA by a significant margin. The use of pretrained representations in other tasks is called “transfer learning” – the model transfers what it has learned to new tasks.

GLUE is a series of tasks, such as identifying whether the sentiment of a sentence is positive or negative, or whether a sentence and its paraphrase are semantically equivalent. GLUE identified nine different tasks and established a way to figure out which systems were advancing the SOTA. As significant progress was achieved on the GLUE benchmark, it was replaced by the more difficult SuperGLUE benchmark.

The results showcased in the BERT paper galvanized the research community. Over the next few years, a slew of transformer-based pretraining models were released. OpenAI released a series of pretrained language models, each one progressively larger. In the summer of 2020, it released GPT-3 (Generative Pre-trained Transformer 3), which had 175 billion machine learning parameters. This model was even more remarkable in all the language tasks it could do – I blogged about it in the summer of 2020. A GPT-4 version is in the works and will probably be released next year.

All the above is old news. So, why am I rehashing this old transformer history? A recent article discusses the role of the transformer architecture in domains outside of text understanding. The article’s provocative title, “Will Transformers Take Over Artificial Intelligence?”, intrigued me with its pronouncement that the transformer architecture may be useful for handling multi-modal data, not just text. Turns out, over the past few years and especially in the past year, every modality – text, audio, images and video – is being tackled using the transformer architecture. Transformers are perhaps all you need when it comes to deep learning architecture.

The Vision Transformer (ViT) set the stage for applying the transformer architecture to image recognition tasks and performed very well compared to other approaches. Conformer is a hybrid architecture that combines convolutional networks with the transformer architecture to build a better automatic speech recognizer. PolyViT is a more recent transformer entry that pre-trains with multi-modal data – images, audio and video. Another recent entry among multi-modal models is Google’s Multimodal Bottleneck Transformer (MBT), a transformer architecture whose pre-training process tries various forms of cross-modal attention – i.e., attending a bit to speech when processing a video segment.

It is hard to believe that transformer models have been around for half a decade now. One thing that is undisputed: they have transformed the landscape of deep learning solutions for language processing. It increasingly looks like they might do the same for other modalities as well.

I am always looking for feedback and if you would like me to cover a story, please let me know! Leave me a comment below or ask a question on my blogger profile page.

“Juggy” Jagannathan, PhD, is Director of Research for 3M M*Modal and is an AI Evangelist with four decades of experience in AI and Computer Science research.