What Is a Transformer Model in AI? Features and Examples

Table of Contents

Transformer model types
How does transformer model work?
Encoder in transformer model
Decoder in transformer model
Self-attention in transformer model
RNNs vs. LSTMs vs. transformers
Transformer model examples across industries
Future of transformer model
Transformer model: Frequently asked questions (FAQs)

Earlier, translating and analyzing natural language was a lengthy and resource-intensive process in machine learning. From defining hidden states to predicting text with transformer models, we have come a long way. These transformer models can automate text generation effortlessly and quickly without human intervention.

Powered by artificial neural network software, transformers have supercharged linguistics across commercial domains such as healthcare, retail, e-commerce, banking, and finance. These models have brought about a revolution in deep learning, combining the latest natural language processing and parallelization methods to decipher long-range dependencies and semantic relationships and generate contextual content.

Let's go deeper to understand the why and how of transformer models in generative AI.

What is a transformer model?

A transformer model is a type of machine learning architecture that is trained on natural language processing tasks and knows how to handle sequential data. It uses methods like "self-attention" and parallelization to process all the words of a sequence simultaneously. These methods allow the model to derive semantic bonds between subject and object.

Transformer models have been a game changer in the world of content. Not only do they help design conversational interfaces for question answering, they can also read entire documents written in one language and generate a counterpart in a different language.

Transformers can process all the words of a text sequence together, unlike earlier neural networks such as recurrent neural networks (RNNs), gated RNNs, and long short-term memory (LSTM) networks. This ability is derived from an underlying “attention mechanism” that prompts the model to attend to the important parts of the input statement and leverage that data to generate a response.

Transformer models recently outpaced older neural networks and have become prominent in solving language translation problems. The original transformer architecture has formed the basis of AI text generators such as the generative pre-trained transformer (GPT) models behind ChatGPT, bidirectional encoder representations from transformers (BERT), the text-to-text transfer transformer (T5), and MegaMolBART.

A transformer can be monolingual or multilingual, depending on the input sequences you feed it. It analyzes text by keeping track of the positions of earlier words. All the words in the sequence are processed at once, and relationships are established between words to determine the output sentence. For this reason, transformers are highly parallelizable and can process multiple lines of content at once.

Transformer model types

The architecture of a transformer depends on which AI model you train it on, the size of the training dataset, and the vector dimensions of word sequences. Mathematical attributes of input and pre-trained data are required to process desired outcomes.

Encoder-only architecture is a stack of encoder layers that reads the whole input in both directions and produces a contextual representation of every token. Examples are BERT and RoBERTa.
An encoder-decoder model uses both the encoder stack and the decoder stack (six layers each in the original transformer) to position word sequences and derive language counterparts. Examples are the original transformer and T5.
Decoder-only architecture sees the input fed as a prompt to the model without recurrence. The model generates new tokens one at a time, each conditioned on the prompt and the tokens produced so far. Examples are OpenAI’s GPT and GPT-2.
Bidirectional and Auto-Regressive Transformer, or BART, is a natural language processing (NLP) model that pairs a bidirectional encoder with an autoregressive decoder. It uses transfer learning: the model is pre-trained to reconstruct corrupted text and then applies that context to generate new words.

How does transformer model work?

Mainly used for language translation and text summarization, transformers can scan words and sentences with a clever eye. Artificial neural networks shot out of the gate as the new phenomenon that solved critical problems like computer vision and object detection. The introduction of transformers applied the same intelligence in language translation and generation.

The main functional layer of a transformer is an attention mechanism. When you enter an input, the model attends to the most important parts of the input and studies them in context. A transformer can traverse long input sequences to reach back to the first part or the first word and produce contextual output.

The entire mechanism is spread across two major components: the encoder and the decoder. Some models, like BERT, are powered by a pre-trained encoder only.

The original transformer architecture stacks six encoder layers and six decoder layers. This is what it looks like.

[Figure: transformer architecture]

Each sublayer of this transformer architecture is designated to treat data in a specific way for accurate results. Let’s break down these sub-layers in detail.

Encoder in transformer model

The job of an encoder is to convert a text sequence into abstract continuous number vectors and judge which words have the most influence over one another.

The encoder layer of a transformer network converts the information from textual input into numerical tokens. These tokens form a state vector that helps the model understand the input better. First, the tokens undergo the process of input embedding.

1. Input embedding

The input embedding or the word embedding layer breaks the input sequence into process tokens and assigns a continuous vector value to every token.

For example, if you are trying to translate “How are you” into German, each word of this arrangement will be assigned a vector. You can think of this layer as a lookup table of learned information, similar to a VLOOKUP in a spreadsheet.
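
As a rough illustration of the lookup-table idea, the sketch below maps each word of “How are you” to a small vector. The vocabulary, the 4-dimensional vectors, and the function names are assumptions made for this example; in a trained model the vectors are learned rather than random.

```python
import random

def build_embedding_table(vocab, dim, seed=0):
    """Return a dict mapping each token to a vector of `dim` floats."""
    rng = random.Random(seed)
    # Random values stand in for weights a real model would learn.
    return {token: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for token in vocab}

def embed(tokens, table):
    """Look up the vector for every token in the sequence."""
    return [table[token] for token in tokens]

vocab = ["how", "are", "you"]                 # hypothetical toy vocabulary
table = build_embedding_table(vocab, dim=4)
vectors = embed("how are you".split(), table)
print(len(vectors), "tokens, each with", len(vectors[0]), "dimensions")
```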

2. Positional encoding

Next comes positional encoding. As transformer models have no recurrence, unlike recurrent neural networks, they need explicit information about each word's location within the input sequence.

Researchers at Google came up with a clever way to use sine and cosine functions to create positional encodings. Sine is used for the even dimensions of the encoding vector, and cosine is used for the odd dimensions.

Below is the formula that gives us positional information of every word at every time step in a sentence.

Positional encoding formula:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

PE → Positional encoding

pos → Position (time step) of the word in the sequence

i → Index of the sine/cosine dimension pair

d_model → Total vector dimension of each embedding

These positional encodings are added to the input embeddings so the neural network knows where each word sits in the sequence. The combined numbers are passed on to the “attention” layer of the neural network.
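
The sine and cosine scheme can be sketched in a few lines of plain Python. The function name and the tiny sizes (3 positions, 4 dimensions) are assumptions chosen for readability; the original transformer used 512 dimensions.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for dim in range(d_model):
            i = dim // 2                                  # index of the sine/cosine pair
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=3, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(value, 4) for value in row])
```

Each row is the encoding for one position, and it has the same length as a word embedding so the two can be summed element by element.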

3. Multi-headed attention and self-attention

The multi-headed attention mechanism is one of a transformer neural network's two most important sublayers. It employs a "self-attention" technique to understand and register the pattern of the words and their influence on each other.

Again taking the earlier example, for a model to translate “How are you” into the German “Wie geht es dir,” it needs to assign proper weightage to each English word and find its German counterpart, even though the words do not line up one to one. Models also need to understand that sequences styled in this way are questions and that there is a difference in tone. This sentence is more casual, whereas “Wie geht es Ihnen” would have been more respectful.

Each word vector of the input sequence is projected into a query, a key, and a value and passed to the attention layer.

The concept of query, key, and value in multi-head attention

Word vectors are linearly projected into the next layer, the multi-head attention. Each head in this mechanism projects every word vector into three vectors: a query, a key, and a value. This is the computational core of attention, where all the important operations are performed on the text sequence.

Did you know? The hidden vector dimension of the BERT-base model is 768. The original transformer converts input into vector embeddings of dimension 512.
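
To give a feel for how several heads share one vector, here is a minimal sketch that splits a 512-dimensional vector across 8 heads of 64 dimensions each, the sizes reported for the original transformer. The helper names are hypothetical, and the learned projection matrices that real models apply before the split are left out.

```python
def split_heads(vector, num_heads):
    """Split one d_model-sized vector into num_heads equal slices."""
    d_model = len(vector)
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [vector[h * d_head:(h + 1) * d_head] for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate per-head slices back into one vector."""
    return [value for head in heads for value in head]

projected = [float(n) for n in range(512)]    # stand-in for a projected query vector
heads = split_heads(projected, num_heads=8)
print(len(heads), "heads of size", len(heads[0]))
```

Each head then runs attention on its own slice, and the results are concatenated back into a single vector.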

The query and key matrices undergo a dot-product matrix multiplication to produce a score matrix. The score matrix contains the “weights” given to each word according to its influence on the other words in the input.

The attention weights are then multiplied with the "value" matrix to produce an output sequence. The output values indicate the placement of subjects and verbs, the flow of logic, and output arrangements.

Before that multiplication, however, the scores need stabilizing: large dot products can lead to extremely small gradients during training. To keep the scores in check, the score matrix is divided by the square root of the dimension of the queries and keys.

4. Softmax layer

The softmax layer receives the attention scores and compresses them into values between 0 and 1 that add up to 1 for each word. This gives the machine learning model a more focused representation of where each word stands in the input text sequence.

In the softmax layer, the higher scores are elevated, and the lower scores get depressed. The resulting attention weights, computed from [Q*K], are multiplied with the value matrix [V] to produce an output vector for each word. Words that receive a large weight are retained in the output. Words whose weight tends towards zero are drowned out.
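
The score, scale, softmax, and value steps described above can be put together in a short sketch. It is a simplification under stated assumptions: the function names are made up for this example, and the same toy vectors serve as queries, keys, and values, whereas a real model derives each from learned projections.

```python
import math

def softmax(row):
    """Turn a list of scores into weights between 0 and 1 that sum to 1."""
    peak = max(row)                                   # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, then scale by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # toy vectors for three tokens
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence.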

5. Residual and layer normalization

The output vectors produced by the attention heads are concatenated to create one single resultant matrix of abstract representations that define the text in the best way.

The residual connection adds the sublayer's input back to its output, so the original signal is preserved, and passes the sum on to the normalization layer. The normalization layer stabilizes the gradients, enabling faster training and better prediction power.

Together, the residual connection and the normalization layer keep the values flowing through the encoder in a stable range, so that the neural network's activation layers work effectively, predictive power is bolstered, and the text is understood in its entirety.

Tip: The output of each sublayer is LayerNorm(x + Sublayer(x)), where x is the sublayer's input and Sublayer(x) is the function implemented by the sublayer itself (attention or feedforward).
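
A minimal sketch of the LayerNorm(x + Sublayer(x)) pattern for a single token vector follows. The `double` function is a hypothetical stand-in for an attention or feedforward sublayer, and the learned gain and bias that real layer normalization includes are omitted.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]

def add_and_norm(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for one token vector."""
    residual = [a + b for a, b in zip(x, sublayer(x))]   # the residual connection
    return layer_norm(residual)

def double(vec):
    """Hypothetical stand-in sublayer used only for this illustration."""
    return [2.0 * v for v in vec]

normalized = add_and_norm([1.0, 2.0, 3.0, 4.0], double)
print([round(v, 4) for v in normalized])
```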

6. Feedforward neural network

The feedforward layer receives the output vectors with embedded output values. It contains a series of neurons that take in the output and then process and transform it. The network applies a linear transformation, the ReLU activation function, and a second linear transformation to each position. ReLU helps mitigate the “vanishing gradients” problem.

This gives the output a richer representation and increases the network’s predictive power. Once the output matrix is created, the encoder layer passes the information to the decoder layer.
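
As a sketch of this step, the feedforward layer can be written as two linear transformations with a ReLU in between, applied to one token vector at a time. The weights and the tiny layer sizes (2 to 3 to 2) are hypothetical values picked so the arithmetic is easy to follow.

```python
def relu(vector):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in vector]

def linear(vector, weights, bias):
    """Compute vector @ weights + bias; weights holds one row per input dimension."""
    return [
        sum(vector[i] * weights[i][j] for i in range(len(vector))) + bias[j]
        for j in range(len(bias))
    ]

def feed_forward(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between, applied to one token vector."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

# Hypothetical weights for a 2 -> 3 -> 2 network.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]
y = feed_forward([1.0, 2.0], w1, b1, w2, b2)
print([round(v, 4) for v in y])
```

The hidden values here are [1.0, 3.0, -1.5] before ReLU, so the negative third value is zeroed out before the second linear layer.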

Did you know? The concept of attention was first introduced in recurrent neural networks and long short-term memory (LSTM) networks to add missing words to an input sequence. Even though they were able to produce accurate words, they couldn’t conduct the language operations through parallel processing, regardless of the amount of computational power.

Benefits of encoders in transformer model

Some companies already utilize a double-stacked version of the transformer’s encoder to solve their language problems. Given the humongous language datasets, encoders work phenomenally well in language translation, question answering, and fill-in-the-blanks.

Besides language translation, encoders work well in industrial domains like medicine. Companies like AstraZeneca apply transformer architectures in molecular AI to study molecular and protein structures.

Other benefits include:

Masked language modeling: Encoders can derive context from the words surrounding a gap in a sentence to identify the missing word. Gated RNNs and LSTMs have a shorter reference window, which prevents them from looking far back and learning the importance of distant words. Encoders instead use self-attention over the whole sentence to understand words and produce output.

Bidirectional: Not only does the encoder derive meaning from the current word, it also attends to all the other words and their contextual bond with the current word. This gives encoders an edge over plain RNNs and LSTMs, which read a sequence one step at a time in a single direction.

Sequence classification: Encoders can process sequence transduction, sequence-to-sequence, word-to-sequence, and sequence-to-word problems. They map the input sequence to a numerical representation to classify the output.
Sentiment analysis: Encoders are great for sentiment analysis, as they can encode the emotion from the input text and classify it as positive, negative or neutral.

As the encoder processes and computes its share of input, all the learned information is then passed to the decoder for further analysis.

Decoder in transformer model

The decoder architecture contains the same number of sublayer operations as the encoder, with a slight difference in the attention mechanism. Decoders are autoregressive, which means they only look at previous word tokens and previous output to generate the next word.

Let's look at the steps a decoder goes through.

Positional embeddings: The decoder takes the previous output tokens and converts them into abstract numeric representations that carry positional information. However, this time, it only converts words up to time step t-1, with t being the current word.
Masked multi-head attention 1: To further prevent the decoder from processing future tokens, the input undergoes a first layer of masked attention. In this layer, attention scores of words are calculated and added to a mask matrix that contains either 0 or negative infinity.
Softmax layer: After masking, the output gets passed on to the softmax layer, which squeezes the scores into stable values between 0 and 1. The mask matrix is structured in such a way that the negative infinities line up only with future tokens, so the softmax layer turns their weights into zero.
Multi-head attention 2: In the second attention layer, also called encoder-decoder attention, the keys and values from the encoder output are compared with the queries from the decoder to find the most relevant parts of the input.
Feedforward neural network: After the attention layers, a feedforward network with a residual connection and layer normalization further transforms each position's representation.
Linear classifier: The last linear classifier layer, followed by a softmax over the vocabulary, predicts the most likely next word, so the output is produced word by word.
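
The masking step in the list above can be sketched as follows. This is an illustrative simplification: the attention scores are made-up numbers, the function names are hypothetical, and the mask is added to the scores before the softmax so that future positions end up with a weight of exactly zero.

```python
import math

def causal_mask(size):
    """Row t allows positions 0..t (0.0) and blocks later ones (-inf)."""
    return [[0.0 if col <= row else -math.inf for col in range(size)] for row in range(size)]

def masked_softmax(scores, mask):
    """Add the mask to the scores, then apply softmax row by row."""
    result = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s + m for s, m in zip(score_row, mask_row)]
        peak = max(masked)
        exps = [math.exp(v - peak) for v in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

scores = [[0.5, 1.2, 0.3], [0.1, 0.4, 0.9], [1.0, 0.2, 0.7]]   # toy attention scores
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print([round(w, 3) for w in row])
```

The first row puts all of its weight on the first token, because the first position has no earlier tokens to look at besides itself.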

Decoding is also costlier than encoding. Because each output token depends on the tokens generated before it, the decoder produces its output step by step, which adds GPU consumption and memory stress.

Benefits of decoders in transformer model

Unlike encoders, decoders do not traverse the left and right parts of sentences while analyzing the output sequence. Decoders handle the encoder output and the previous decoder output and then weigh the attention parameters to generate the final output. For all the following words in the sentence, the decoder adds a mask layer so that their weights reduce to zero.

Unidirectional: Decoders look only to the left of the word at the current time step t, up to time step t-1. They are unidirectional and don’t have anything to do with future words. For example, while changing “How are you” into “I am fine,” the decoder uses masked self-attention to cancel out the words falling after time step t-1, so when it predicts “fine,” it can access only “I” and “am.”
Excellent text generation and translation: Decoders can create text sequences from a query or a sentence. OpenAI’s generative pre-trained transformers like GPT-3.5 and GPT-4o are based on decoder mechanisms that use input text to predict the next most likely word.
Causal language modeling: Decoders can tokenize plain textual datasets and predict newer or missing words. They derive context from the already existing tokens on the left and use that probability distribution to hypothesize the next sensible word in a sentence.
Natural language generation (NLG): Decoder mechanisms are used in NLG models to build dialogue-based narratives on an input dataset. Microsoft’s Turing-NLG is an example of a decoder transformer. It is being used to develop dialogue-based conversational abilities in humanoids like Sophia.

Although decoders are used for building AI text generators and large language models, their unidirectional methodology restricts their capability of working with multiple datasets.

What is causal language modeling?

Causal language modeling is an AI technique that predicts the next token in a sequence. The model attends only to the tokens on the left, which are unmasked, while the tokens on the right stay masked. This technique is mainly used in natural language generation or natural language processing.
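
Next-token prediction of this kind is usually run in a loop: predict a token, append it, and predict again. The sketch below shows that loop with greedy selection. The `TOY_MODEL` lookup table is a hypothetical stand-in for a trained decoder, and unlike a real decoder it looks only at the last token rather than at everything on the left.

```python
def generate(next_token_probs, prompt, max_new_tokens, end_token="<end>"):
    """Greedy decoding: repeatedly append the most probable next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)      # the model sees only tokens produced so far
        best = max(probs, key=probs.get)
        if best == end_token:
            break
        tokens.append(best)
    return tokens

# Hypothetical stand-in for a trained decoder, keyed on the last token.
TOY_MODEL = {
    "I": {"am": 0.9, "fine": 0.1},
    "am": {"fine": 0.8, "I": 0.2},
    "fine": {"<end>": 1.0},
}

def toy_next_token_probs(tokens):
    return TOY_MODEL[tokens[-1]]

result = generate(toy_next_token_probs, ["I"], max_new_tokens=5)
print(" ".join(result))
```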

Self-attention in transformer model

A self-attention mechanism is a technique that retains information inside a neural network about a particular token or sentence. It draws global dependencies between the input and the output of a transformer model.

For example, consider these sentences:

"No need to bear the brunt of your failures"

and

"I think I saw a polar bear rolling in the snow."

A simple neural network like an RNN or LSTM may struggle to differentiate between these two sentences and might translate them in the same way. It takes proper attention to understand how the word “bear” affects the rest of the sentence. For instance, the words “brunt” and “failures” can help a model understand the contextual meaning of the word “bear” in the first sentence. The phenomenon of a model “attending to” certain words in the input dataset to build correlations is called "self-attention".

This concept was brought to life by a team of researchers at Google and the University of Toronto through a paper, Attention Is All You Need, by Ashish Vaswani and seven co-authors. The introduction of attention made sequence transduction simpler and faster.

A well-known example sentence from early research on attention in machine translation is:

The agreement on the European Economic Area was signed in August 1992.

In the French language, word order matters and cannot be shuffled around. The attention mechanism allows the text model to look at every word in the input while delivering its output counterparts. Self-attention in NLP maintains a rhythm of input sentences in the output.

While converting the above sentence, the text model looks at “economic” and “European” to pick out the correct French word, “européenne.” Also, the model understands that the word “européenne” needs to be feminine to match “la zone.”

RNNs vs. LSTMs vs. Transformers

The gaps and inconsistencies in RNNs and LSTMs led to the invention of transformer neural networks. With transformers, you can trace memory locations and recall words with less processing power and data consumption.

Recurrent neural networks, or RNNs, work on a word-by-word basis. The neural network serves as a queue where each word of the input is processed in turn. At every step, the network stores what it has seen in a hidden state and passes that state, along with the new input word, to the next step, which therefore has context from the previous words.

The model works well on shorter sentences, but it fails drastically when the sentence becomes too information-heavy or context-specific.
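
To make the word-by-word recurrence concrete, here is a toy sketch with a single scalar hidden state. The weights `w_x` and `w_h` are arbitrary illustrative values, not taken from any trained network. The loop has to run in order, and the influence of the first input shrinks at every later step, which hints at why long sentences are hard for RNNs.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h):
    """One recurrent step on scalars: the new hidden state mixes input and memory."""
    return math.tanh(w_x * x_t + w_h * h_prev)

def run_rnn(inputs, w_x=0.5, w_h=0.9):
    """Process inputs strictly in order; step t needs the result of step t-1."""
    h = 0.0
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h)
        states.append(h)
    return states

# Only the first input carries a signal; later hidden states show it fading.
states = run_rnn([1.0, 0.0, 0.0, 0.0])
print([round(h, 4) for h in states])
```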

Long short-term memory (LSTM) models tried to eliminate this problem with RNNs by adding a cell state. The cell state retains information from the input and carries it forward to the decoding layer of the model. Gates perform element-wise multiplications on the cell state to discard irrelevant values, which gives the model a longer memory window.

Transformers use a stacked encoder-decoder architecture to form the best representation of the input. This architecture enables the decoder to look up which numerical representations were used in the input through query, key, and value. Further, the attention mechanism draws inferences from previous words to logically place words in the final sentence.

Transformer model examples across industries

From understanding protein unfolding to designing chatbots, social media content or localized guides, transformer models are on a roll across industries.

Personalized recommendations in e-commerce: Algorithms like BERT are used in the retail and e-commerce sector to break down search queries across multiple languages, match search intent, and display a personalized feed of suggestions to improve conversions and revenue. Retail giants like eBay and Amazon integrate transformer models to translate content and personalize product recommendations.
Medical document analysis in healthcare: In the medical domain, transformer models can retrieve patient records, support diagnosis and treatment decisions, and derive insights from pathological tests to assess the condition of the patient. Transformers like MegaMolBART or BioBERT are adopted to optimize medical operations and build accurate diagnostics.
Fraud detection and risk mitigation in finance: Transformer models can scrutinize customer transactions to flag fraudulent ones and recover account details to prevent or mitigate further risks. Financial firms like JPMorgan Chase & Co. or Morgan Stanley employ transformer models to reduce the risk of credit fraud and generate financial summaries and statements for customers.
AI chatbots and intelligent agents in customer service: Companies are also keen to shift customer service tickets and escalations from human agents to AI chatbots that are programmed with transformer models. These chatbots attend to a myriad of customer queries and process resolution for all of them at the same time, while establishing a natural conversation and a sentimental tone.
Content generation and sentiment analysis in marketing: Marketers and content creators utilize transformer models to generate high-value and engaging content for their audiences. Not only do transformer models generate content copy in response to a text prompt, they also provide graphic suggestions, storytelling approaches, new narratives, and so on. Examples include GPT, Gemini, and Anthropic's Claude.

Future of transformer model

In the future, transformers are expected to be trained with billions or trillions of parameters to automate language generation with ever higher accuracy. They may use concepts like AI sparsity and mixture of experts to give models a better sense of what they do and do not know, thereby reducing the hallucination rate. Future transformers will work on an even more refined form of attention technique.

Some transformers like BLOOM and GPT-4 are already being used globally. You can find them in intelligence bureaus, forensics, and healthcare. Advanced transformers are trained on a slew of data and industrial-scale computational resources. Slowly and gradually, the upshot of transformers will change how every major industry functions and build resources intrinsic to human survival.

A transformer also parallelises well, which means you can operationalize the entire sequence of input operations in parallel through more data and GPUs.

Transformer model: Frequently asked questions (FAQs)

What is dependency?

Long-term or short-term dependencies mean how much the neural network remembers what happened in the previous input layer and can recall it in the next layer. Neural networks like transformers build global dependencies between data to trace their way back and compute the last value. A transformer relies entirely on an attention mechanism to draw dependencies from an input dataset through numbers.

What is a time step?

A time step is a way of processing your data at regular intervals. It creates a memory path for the user wherein they can allot specific positions to words of the text sequence.

What is an autoregressive model?

Autoregressive or unidirectional models forecast future variables based on previous variables only. This works when there’s a correlation between the preceding step and the succeeding step in a time series. To predict the next word, they consider nothing except the left-side (earlier) values in a sentence and their computed outputs.

What is the best transformer model?

Some of the best transformer models are BERT, GPT-4, DistilBERT, ClinicalBERT, RoBERTa, T5 (text-to-text transfer transformer), Google MUM, and MegaMolBART by AstraZeneca.

Which transformer is the largest size?

Megatron-LM is an 8.3 billion parameter large language model that was the biggest transformer at the time of its release in 2019; considerably larger models have been released since. It was trained on 512 GPUs (Nvidia’s Tesla V100) using 8-way model parallelism.

Where are transformer models used?

Transformer models are used for critical tasks like making antidotes, drug discoveries, building language intermediates, multilingual AI chatbots, and audio processing.

“Attention” is the need of the hour

Day by day, machine learning architectures like transformer models are receiving quality input and data surplus to improve performance and process operations just like humans. We are not so far away from a hyperconnected future where all ideas and strategies will emerge from transformer models and the current level of hardware wastage and energy consumption will be reduced to build a fully automated ecosystem.

Shreya Mattoo

Shreya Mattoo is a former Content Marketing Specialist at G2. She completed her Bachelor's in Computer Applications and is now pursuing Master's in Strategy and Leadership from Deakin University. She also holds an Advance Diploma in Business Analytics from NSDC. Her expertise lies in developing content around Augmented Reality, Virtual Reality, Artificial intelligence, Machine Learning, Peer Review Code, and Development Software. She wants to spread awareness for self-assist technologies in the tech community. When not working, she is either jamming out to rock music, reading crime fiction, or channeling her inner chef in the kitchen.