By: Santiago Máximo, R&D Engineer at Digital Sense.
ChatGPT, based on GPT-3.5, has gone viral since its release in November 2022. This conversational AI assistant, developed by OpenAI, gained one million users within five days of its launch and has been described as the fastest-growing consumer application in history. And it continues to evolve. GPT-4, OpenAI’s latest model, was launched on 14 March 2023 and is still making waves. Why so much buzz? How do these technologies work, and what are their capabilities and limitations? This article explores these questions.
ChatGPT shows significant improvements in text generation quality compared to its predecessors, producing human-like responses to prompts. It is also versatile: it can write articles and poetry, imitate a particular writer’s style, write programming code, and perform translations.
This article will explore the evolution of language models, with a focus on Transformers, introduced in 2017, as well as the further development of the GPT family of generative models. We will then analyze the architecture of ChatGPT, based on GPT-3.5, exploring the Reinforcement Learning from Human Feedback (RLHF) technique, one of the key components of this technology. Toward the end, we will briefly mention what is currently known about GPT-4, available via ChatGPT Plus.
Throughout this study, we will examine the emerging capabilities of these large language models and provide insights into their probable underlying sources. Lastly, we will refer to the limitations and possibilities for future improvements of this disruptive chatbot.
Language Models
A language model is a probability distribution over sequences of words. For simplicity, we assume we are working with words, but more generally language models assign probabilities to tokens or token sequences. Tokens can be words, subwords, or characters. They are obtained through tokenization, a process that divides the text into small units.
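To make the definition concrete, here is a minimal sketch of a statistical language model. It assumes whitespace tokenization and a bigram approximation (each word depends only on the previous one). The toy corpus and the function names are invented for illustration, and no smoothing is applied.

```python
from collections import Counter

def train_bigram_model(corpus):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        context_counts.update(tokens[:-1])             # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))  # (previous, current) pairs
    return context_counts, bigram_counts

def sequence_probability(sentence, context_counts, bigram_counts):
    """P(w1..wn) approximated as the product of P(w_i | w_{i-1})."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for previous, current in zip(tokens, tokens[1:]):
        if context_counts[previous] == 0:
            return 0.0  # unseen context, no smoothing in this sketch
        probability *= bigram_counts[(previous, current)] / context_counts[previous]
    return probability

# Hypothetical toy corpus, for illustration only.
toy_corpus = ["the cat sat", "the cat ran", "the dog sat"]
context_counts, bigram_counts = train_bigram_model(toy_corpus)
p_seen = sequence_probability("the cat sat", context_counts, bigram_counts)
p_unseen = sequence_probability("sat the cat", context_counts, bigram_counts)
print(p_seen, p_unseen)
```

The word order seen in the corpus receives a non-zero probability, while the scrambled sentence receives zero. This is the sense in which the model is a distribution over sequences rather than over isolated words.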

Language Models can be broadly classified into two major categories:
- Statistical Language Models: they use traditional statistical approaches like N-grams and Hidden Markov Models.
- Neural Language Models: they are based on neural network architectures such as Recurrent Neural Networks (RNN) or Transformers. [1]
Transformers
Language Models based on Transformers, first introduced by Google Brain in 2017 [2], represent the state of the art in most NLP (Natural Language Processing) tasks, such as machine translation, text generation, or question answering.
Transformers handle long-distance dependencies better and allow more parallelization than other neural architectures.
Transformers are based exclusively on the attention mechanism, which is used to calculate the relationships between all pairs of words in a sequence. This information is later used to weigh the importance of each word when making predictions.
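The computation described above can be sketched as scaled dot-product attention, the formulation used in [2]. This is a simplified illustration: a real Transformer first applies learned projections to obtain queries, keys, and values, whereas here the same toy vectors are used for all three, and the numbers are arbitrary.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity between this query and every key (all pairs of tokens).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output = [sum(w * v[j] for w, v in zip(weights, values))
                  for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional vectors for a 3-token sequence (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to one and expresses how much the corresponding token attends to every token in the sequence, including itself.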

Self-supervised Learning
The dominant approach is pre-training Transformer-based Language Models in a self-supervised manner, and then adapting them to specific tasks with fine-tuning. Self-supervised learning is a technique in which it is not necessary to explicitly label the dataset since the labels are obtained from the input data itself.
Self-supervision can be implemented by masking tokens in a text and trying to predict those tokens (labels) based on their context.
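As a rough sketch of how labels can come from the input itself, the function below hides some tokens and keeps the originals as targets. The 15% default rate mirrors BERT's pre-training setup. The function name, the `[MASK]` string, and the fixed seed are assumptions made for this example, and real pipelines operate on subword ids rather than words.

```python
import random

def mask_tokens(tokens, mask_probability=0.15, mask_token="[MASK]", seed=0):
    """Return (inputs, labels): labels hold the original token where masked, else None."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_probability:
            inputs.append(mask_token)  # the model sees the mask...
            labels.append(token)       # ...and must predict the hidden token
        else:
            inputs.append(token)
            labels.append(None)        # no loss is computed at this position
    return inputs, labels

tokens = "the cat sat on the mat".split()
inputs, labels = mask_tokens(tokens, mask_probability=0.3)
print(inputs)
print(labels)
```

No human annotation is involved: the training targets are simply the tokens that were hidden.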

Depending on the location of the tokens that are masked during pre-training, Transformer Language Models can be classified into two main groups: Masked Language Models and Causal Language Models.
Masked Language Models
They predict a masked token in a sequence. They are bidirectional in nature as they can attend to tokens to the left and right of the masked input.

Masked language models are suitable for tasks that require full-sentence understanding, such as sentiment analysis. BERT, published in 2018 by Google, is one of the most popular models of this group.
Causal Language Models
These models predict the next token in a sequence and can attend only to tokens to the left. [3]
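The left-only restriction can be illustrated with a causal attention mask and the training examples it implies. This is a minimal sketch with hypothetical helper names. In practice the mask is applied to the attention scores inside the model.

```python
def causal_mask(sequence_length):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(sequence_length)] for i in range(sequence_length)]

def next_token_examples(tokens):
    """Each prefix is an input and the token that follows it is the label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["the", "cat", "sat", "down"]
mask = causal_mask(len(tokens))
examples = next_token_examples(tokens)

for row in mask:
    print(row)
for context, label in examples:
    print(context, "->", label)
```

A masked (bidirectional) model would correspond to a mask filled entirely with `True`, since every position may look both left and right.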

Causal Language Models are useful for tasks involving text generation. GPT-3, one of the predecessors of ChatGPT, is one of the most renowned generative models.
Transformers architecture
The original Transformer, published in the paper “Attention Is All You Need” [2], was a neural machine translation model. It used an encoder-decoder architecture, following the approach of previous neural networks implemented for translation.


However, many later Transformers use just one stack of Transformer blocks. BERT is an encoder-only model, while GPT-1, GPT-2, and GPT-3 are decoder-only models. [4]

The GPT family
OpenAI is a company founded in San Francisco in 2015 by Sam Altman and Elon Musk, among others. It was initially a non-profit organization, but in 2019 it transitioned to a “capped” for-profit. The capped-profit model allowed OpenAI to attract investment from companies like Microsoft.
OpenAI introduced GPT-1 in 2018, GPT-2 in 2019, and GPT-3 in 2020. Each of these models showed significant progress compared to its predecessors. However, they share the same architecture inspired by decoder blocks from the original Transformer. The main differences between them are the number of parameters and the size of the training data.

GPT-3
Before the introduction of ChatGPT, GPT-3 was probably the language model that had the greatest impact, demonstrating significant advances in text generation.
GPT-3’s ability to solve tasks for which it had not been explicitly trained was one of the key reasons for its impact. For example, GPT-3 could solve arithmetic operations, correct spelling mistakes, or translate words between English and French, even though it was not trained specifically for those tasks. This was possible due to a capability called in-context learning: if the model is provided with a few examples of a particular task in its prompt, it is then expected to complete further instances of the task without requiring additional fine-tuning. [5] [6]
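In-context learning happens entirely through the prompt, so it can be illustrated by how a few-shot prompt is assembled. The sketch below only builds the text. The function name and prompt layout are assumptions for this example, and sending the prompt to a model is left out.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Assemble a prompt with a task description, solved examples, and an open query."""
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
        lines.append("")
    lines.append(f"English: {query}")
    lines.append("French:")  # the model is expected to continue from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
print(prompt)
```

The model's weights are not updated at any point: the demonstrations only condition the next-token predictions that follow the final `French:` line.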

The ability to generate code also contributed to GPT-3 going viral in 2020, although it was not specifically designed for coding tasks. GPT-3 can generate syntactically correct and generally functional code, but it may not always produce the most efficient or effective code. [7]
Another emergent ability of GPT-3 was its world knowledge, which probably came from its huge training dataset. [8]
With all these advances, GPT-3 could generate highly convincing and coherent language, but it still suffered from the alignment problem of Large Language Models. Sometimes the models’ outputs do not align with the values and goals of their human users, resulting in text that is unhelpful or that contains hallucinations, misinformation, or toxic content.
How to improve the quality of the generated data?
Taking into account the drawbacks of GPT-3, the next big step for language models would be to improve the quality of the generated text so that it is more aligned with human intention.
But evaluating the quality of a text is a challenging task, because the criteria depend on the goal. When writing a novel, creativity is highly valued, while when referring to a historical event, factual content is preferred. If the intention is to generate code, it must be executable. Moreover, regardless of the type of content, the generated text must be ethically sound and avoid bias or hate speech.
Compiling all these values into a loss function seems like a complex task. Although there are NLP metrics, such as BLEU (Bilingual Evaluation Understudy), that are better suited to evaluating text, they also fall short of the desired objective. So why not incorporate the aforementioned values through human feedback, rather than defining a magic equation?
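To see why overlap-based metrics fall short, consider the clipped n-gram precision that BLEU builds on. The sketch below is a simplified single-reference version for unigrams only, not the full BLEU score, and the sentences are invented for illustration.

```python
from collections import Counter

def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference (clipped counts)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    cand_ngrams = Counter(tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_ngrams[ngram]) for ngram, count in cand_ngrams.items())
    return overlap / total

reference = "the weather is nice today"
paraphrase = "it is a lovely day"         # same meaning, different words
shuffled = "today nice is weather the"    # same words, no meaning

paraphrase_score = clipped_ngram_precision(paraphrase, reference)
shuffled_score = clipped_ngram_precision(shuffled, reference)
print(paraphrase_score, shuffled_score)
```

The faithful paraphrase scores far lower than the scrambled sentence. Higher-order n-grams penalize the scrambling, but the paraphrase would still score poorly, which is one reason such metrics are a weak proxy for human judgment.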
One of the main differentiators of ChatGPT and what makes it stand out from other existing models is the use of Reinforcement Learning from Human Feedback (RLHF). [9]
ChatGPT
Through blog posts published on OpenAI’s official website, it is possible to learn some details about the functionality and training of ChatGPT, but to date, no paper has been published with more detailed information. However, OpenAI has mentioned that ChatGPT was trained using the same methods as InstructGPT.
The following sections of this article are based on information available from OpenAI [10][11], on the InstructGPT paper [13], and on other publications by experts in the field about ChatGPT [12] and the RLHF technique [9].
The RLHF used to train ChatGPT is a three-phase process:
- Phase 1: Pretrain a generative language model that will be the policy fine-tuned with Reinforcement Learning during Phase 3.
- Phase 2: Train a Reward Model that will be part of the reward function used by the Reinforcement Learning method in Phase 3.
- Phase 3: Fine-tune the model generated in Phase 1 with Reinforcement Learning. The outcome of this last phase is the ChatGPT model.
Human AI trainers were hired by OpenAI to generate the datasets built in Phases 1 and 2. They were given instructions on how to generate text or rank responses.
Phase 1
In this phase, an initial model is produced that is better suited to act as a chatbot than the earlier GPT-3.

For this stage of training, a model from the GPT-3.5 series is used as a base. Unlike GPT-3, it was not trained only on a general text dataset, but on a blend of text and code from before the fourth quarter of 2021. The baseline model is probably a GPT-3 model that was fine-tuned mostly on programming code.
A direct consequence of training on code is an improved capacity to understand and generate code. Additionally, although there is still no hard evidence, some emergent abilities of ChatGPT might be side effects of training on code:
- ChatGPT exhibits chain-of-thought reasoning ability: it can generate a chain of thought that emulates an intuitive thought process when working through a complex problem. With chain-of-thought prompting, the model is shown examples that include intermediate reasoning steps, so that it writes out similar steps before giving the final answer to a problem. This prompting elicits reasoning without the need for fine-tuning. [14] [15]

- When interacting with ChatGPT, it is easy to notice that the model can remember what the user said earlier in the conversation. This ability to handle long-range dependencies could also be a side effect of training on code. Next-token prediction on code requires understanding complex structures with longer dependencies due to hierarchy, variable definitions, and relationships between objects or functions. [8]
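The chain-of-thought prompting mentioned in the first item of the list can be illustrated by comparing two prompts built from the same exemplar, one with and one without the intermediate reasoning. The exemplar follows the style of the arithmetic examples in [14]. The helper name and the dictionary layout are assumptions for this sketch.

```python
def build_prompt(exemplars, question, include_reasoning):
    """Build a few-shot prompt, optionally showing the reasoning in each exemplar."""
    parts = []
    for exemplar in exemplars:
        answer = exemplar["answer"]
        if include_reasoning:
            # Chain-of-thought: the worked steps precede the final answer.
            answer = f'{exemplar["reasoning"]} {answer}'
        parts.append(f'Q: {exemplar["question"]}\nA: {answer}')
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

exemplars = [{
    "question": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
                "Each can has 3 tennis balls. How many tennis balls does he have now?",
    "reasoning": "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
                 "6 tennis balls. 5 + 6 = 11.",
    "answer": "The answer is 11.",
}]
question = ("The cafeteria had 23 apples. If they used 20 to make lunch "
            "and bought 6 more, how many apples do they have?")

standard_prompt = build_prompt(exemplars, question, include_reasoning=False)
cot_prompt = build_prompt(exemplars, question, include_reasoning=True)
print(cot_prompt)
```

The only difference between the two prompts is the worked reasoning inside the exemplar. According to [14], that difference is what encourages the model to reason step by step on the new question.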
During the first phase, the GPT-3.5 base model is fine-tuned using a supervised learning approach, with a dialogue dataset, to improve its performance in generating responses in a conversational way. The dialogue dataset is a combination of the instructions dataset from InstructGPT, transformed into a dialogue format, and a demonstration dataset built especially for this stage.
The demonstration dataset is composed of prompt/response pairs. The prompts were obtained from two different sources: some were selected from previous OpenAI API requests and others were generated by the reviewers themselves. To generate responses, reviewers were given guidelines on how to complete requests. The Assistant should refuse to respond to requests for content that expresses hate, intends to harass, promotes self-harm, attempts to influence the political process, includes explicit sexual material, or tries to generate intrusive software. Additionally, the Assistant should reject false premises.
The Supervised Fine-Tuning (SFT) model trained in this phase will then be used in the following two phases.
Phase 2

The Reward Model trained in the second phase is a model that takes in a text, consisting of a prompt and response, and outputs a scalar reward. Based on the InstructGPT paper, the base model for training the Reward Model is the SFT model trained in Phase 1 with the final unembedding layer replaced with a projection layer to output a scalar value.
Presumably, the SFT model is also used to generate responses from a set of prompts. For each prompt, several model outputs are sampled, and the reviewers rank the outputs from best to worst. This comparison data is utilized to build a dataset, where the labels are the rankings, that is then used for training the Reward Model.
The reason why reviewers are asked to compare results instead of simply scoring individual responses is that human scores can be uncalibrated, which may lead to a lower-quality dataset.
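A ranking of several responses can be turned into training signal by splitting it into pairs and asking the Reward Model to score the preferred response higher. The sketch below follows the pairwise loss described in the InstructGPT paper [13]. Whether ChatGPT uses exactly this form is an assumption, and the response labels and reward values are placeholders.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps the input order, so the first element always ranks higher.
    return list(combinations(ranked_responses, 2))

def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), computed in a numerically stable way."""
    difference = reward_preferred - reward_rejected
    if difference >= 0:
        return math.log1p(math.exp(-difference))
    return -difference + math.log1p(math.exp(difference))

# Placeholder labels standing in for responses ranked by a reviewer.
pairs = ranking_to_pairs(["A", "B", "C"])
loss_agrees = pairwise_loss(2.0, 0.0)     # model already prefers the better response
loss_disagrees = pairwise_loss(0.0, 2.0)  # model prefers the worse response
print(pairs, loss_agrees, loss_disagrees)
```

The loss depends only on the difference between the two rewards, so the Reward Model never needs an absolute, calibrated score from the reviewers.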
The guidelines provided to the reviewers include some categories that need to be taken into account when rating the model outputs. Then, when the Reward Model is used, the values in reviewers' feedback are generalized to a variety of inputs.
Reinforcement learning
Before diving into Phase 3, it is crucial to review the fundamental principles of Reinforcement Learning.
Reinforcement learning is a machine learning technique in which an agent interacts with an environment to learn how to make optimal decisions. For each action, the agent receives feedback in the form of rewards or penalties. The policy is the strategy used by the agent to decide the next action to take based on the current state. [16]

One of the families of Reinforcement Learning algorithms is the Policy Gradient Method, which directly optimizes an agent's policy to maximize the expected cumulative reward. In these methods, the policy, which is the function that maps states to a probability distribution over actions, is typically represented by a parameterized model such as a neural network, and its parameters are learned by following the gradient of the expected reward.
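As a minimal illustration of the policy gradient idea, the sketch below optimizes a softmax policy over two actions in a single-state problem. To keep it deterministic it uses the exact gradient of the expected reward instead of sampled episodes. The reward values, learning rate, and step count are arbitrary choices for this example.

```python
import math

def softmax(logits):
    """Convert policy parameters (logits) into action probabilities."""
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, rewards, learning_rate):
    """One gradient-ascent step on the expected reward J = sum_a pi(a) * r(a)."""
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    # For a softmax policy, dJ/dlogit_k = pi(k) * (r(k) - J).
    gradient = [p * (r - expected_reward) for p, r in zip(probs, rewards)]
    return [l + learning_rate * g for l, g in zip(logits, gradient)]

rewards = [1.0, 0.0]  # action 0 is the better action
logits = [0.0, 0.0]   # start from a uniform policy
for _ in range(200):
    logits = policy_gradient_step(logits, rewards, learning_rate=1.0)

final_policy = softmax(logits)
print(final_policy)
```

Actions with above-average reward have their probability increased and the others decreased, so the policy gradually concentrates on the better action.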
Policy gradient updates can be highly sensitive to the choice of step size. To solve this problem, the Proximal Policy Optimization (PPO) algorithm uses a surrogate objective function that constrains the policy updates to a small region around the current policy [17][18].
Proximal Policy Optimization is the Reinforcement Learning algorithm applied in Phase 3, with the policy being a language model.
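The surrogate objective can be sketched for a single action as follows. It uses the clipped form from the PPO paper, where the probability ratio is the new policy's probability of the action divided by the old policy's. The clipping value of 0.2 is a commonly used default, not a value confirmed for ChatGPT.

```python
def clipped_surrogate(probability_ratio, advantage, clip_epsilon=0.2):
    """PPO clipped objective for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_epsilon, min(probability_ratio, 1.0 + clip_epsilon))
    return min(probability_ratio * advantage, clipped_ratio * advantage)

# Good action (positive advantage): gains are capped once the ratio exceeds 1 + eps.
print(clipped_surrogate(1.5, 1.0))
# Bad action (negative advantage): no extra credit for pushing the ratio below 1 - eps.
print(clipped_surrogate(0.5, -1.0))
# Moving in the wrong direction is not clipped, so it is fully penalized.
print(clipped_surrogate(1.5, -1.0))
```

Because the objective stops improving once the ratio leaves the interval around 1, there is no incentive to take an update that moves the policy far from the current one.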
Phase 3

For the third phase, a new dataset is built that contains prompts without responses.
The policy to be updated with the Proximal Policy Optimization algorithm, from now on called the PPO model, is initialized with the SFT model trained in Phase 1.
The training process starts by giving a prompt (the current state) to the PPO model (policy) in order to obtain a response y1.
The prompt and the response y1 are passed to the Reward Model to produce a reward. If the PPO model is updated solely based on rewards, it may begin generating nonsensical text that tricks the Reward Model into producing high rewards. To prevent this, the objective function of the PPO algorithm comes into play, combining the reward with a constraint on policy shift. This constraint is achieved by adding a KL (Kullback–Leibler) divergence term that penalizes the PPO model for moving substantially away from the initial SFT model. The penalty is calculated by comparing the token probabilities behind the response y1 with those of a second response y2 obtained by passing the same prompt to the initial SFT model.
InstructGPT, and probably ChatGPT, add further terms to the PPO update rule. However, the goal of this article is not to go into that level of detail but to point out the concepts most relevant to understanding the use of PPO in this context.
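The combination of reward and KL penalty can be sketched as below, following the form given in the InstructGPT paper [13]: the Reward Model score minus a coefficient times the log-ratio between the PPO model and the SFT model on the generated tokens. The log-probabilities and the coefficient are illustrative placeholders, and the log-ratio on one sampled response is only an estimate of the KL divergence.

```python
def kl_penalized_reward(reward_model_score, policy_logprobs, sft_logprobs, kl_coefficient=0.02):
    """Reward minus a penalty for drifting away from the initial SFT model."""
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("Both models must score the same response tokens.")
    # Sum over tokens of log pi_PPO(token) - log pi_SFT(token).
    log_ratio = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return reward_model_score - kl_coefficient * log_ratio

# Placeholder per-token log-probabilities for a two-token response.
sft_logprobs = [-1.0, -1.5]
same_as_sft = kl_penalized_reward(1.0, sft_logprobs, sft_logprobs)
drifted = kl_penalized_reward(1.0, [-0.1, -0.2], sft_logprobs)
print(same_as_sft, drifted)
```

A response that the PPO model finds much more likely than the SFT model does is penalized, even when the Reward Model scores it highly. This discourages the policy from exploiting the Reward Model with text the original model would rarely produce.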
ChatGPT is the final PPO model generated after finishing the training process.
What do we know about GPT-4?
OpenAI provided limited information regarding the model architecture, number of parameters, and training process for GPT-4. The following information was compiled from OpenAI’s official website.
GPT-4 is a large multimodal model that accepts both text and image inputs and generates text outputs. GPT-4’s text input capability can be accessed through ChatGPT Plus and the OpenAI API. However, image inputs are currently only available as a research preview and are not publicly accessible.
Like its predecessors, the GPT-4 base model was trained to predict the next word in a document, using publicly available data as well as licensed data. Its behavior was also fine-tuned using Reinforcement Learning from Human Feedback (RLHF).
During RLHF training, GPT-4 incorporates an additional safety reward signal to minimize harmful outputs. Compared to GPT-3.5, there has been an 82% reduction in the model’s tendency to respond to requests for disallowed content.
While the difference between GPT-3.5 and GPT-4 may not be noticeable in casual conversation, it becomes apparent when the task complexity exceeds a certain threshold. [19][20]

Despite its impressive improvements, GPT-4 has limitations similar to earlier GPT models, including hallucination, reasoning errors, and biases.
Final Comments
Since the release of ChatGPT, through its different iterations, until the launch of GPT-4, the quality of text generation has kept improving. Although the knowledge of these models is fixed at a cutoff before the last quarter of 2021, Phases 2 and 3 of the RLHF process can be iterated continuously. Feedback collected from users who have interacted with ChatGPT or ChatGPT Plus can be used to generate new datasets to train the Reward Model and, consequently, to train a new policy in Phase 3.
The leap in quality has been enormous, and the misalignment problem that previous generative models presented has been solved satisfactorily, though not completely. Further adjustments to the RLHF technique may still be necessary, such as refining the selection of reviewers and the definition of guidelines, or a totally different technique may even emerge. Nevertheless, this work suggests that it is feasible to align generative models with human intention and that the cost of alignment is modest compared to the cost of pre-training large language models.
Much has been said about ChatGPT potentially dethroning Google’s search engine, but it clearly still has shortcomings: at times it hallucinates, producing information that appears credible but is not. The technology that replaces current search engines will likely be different, such as the Retrieval Augmented Generation (RAG) architecture, which pairs a retrieval step with a conversational model similar to ChatGPT.
Last but not least, one of OpenAI’s great achievements has been to bring to light the capabilities that Artificial Intelligence can acquire in the short term, bringing the debate into all areas. As a new world emerges, it is crucial for governments, educational institutions, and society as a whole to adapt to the new circumstances. For everyone looking to expand their LLM and NLP development services, there are a lot of ways to do it; you just need to find the right path. If you want to know more about NLP vs. LLMs, check out our blog! Technology is advancing fast, and these new tools could be beneficial.
References:
- [1] Language Models in AI. Introduction; Dennis Ash; Medium.
- [2] Attention Is All You Need; Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin; Cornell University.
- [3] Language modeling; Hugging Face.
- [4] The Illustrated GPT-2; Jay Alammar.
- [5] Language Models are Few-Shot Learners; Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei; Cornell University.
- [6] La Siguiente Gran Revolución: NLP (Procesamiento del Lenguaje Natural); Dot CSV; YouTube.
- [7] OpenAI GPT-3: Everything You Need to Know; Kindra Cooper; Springboard.
- [8] How does GPT Obtain its Ability?; Yao Fu, Hao Peng, Tushar Khot; University of Edinburgh and Allen Institute for AI.
- [9] Illustrating Reinforcement Learning from Human Feedback (RLHF); Nathan Lambert, Louis Castricato, Leandro von Werra, Alex Havrilla; Hugging Face.
- [10] Introducing ChatGPT; OpenAI.
- [11] How should AI systems behave, and who should decide?; OpenAI.
- [12] How ChatGPT actually works; Marco Ramponi; AssemblyAI.
- [13] Training language models to follow instructions with human feedback; Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe; Cornell University.
- [14] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models; Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou; Cornell University.
- [15] Language Models Perform Reasoning via Chain of Thought; Jason Wei, Denny Zhou; Google AI Blog.
- [16] Reinforcement Learning Tutorial; Javatpoint.
- [17] A Brief Introduction to Proximal Policy; sidsen99; Geeks for Geeks.
- [18] Proximal Policy Optimization; OpenAI.
- [19] GPT-4; OpenAI.
- [20] GPT-4 Technical Report; OpenAI.



