What Makes a Large Model a Foundation Model: A Historical Perspective

After the release of ChatGPT, the term "Foundation Model" suddenly became a hot topic. Media outlets, research institutions, and enterprises alike are enthusiastically researching and launching their own foundation models; an admittedly incomplete survey by Tencent Research counted over 200 foundation models under development in China. But what exactly is a foundation model? What conditions must an ordinary large-scale model meet to count as one? And if the core criterion is "large," how large does it need to be? Few people discuss these questions, so I wrote this article to explain my viewpoint from a historical perspective.

Before 2017, deep learning had already become the mainstream of machine learning. Compared to traditional machine learning algorithms, deep learning has one core advantage: it has never hit data saturation. Traditional models, even as their complexity grows, cannot effectively exploit data beyond a certain scale; in practice, once the training data passes that point, their accuracy stops improving. This had long been a well-known problem in the field of machine learning.

Deep learning, however, does not have this problem. As long as you give it more GPUs, more training data, and a more complex model to train, it will reliably deliver better accuracy. To this day, we have seen no sign of saturation in deep learning. This is why deep learning is so popular with companies, especially large ones: being able to buy better accuracy simply by spending money creates a natural moat for them.

But all of this rests on one prerequisite: the model must be a feedforward neural network, which is well suited to parallel processing and can therefore be run and optimized efficiently on GPUs. Until 2017, however, the dominant methods in NLP were LSTMs and other recurrent neural networks. Their defining trait is that to compute the output for the next word, you must first finish computing the output for the previous word. This makes them inherently serial algorithms, unfriendly to GPUs and hard to parallelize efficiently. That is why, even though self-supervised language-modeling tasks appeared early on (for example, masking a word in the middle of a sentence and guessing it from the context), they were never applied to truly large-scale data. Word2Vec, for instance, which also predicts a word from its surrounding context, is only a two-layer network and was never really combined with deep learning.
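
To make the serial-versus-parallel contrast concrete, here is a minimal PyTorch sketch (my own illustration with arbitrary sizes, not code from any paper discussed here): the recurrent cell must loop over positions one at a time, while self-attention handles the whole sequence in a single batched operation.

```python
import torch
import torch.nn as nn

d_model, seq_len, batch = 64, 128, 8
x = torch.randn(batch, seq_len, d_model)   # a batch of embedded token sequences

# Recurrent processing: every step depends on the previous hidden state,
# so the loop over positions cannot be parallelized.
cell = nn.LSTMCell(d_model, d_model)
h = torch.zeros(batch, d_model)
c = torch.zeros(batch, d_model)
rnn_outputs = []
for t in range(seq_len):                   # strictly serial over the sequence
    h, c = cell(x[:, t, :], (h, c))
    rnn_outputs.append(h)

# Self-attention: every position attends to every other position in one
# batched matrix operation -- exactly the shape of computation GPUs excel at.
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
attn_outputs, _ = attn(x, x, x)            # all positions computed at once
```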

In 2017, Google released the Transformer architecture, which finally changed everything. Through a global attention mechanism, the Transformer turns the serial computation structure of LSTMs into one that can be processed in parallel, allowing efficient training on GPUs and opening up new possibilities for NLP. Within about a year, OpenAI and Google independently released the GPT and BERT algorithms, which greatly improved performance across a wide range of NLP tasks by introducing large-scale pre-training. Using these algorithms generally follows two steps:

  1. The first step uses a self-supervised objective of the kind described above (masking a word in a sentence, or predicting the next word, and asking the model to fill it in), trained on large-scale data to obtain a pre-trained neural network.
  2. The second step then fine-tunes this model for a specific task, such as sentiment analysis of a passage, allowing the pre-trained model to adapt to that task and achieve excellent performance (see the sketch below).
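
As a concrete illustration of this second step, here is a minimal fine-tuning sketch using the Hugging Face `transformers` library (my own example, not code from the GPT or BERT papers; the model name, label count, and toy data are all illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained BERT checkpoint and attach a fresh classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3   # e.g. positive / neutral / negative
)

# A toy labelled batch; a real fine-tuning run would loop over a full dataset.
texts = ["The food was great!", "Service was painfully slow."]
labels = torch.tensor([0, 2])           # 0 = positive, 2 = negative
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)  # the model computes the loss internally
outputs.loss.backward()                  # every weight in the network gets updated
optimizer.step()
```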

After their release, GPT and BERT quickly became standard algorithms in the NLP field. Since adding a new task required significant GPU resources for fine-tuning, this provided many job opportunities for scientists and engineers :D

In 2020, GPT-3 emerged. Although this model didn't seem particularly eye-catching to the general public, it accomplished something crucial: it eliminated the fine-tuning step. We only need to take the pre-trained GPT-3 model and give it a plain-language description of the task, such as "Below is a user review from a public review site, tell me if the sentiment is positive, neutral, or negative," along with a few examples, and it achieves quite remarkable results. Structurally, GPT-3 is not significantly different from the original GPT; the main difference is scale, with a far larger model and roughly 500 times more pre-training data. This is deep learning's brute-force style taken to its logical conclusion, and it once again demonstrated the approach's immense potential.
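
In code, this style of few-shot (in-context) adaptation is nothing more than assembling a prompt; there is no training loop at all. A minimal sketch, where the `complete` function is a placeholder for whatever text-completion API actually serves the model, not a real library call:

```python
def build_prompt(review: str) -> str:
    # Task description plus a few examples, followed by the new input.
    return (
        "Below is a user review from a public review site, tell me if the "
        "sentiment is positive, neutral, or negative.\n\n"
        "Review: The noodles were amazing and the staff was friendly.\n"
        "Sentiment: positive\n\n"
        "Review: It was fine, nothing special.\n"
        "Sentiment: neutral\n\n"
        f"Review: {review}\n"
        "Sentiment:"
    )

def complete(prompt: str) -> str:
    """Placeholder: send the prompt to a pre-trained model and return its continuation."""
    raise NotImplementedError

prompt = build_prompt("Waited an hour and the food arrived cold.")
print(prompt)
# answer = complete(prompt)   # no gradients, no new weights -- the task lives in the text
```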

In 2022, OpenAI released another achievement that initially did not attract widespread attention: InstructGPT. This work brought RLHF (reinforcement learning from human feedback) into GPT training, making alignment training scalable almost without limit. The process first trains a reward model on a limited set of human annotations indicating which of two outputs people prefer, then lets that reward model take the human's place, scoring outputs during reinforcement learning. This amounts to an effectively unlimited supply of feedback, greatly increasing the data available for the alignment stage. With this new method, OpenAI obtained, for the first time, a language model that understands human intent and responds to human instructions (instruction following). In November of the same year, combining the results of GPT-3 and InstructGPT, OpenAI launched ChatGPT. The subsequent developments are well known to everyone.
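
The heart of that first step is a simple pairwise preference loss. A minimal PyTorch sketch (my own toy illustration, assuming a small scalar-output reward network and random stand-in embeddings rather than real model outputs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A tiny scalar-output reward model; real reward models are typically initialized
# from the language model itself, but a toy MLP is enough to show the loss.
reward_model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Each pair: an output humans preferred vs. one they rejected
# (random stand-ins here instead of real response embeddings).
chosen = torch.randn(32, 64)
rejected = torch.randn(32, 64)

r_chosen = reward_model(chosen)      # score of the preferred output
r_rejected = reward_model(rejected)  # score of the rejected output

# Pairwise (Bradley-Terry style) loss: preferred outputs should score higher.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
optimizer.step()

# Once trained, the reward model scores new outputs in place of a human
# annotator, supplying effectively unlimited feedback for the RL stage.
```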

From this historical perspective, what exactly is a foundation model? I believe it has two hallmark capabilities:

  1. The ability to do few-shot learning, or in-context learning, without fine-tuning. More specifically, adapting a foundation model to a new scenario requires no scientists or engineers, no large pool of GPUs for training or fine-tuning, and no changes to the model's weights; the adaptation can be completed entirely through interaction with ordinary users. This is a disruptive change for real products, because it lets us deploy a single model to accomplish many different tasks, which is where the "foundation" in foundation model comes from.

    An interesting observation is that, at least for now, these foundation models do not seem to respond well to fine-tuning. BERT often yields excellent results after fine-tuning, but foundation models such as CLIP, with its open-vocabulary cross-modal understanding, are notorious in the research community for being hard to improve through fine-tuning. To use CLIP effectively, the better approach is usually to adapt the input side, optimizing the prompt or input embeddings while leaving the model's weights untouched (a minimal sketch of this kind of input-side adaptation follows this list).

  2. Conversational UI. Before this kind of interface existed, tapping into the capabilities of a machine learning or AI model meant knowing how to write a Python program to call it: what tokenization is, how to use PyTorch, how to move memory between CUDA and the CPU. With conversational AI, anyone who can hold a conversation can access those capabilities through direct dialogue. This is another clear example of disruptive innovation in product form.
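
Here is the input-side adaptation sketch promised above: a toy PyTorch example of "soft prompt" tuning, where the pre-trained encoder is frozen and only a few input embeddings are learned. Everything in it (the tiny stand-in encoder, the dimensions, the cosine loss) is my own illustrative assumption, not CLIP's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for a frozen pre-trained encoder (e.g. a CLIP-style text tower).
class FrozenEncoder(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

    def forward(self, x):
        return self.layer(x).mean(dim=1)   # mean-pool into one embedding per example

encoder = FrozenEncoder().eval()
for p in encoder.parameters():
    p.requires_grad_(False)                # the pre-trained weights never change

# The only trainable parameters: a handful of "soft prompt" input embeddings.
soft_prompt = nn.Parameter(torch.randn(1, 4, 64) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def encode(token_embeddings):
    # Prepend the learned prompt to the (already embedded) input tokens.
    prompt = soft_prompt.expand(token_embeddings.shape[0], -1, -1)
    return encoder(torch.cat([prompt, token_embeddings], dim=1))

# One toy optimization step: pull text embeddings toward some target embeddings
# (random stand-ins here; in a CLIP-like setting these might be image embeddings).
tokens = torch.randn(8, 16, 64)
targets = torch.randn(8, 64)
loss = 1 - F.cosine_similarity(encode(tokens), targets).mean()
loss.backward()                            # gradients flow only into soft_prompt
optimizer.step()
```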

When these two aspects are combined, they can create revolutionary products like ChatGPT. On one hand, AI can be applied to individual tasks without the involvement of machine learning scientists; on the other hand, no programming knowledge is needed to use the diverse capabilities of AI.

The technical means behind both key traits, few-shot learning and the conversational UI, is large-scale training. An interesting observation is that from the Transformer to GPT, to GPT-3, and then to InstructGPT and ChatGPT, the fundamental building block, the Transformer layer with its self-attention mechanism, has never changed. What changed is mainly the scale of the pre-training data: BERT used only 3.3 billion words, GPT-3 used roughly 500 times more, and InstructGPT obtained nearly unlimited alignment data through RLHF. More training data lets the model cover more of the enormously complex space of problems, and therefore support a wider variety of tasks.

So, in summary, I believe the core of a foundation model is not "large"; "large" is merely the technical means to its two core traits, not the goal itself. From a machine learning perspective, the core trait is few-shot learning: adapting to new tasks without scientists in the loop, purely through the efforts of end users. From a UI perspective, an additional, optional trait is the conversational interface, which lowers the barrier to entry even further and lets far more people benefit.
