The Power of Scale
How scaling up deep learning models is revolutionising the field as we know it
Background
In May 2020, OpenAI introduced GPT-3, the third iteration of its GPT series of language generation models. The model has 175 billion parameters, more than 10 times as many as any previous non-sparse language model. To say that it was a significant improvement would be an understatement. It could write essays on a wide range of topics with few obvious inconsistencies in the output text. When trained on code samples, it could generate small code snippets (which were actually functional!) from an English description of the task supplied by the user. A group of developers trained it to update financial statements in Microsoft Excel based on casual descriptions of transactions. Needless to say, it worked better than expected.
After the monumental success of GPT-3, OpenAI introduced DALL-E, a 12-billion-parameter model trained to generate a set of images from a textual description. Although models for generating images from text had been created before, the sheer size of DALL-E, coupled with the vast amount of data it was trained on, made a huge difference. Another factor that sets it apart is its ability to render images for far more unusual prompts. The results are, unsurprisingly, groundbreaking.
Shortly after DALL-E, Google came up with a mammoth 1-trillion-parameter language model of its own, trained on a 750 GB text dataset scraped from the web. At Google I/O 2021 it also announced a new language model called LaMDA (Language Model for Dialogue Applications), which, unlike most other language models, is trained on dialogue. This means it can pick up the conversational nuances unique to open-ended conversations. The conversational agent in the movie “Her” is a good analogy for this model.
The Trend
Looking at these examples, one can notice a trend. Simple architectures with a large number of parameters, trained on huge datasets, perform better than complex algorithms. These architectures are specifically designed to utilise the parallel computational power that modern hardware offers. The pattern seems to hold especially well for natural language processing problems.
One thing to note here is that the term “simple” refers to how easy the model is to understand intuitively. For example, multiplying two 100-digit numbers may be a heavy task in terms of calculation, but intuitively the person doing it is only multiplying two numbers. The same is true of transformer-based architectures. In theory, these models are pretty straightforward in comparison to other sequence processing networks such as LSTMs. What makes them complex is the number of parameters they have and the amount of data they are trained on.
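To make the “simple” claim a little more concrete, here is a minimal sketch of scaled dot-product attention, the core operation inside a transformer. It is written in plain Python for readability and is only an illustration: the function names and the toy vectors are made up for this example, and real models add learned projections, multiple attention heads and many stacked layers, and run all of this as batched matrix multiplications on GPUs.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Score this query against every key. No query depends on the
        # result of another, which is why the work parallelises so well.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Toy input (hypothetical): three tokens, each a 2-dimensional vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attention(tokens, tokens, tokens)
```

Each output row in `mixed` is just a weighted average of the input vectors. An LSTM, by contrast, has to walk through the sequence one step at a time, carrying state from each step to the next.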
The Good
All the models discussed above produce extremely coherent results. Be it the full-page texts generated by GPT-3 or the abstract pictures by DALL-E, the output is highly detailed and fairly consistent throughout while maintaining high quality. In LaMDA’s case too, the conversational agent could produce consistent, “sensible” dialogue.
Another, much more significant advantage is the ability to learn from a very small number of training examples. This is called few-shot learning. Humans can learn a new task from one or two examples, and in some cases without any at all. Computers, on the other hand, require many more samples than that. Even fine-tuning a pre-trained model often requires thousands or tens of thousands of samples to reach reasonable accuracy. GPT-3 and similar models, however, are quite versatile in this regard and can adapt to the user’s needs from only one or two samples supplied as part of the input.
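With GPT-3, few-shot learning happens in the prompt itself: the examples are written into the input text and the model’s weights are not updated. The sketch below shows one way such a prompt could be assembled. The `build_few_shot_prompt` helper and the `Input:`/`Output:` layout are assumptions made for illustration, not an official format, and the snippet stops short of calling any model.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a prompt: instruction, a few worked examples, then the new input."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # Leave the final output empty so the model completes it.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

# Two hypothetical demonstrations are all the "training data" there is.
examples = [("cheese", "fromage"), ("house", "maison")]
prompt = build_few_shot_prompt("Translate English to French.", examples, "cat")
print(prompt)
```

The resulting text would be sent to the model as is, and the model is expected to continue after the final `Output:` by following the pattern set by the two examples.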
The Bad
As fascinating as the quality of the outputs is, large-scale transformer models like these have a significant drawback. They lack a semantic understanding of the real world. In other words, they don’t have “common sense” and may output sentences that are nonsensical in meaning. One can compare such a model to a student who is a rote learner. Such a student may be able to ace an exam by recalling and slightly improvising but will never have a strong conceptual understanding of the subject.
Because of this, many scientists and deep learning practitioners see this approach as a step backward on the path to a machine with general intelligence: one that actually understands the real world and is free from societal biases.
As Yann LeCun, one of the ‘Godfathers of AI’, said:
“Trying to build intelligent machines by scaling up language models is like building high-altitude airplanes to go to the moon. You might beat altitude records, but going to the moon will require a completely different approach.”
Conclusion
Scaling up deep learning models has produced some fascinating results, in some respects even beyond our expectations. More and more businesses and research organisations are contributing to this trend. It is also a significant step towards making computers quick learners that need only a few samples to pick up new patterns. However, models like these lack a semantic understanding of the real world and reflect biases present in modern-day society, both of which need to be addressed.
Thanks for reading this article. If you liked it, don’t forget to give it a clap, and leave a comment below if you have any suggestions. You can connect with me via Twitter or LinkedIn. If you are looking to collaborate on a project, check out my GitHub for more info.