What does it mean for a model to be large? The size of a model – a trained neural network – is measured by the number of parameters it has. These are the values in the network that get adjusted over and over during training and are then used to make the model's predictions. Roughly speaking, the more parameters a model has, the more information it can absorb from its training data, and the more accurate its predictions about new data will be.
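To make the idea concrete, here is a minimal sketch (not from the article) of what "number of parameters" means for a simple fully connected network: each layer contributes a weight matrix plus a bias vector, and the model's size is just the total count. The layer widths used are arbitrary examples.

```python
# Toy illustration: counting the parameters of a small dense network.
# A layer from n_in units to n_out units has n_in * n_out weights
# plus n_out biases; the model's "size" is the sum over all layers.

def count_parameters(layer_sizes):
    """Return the parameter count of a dense network with the given
    layer widths, e.g. [784, 128, 10]."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weights of this layer
        total += n_out         # biases of this layer
    return total

print(count_parameters([784, 128, 10]))  # 100480 + 1290 = 101770
```

The same bookkeeping, scaled up across many transformer layers, is where counts like 175 billion come from.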
GPT-3 has 175 billion parameters — 10 times more than its predecessor, GPT-2. But GPT-3 is dwarfed by the class of 2021. Jurassic-1, a commercially available large language model launched by US startup AI21 Labs in September, edged out GPT-3 with 178 billion parameters. Gopher, a new model that DeepMind released in December, has 280 billion parameters. Megatron-Turing NLG has 530 billion. Google’s Switch-Transformer and GLaM models have one trillion and 1.2 trillion parameters, respectively.
The trend is not just in the United States. This year, Chinese tech giant Huawei built a 200-billion-parameter language model called PanGu. Inspur, another Chinese firm, built Yuan 1.0, a 245-billion-parameter model. Baidu and Peng Cheng Laboratory, a research institute in Shenzhen, have announced PCL-BAIDU Wenxin, a model with 280 billion parameters that Baidu already uses in a variety of applications, including internet search, news feeds, and smart speakers. And the Beijing Academy of Artificial Intelligence announced Wu Dao 2.0, which has 1.75 trillion parameters.
Meanwhile, South Korean internet search company Naver announced a model called HyperCLOVA, with 204 billion parameters.
Each of these is an impressive feat of engineering. For a start, training a model with more than 100 billion parameters is a complex plumbing problem: hundreds of individual GPUs – the hardware of choice for training deep neural networks – must be connected and synchronized, and the training data must be split into chunks and distributed among them in the right order at the right time.
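The data-splitting step can be sketched in a few lines. This is a deliberately simplified, hypothetical illustration of data parallelism — not any lab's actual pipeline — in which each batch is divided into one shard per GPU so that every worker trains on a different slice:

```python
# Simplified sketch of data-parallel sharding: split a batch into
# equal contiguous shards, one per worker/GPU. Real systems also
# handle uneven remainders, shuffling, and gradient synchronization.

def shard_batch(batch, num_workers):
    """Split a batch into num_workers contiguous shards."""
    shard_size = len(batch) // num_workers
    return [batch[i * shard_size:(i + 1) * shard_size]
            for i in range(num_workers)]

batch = list(range(8))           # stand-in for 8 training examples
shards = shard_batch(batch, 4)   # one shard per worker
print(shards)                    # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

After each worker processes its shard, the resulting gradients must be averaged across all GPUs before the next step — the synchronization half of the problem the article describes.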
Large language models have become prestige projects that showcase a company’s technical prowess. Yet few of these new models push the research forward beyond repeating the demonstration that scaling up works well.
There are a few innovations. Once trained, Google’s Switch-Transformer and GLaM use only a fraction of their parameters to make predictions, which saves computing power. PCL-Baidu Wenxin combines a GPT-3-style model with a knowledge graph, a technique used in old-school symbolic AI to store facts. And alongside Gopher, DeepMind released RETRO, a language model with only 7 billion parameters that competes with others 25 times its size by cross-referencing a database of documents when it generates text. This makes RETRO cheaper to train than its giant rivals.
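The "fraction of their parameters" trick can be illustrated with a toy version of sparse routing, the idea behind mixture-of-experts models like Switch-Transformer. This is a hypothetical simplification, not Google's implementation: a gate scores a set of "experts" and only the highest-scoring one runs, so the rest of the model's parameters sit idle for that input.

```python
# Toy sketch of sparse (top-1) routing: a gate picks one expert per
# input, so only that expert's parameters are used for the prediction.
# Gate scores would normally come from a learned layer; here they are
# supplied directly for illustration.

def route(x, gate_scores, experts):
    """Apply only the expert with the highest gate score to x."""
    best = max(range(len(experts)), key=lambda i: gate_scores[i])
    return best, experts[best](x)

experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
chosen, y = route(10, [0.1, 0.7, 0.2], experts)
print(chosen, y)  # expert 1 wins the gate, so y = 10 * 2 = 20
```

The payoff is that total parameter count can grow with the number of experts while the compute per prediction stays roughly constant — which is how a trillion-parameter model can still be affordable to run.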