Google has partly broken OpenAI’s dominance in state-of-the-art AI: its new Gemini 1.0 Ultra and Gemini 1.5 models have surpassed the famous GPT-4 on most tests. Gemini is Bard’s replacement and Google’s most powerful large language model. Its Ultra version, released in early February, outperformed human experts on the MMLU (Massive Multitask Language Understanding) benchmark, which includes problems from 57 subjects. Only a week later, Google released the Gemini 1.5 Pro model, which can process huge amounts of information in text, audio, or video form, and which required less processing power to train than Gemini 1.0 Ultra.
Gemini 1.5 is built on the Mixture of Experts architecture. Whereas most neural networks pass all incoming information through a single dense network, a Mixture of Experts model contains multiple specialized subnetworks called experts. At each layer, a router decides which expert should process each token of information. Because only a few experts are activated for any given token, the model does far less computation per token than a dense network of the same total size, which is why Mixture of Experts models are less expensive to train in terms of processing power.
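The routing idea can be illustrated with a minimal sketch. This is not Google's implementation (which is unpublished); it is a toy top-2 router in which each "expert" is stood in for by a single weight matrix, and the expert count, dimensions, and weights are arbitrary assumptions for illustration:

```python
# Toy Mixture-of-Experts layer: a router scores each expert for a token,
# and only the top-k experts actually run. Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(0)

DIM, NUM_EXPERTS, TOP_K = 8, 4, 2  # assumed sizes, chosen for illustration

# Each expert: one weight matrix standing in for a feed-forward block.
experts = [rng.normal(size=(DIM, DIM)) for _ in range(NUM_EXPERTS)]
# Router: projects a token embedding to one score per expert.
router_w = rng.normal(size=(DIM, NUM_EXPERTS))

def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route one token through its top-k experts and mix their outputs."""
    scores = token @ router_w              # one score per expert
    top = np.argsort(scores)[-TOP_K:]      # indices of the best-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()               # softmax over the chosen experts only
    # Only TOP_K of NUM_EXPERTS experts execute, so compute per token
    # stays low even as the total number of experts (and parameters) grows.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

out = moe_layer(rng.normal(size=DIM))
```

The key property is in the last lines: adding more experts increases the model's parameter count, but each token still pays only for the few experts the router selects.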
According to Google’s technical paper, Gemini 1.5 Pro uses this architecture to great effect. It can take an input of 3 hours of video, 22 hours of audio, or 7 million words of text and recall information from that input. For example, the researchers gave Gemini 1.5 Pro the 45-minute 1924 silent film “Sherlock Jr.”; at timestamp 12:01, a character takes a piece of paper out of their pocket, and the camera focuses on the text written on it. The researchers asked the model for “key information from the piece of paper that is removed from the person’s pocket, and the time code of that moment”. The model successfully found the scene, gave the time code, and read the text on the paper.
The paper also showcased Gemini 1.5 Pro’s translation skills with the obscure Kalamang language. Kalamang is spoken by fewer than 200 people in New Guinea, so there is essentially no Kalamang text in Gemini 1.5 Pro’s training data. The researchers gave the model the small amount of existing documentation about Kalamang, and it learned the language well enough to translate to and from Kalamang at the same level as a human. (Incidentally, it was much better at translating from English to Kalamang than the other way around.) The model also outperformed previous Gemini models in coding, mathematics, video comprehension, and instruction following.
Gemini 1.5’s ability to analyze such large amounts of information represents a major advance in AI. But given the recent rapid progress of the field, it is likely that another model, whether a new version of GPT or something else entirely, will soon surpass Gemini 1.5.