Gemini is a family of large language models (LLMs) developed by Google AI. They are designed to be more powerful and flexible than previous LLMs, capable of handling a wider range of tasks and data formats. Here’s a detailed breakdown of key aspects:

Model(s) Architecture:

  • Multimodal Encoders: Encoders for specific data types i.e. text, audio, image to convert data to a common representation that the model can understand.
  • Early/Late Fusion: Combines the output from various encoders and allows the model to learn the relationship between different modalities.
  • Transformer Decoder: Generates the desired output based on the combined representation and the specific task at hand
  • Pathways Architecture: It allows for modular design for better scaling and customization. Custom components (“Pathways”) can be added or removed depending on the specific task or desired use-cases.

Gemini Models

  • Gemini Nano: 500M parameter model optimized for efficiency and resource-constrained environments. Ideal for basic text generation, translation, and summarization tasks.
  • Gemini Pro: 17B parameter model offering a balance of performance and versatility. Suitable for a wider range of tasks, including complex text generation, question answering, and multimodal reasoning.
  • Gemini Ultra: 170B parameter model pushing the boundaries of LLM capabilities. Designed for tackling the most demanding tasks requiring exceptional reasoning, knowledge, and adaptability.

Training Process

  • Distributed training: Leverages multiple TPU v4 pods, enabling efficient training on massive datasets.
  • Technologies Used: Python, JAX and Pathways, XLA Compiler, etc.
  • Pathway-specific training: Allows for fine-tuning individual Pathways to specific tasks or data formats, increasing performance and efficiency.

Training Data

They trained the models on multimodal and multilingual datasets. Below are the key details:

  • Tokenizer: SentencePiece tokenizer, capable of efficiently tokenizing non-Latin scripts too
  • Datasets: massive scale of multimodal data from web documents, books, and code along with some of Google’s internal data sources.
  • Dataset Size » Determining Factor for Model Size: The number of tokens used to train the largest models was determined following the approach in Hoffmann et al. (2022).
  • Dataset Size » Training Strategy for Smaller Models: Smaller models are trained for significantly more tokens to improve performance for a given inference budget.
  • Data Curation/Quality Control: Quality filters applied, including heuristic rules and model-based classifiers. Safety filtering is performed to remove harmful content.

Evaluation Process

Standard NLP tasks: Gemini models achieve state-of-the-art results on various tasks, including:

  • Text generation: Outperforming previous LLMs in generating realistic and coherent text formats like poems, code, scripts, and musical pieces.
  • Translation: Achieves high accuracy in translating between languages, preserving semantic meaning and stylistic nuances.
  • Question answering: Demonstrates exceptional ability to answer complex, open-ended questions, drawing on factual knowledge and reasoning.
  • Summarization: Creates concise and informative summaries of textual content, capturing key points and relevant information.

Novel benchmarks: To assess real-world comprehension and reasoning, Gemini was evaluated on tasks involving:

  • Causal reasoning: Understanding and predicting cause-and-effect relationships in complex scenarios.
  • Commonsense reasoning: Applying common-sense knowledge to solve problems and answer questions.
  • Multimodal reasoning: Integrating information from different modalities (text, images, audio) to make informed decisions.

Performance Comparison:

DeepMind has compared the outcomes of most of the previous state-of-the-art models. You can find detailed data points here: https://deepmind.google/technologies/gemini/#capabilities

This is an initial draft of my findings according to the research done on Gemini. I will keep this post updated.

References: