Grok released by xAI

Grok is an AI modeled after "The Hitchhiker's Guide to the Galaxy," designed to answer almost any question and even to suggest approaches to tricky problems! It has a sense of humor and a rebellious streak, so if you don't like humor, please don't use it! Grok's unique and fundamental advantage is its real-time knowledge of the world through the 𝕏 platform. It will also answer questions that most other AI systems refuse to address. Grok is still in an early testing phase - this is the best we could achieve with two months of training - so expect it to improve rapidly each week with your help. Thank you, the xAI Team

Why We Built Grok

At xAI, we aim to create AI tools that assist humans in the pursuit of understanding and knowledge.

By creating and improving Grok, our goals are:

  • To gather feedback and make sure we are building AI tools that maximally benefit all of humanity. We believe it is essential to design AI tools that are useful to people of all backgrounds and political views, and to put those tools in users' hands while complying with the law. Our goal with Grok is to explore and demonstrate this approach in public.

  • To empower research and innovation: We want Grok to be a powerful research assistant for anyone, helping them quickly access relevant information, process data, and generate new ideas. Our ultimate goal is for our AI tools to assist in the pursuit of understanding.

The Journey to Grok-1

The engine powering Grok is Grok-1, a cutting-edge LLM we have developed over the past four months. Grok-1 has undergone multiple iterations during this time.

After launching xAI, we trained a prototype LLM (Grok-0) with 33 billion parameters. This early model performed comparably to LLaMA 2 (70B) on standard LM benchmarks while using only half of its training resources. Over the past two months, we have made significant progress in reasoning and coding capabilities, culminating in Grok-1, a state-of-the-art language model that is significantly more powerful, achieving 63.2% on the HumanEval coding task and 73% on MMLU.

To understand the capability improvements we made with Grok-1, we conducted a series of evaluations using standard machine learning benchmarks designed to measure mathematical and reasoning abilities.

  • GSM8k: middle school math word problems (Cobbe et al. 2021), using the chain-of-thought prompt.

  • MMLU: multidisciplinary multiple-choice questions (Hendrycks et al. 2021), with 5 in-context examples.

  • HumanEval: Python code completion task (Chen et al. 2021), evaluated zero-shot for pass@1.

  • MATH: middle school and high school math problems written in LaTeX (Hendrycks et al. 2021), prompted with a fixed 4-shot prompt.
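
As a point of reference for the HumanEval row below, pass@1 is the probability that a single sampled completion passes a problem's unit tests. The snippet below is the standard unbiased pass@k estimator from Chen et al. (2021), shown only for illustration; it is not xAI's evaluation harness, and the sample counts in the example are invented.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: completions sampled per problem
    c: completions that pass the unit tests
    k: number of completions we are allowed to submit
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a passing completion
    # 1 - C(n - c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 200 samples for one problem, 37 pass; pass@1 = c/n
print(round(pass_at_k(n=200, c=37, k=1), 3))  # 0.185
```

For k = 1 the estimator reduces to c/n, the fraction of samples that pass; the benchmark-level pass@1 score is this quantity averaged over all problems.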

| Benchmark | Grok-0 (33B) | LLaMa 2 70B | Inflection-1 | GPT-3.5 | Grok-1 | PaLM 2 | Claude 2 | GPT-4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GSM8k | 56.8% 8-shot | 56.8% 8-shot | 62.9% 8-shot | 57.1% 8-shot | 62.9% 8-shot | 80.7% 8-shot | 88.0% 8-shot | 92.0% 8-shot |
| MMLU | 65.7% 5-shot | 68.9% 5-shot | 72.7% 5-shot | 70.0% 5-shot | 73.0% 5-shot | 78.0% 5-shot | 75.0% 5-shot + CoT | 86.4% 5-shot |
| HumanEval | 39.7% 0-shot | 29.9% 0-shot | 35.4% 0-shot | 48.1% 0-shot | 63.2% 0-shot | - | 70% 0-shot | 67% 0-shot |
| MATH | 15.7% 4-shot | 13.5% 4-shot | 16.0% 4-shot | 23.5% 4-shot | 23.9% 4-shot | 34.6% 4-shot | - | 42.5% 4-shot |

On these benchmarks, Grok-1 showed strong results, surpassing all other models in its compute class, including ChatGPT-3.5 and Inflection-1. It is only surpassed by models trained with far larger amounts of data and compute, such as GPT-4. This showcases the rapid progress we have made at xAI in training LLMs with exceptional efficiency.

Since these benchmarks can be found on the web and we cannot rule out that our model was inadvertently trained on them, we hand-graded our model (as well as Claude-2 and GPT-4) on the Hungarian National High School Math Exam held in May 2023, which was published after we collected our dataset. Grok passed the exam with a C (59%), Claude-2 achieved the same grade (55%), and GPT-4 received a B (68%). All models were evaluated at a temperature of 0.1 with the same prompt. It should be noted that we made no effort to tune for this evaluation; the experiment serves as a "real-life" test on a dataset our model was never explicitly tuned for.

| Human-graded evaluation | Grok-0 | GPT-3.5 | Claude 2 | Grok-1 | GPT-4 |
| --- | --- | --- | --- | --- | --- |
| Hungarian National High School Math Exam (May 2023) | 37% 1-shot | 41% 1-shot | 55% 1-shot | 59% 1-shot | 68% 1-shot |

We provide a summary of important technical details for Grok-1 in our model card.

xAI Engineering Design

At the frontier of deep learning research, reliable infrastructure must be built with the same care as datasets and learning algorithms. To create Grok, we built a custom training and inference stack based on Kubernetes, Rust, and JAX.
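
As a rough illustration of the kind of JAX building block such a training stack rests on (a sketch only, not xAI's code; the toy model, learning rate, and shapes are placeholders), here is a data-parallel step that averages gradients across devices with an all-reduce:

```python
import functools
import jax
import jax.numpy as jnp

LEARNING_RATE = 1e-3  # placeholder hyperparameter

def loss_fn(params, batch):
    # Toy linear model standing in for a real LLM forward pass.
    x, y = batch
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@functools.partial(jax.pmap, axis_name="devices")
def train_step(params, batch):
    grads = jax.grad(loss_fn)(params, batch)
    # All-reduce: average gradients so every replica applies the same update.
    grads = jax.lax.pmean(grads, axis_name="devices")
    return jax.tree_util.tree_map(lambda p, g: p - LEARNING_RATE * g, params, grads)

if __name__ == "__main__":
    n_dev = jax.local_device_count()
    params = {"w": jnp.zeros((8, 1)), "b": jnp.zeros((1,))}
    # Replicate parameters across devices and shard a synthetic batch.
    params = jax.device_put_replicated(params, jax.local_devices())
    x, y = jnp.ones((n_dev, 4, 8)), jnp.ones((n_dev, 4, 1))
    params = train_step(params, (x, y))
```

A production stack wraps failure detection, checkpointing, and scheduling (the Kubernetes and Rust pieces) around a core step like this.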

LLM training is like a train thundering ahead: if one car derails, the entire train can be dragged off the tracks, and it is hard to get it back on. GPU failures come in many forms: manufacturing defects, loose connections, incorrect configuration, degraded memory chips, occasional random bit flips, and more. During training, we synchronize computation across thousands of GPUs for months, and at that scale all of these failure modes become frequent. To overcome these challenges, we use custom distributed systems that ensure every kind of failure is identified immediately and handled automatically. At xAI, we have made maximizing useful compute per watt a key focus of our efforts. Over the past few months, our infrastructure has allowed us to minimize downtime and maintain a high Model FLOP Utilization (MFU) even in the presence of unreliable hardware.
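
For readers unfamiliar with MFU: it is the ratio between the FLOPs the model usefully performs and the theoretical peak of the hardware. A back-of-the-envelope sketch follows, using the common ~6 × parameters FLOPs-per-token approximation for dense transformer training; every number in the example is hypothetical, not an xAI measurement.

```python
def mfu(params: float, tokens_per_second: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOP Utilization: achieved model FLOP/s over hardware peak FLOP/s."""
    achieved = 6 * params * tokens_per_second      # ~6N FLOPs per trained token
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak

# Hypothetical example: a 33B-parameter model at 1M tokens/s on 1,000 GPUs,
# each with a peak of 1e15 FLOP/s.
print(f"{mfu(33e9, 1e6, 1000, 1e15):.1%}")  # 19.8%
```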

Rust has proven to be an ideal choice for building scalable, reliable, and maintainable infrastructure: it offers high performance and a rich ecosystem, and it prevents most of the bugs typically found in distributed systems. Given our small team size, infrastructure reliability is crucial; otherwise maintenance would crowd out innovation. Rust gives us confidence that any code modification or refactor is likely to produce working programs that will run for months with minimal supervision.

We are now preparing for the next leap in our model capabilities, which will require reliably coordinating training runs across tens of thousands of accelerators, running internet-scale data pipelines, and integrating new types of features and tools into Grok. If this sounds exciting to you, please apply to join our team.

xAI Research

We provide Grok with search tools and real-time access to information, but like all LLMs trained on next-token prediction, our model can still generate false or contradictory information. We believe that achieving reliable reasoning is the most important research direction for addressing the limitations of current systems. Here we highlight a few promising research directions that excite us at xAI:

  • Scalable tool-assisted supervision. Human feedback is essential, but providing consistent and accurate feedback can be challenging, especially when reviewing lengthy code or complex reasoning steps. AI can assist with scalable oversight by looking up references from different sources, verifying intermediate steps with external tools, and seeking human feedback only when necessary; a minimal sketch of such a loop appears after this list. Our goal is to use our models' assistance to make the most effective use of our AI mentors' time.

  • Integration with formal verification for safety, reliability, and foundations. To create AI systems with profound reasoning capabilities, we plan to cultivate reasoning abilities in less ambiguous and more verifiable contexts. This allows us to evaluate our systems without human feedback or interaction with the real world. A primary goal of this approach is to provide formal guarantees for code correctness, particularly in the area of formal verification related to AI safety.

  • Long-context understanding and retrieval. Training models to efficiently discover useful knowledge in a given context is at the core of producing truly intelligent systems. We are exploring methods that can discover and retrieve information whenever it is needed.

  • Adversarial robustness. Adversarial examples show that optimizers can easily exploit vulnerabilities in AI systems, both during training and at serving time, causing them to make serious mistakes. These vulnerabilities are long-standing weaknesses of deep learning models. We are particularly focused on improving the robustness of LLMs, reward models, and monitoring systems.

  • Multimodal capabilities. Currently, Grok lacks other senses, such as vision and audio. To better assist users, we will equip Grok with these different senses, enabling a broader range of applications, including real-time interaction and assistance.
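
To make the tool-assisted supervision idea above concrete, here is a minimal sketch of the kind of loop it describes: verify each intermediate step with an external tool where possible, and spend human feedback only on steps the tool cannot decide. All names here (generate_steps, check_with_tool, ask_human) are hypothetical placeholders, not an xAI API.

```python
from typing import Callable

def review(question: str,
           generate_steps: Callable[[str], list[str]],
           check_with_tool: Callable[[str], bool | None],
           ask_human: Callable[[str], bool]) -> bool:
    """Return True only if every reasoning step is verified."""
    for step in generate_steps(question):
        verdict = check_with_tool(step)   # e.g. run code, query a reference
        if verdict is None:               # the tool cannot decide
            verdict = ask_human(step)     # spend scarce human feedback here
        if not verdict:
            return False
    return True
```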

We believe AI has tremendous potential to provide significant scientific and economic value to society, and we will strive to develop reliable safeguards to prevent catastrophic malicious use. We believe that every effort should be made to ensure that AI remains a positive force.

If you want to contribute to our mission, please apply to join our team.
