Our Work

Stay tuned for more updates!

Benchmark

A pioneering benchmark measuring how well Large Language Models (LLMs) solve advanced academic problems, up to graduate level. We also explore automatic evaluation with GPT-4 using generated rubrics.

Learn more →
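
As a rough illustration of the rubric idea (not the benchmark's actual pipeline), one can ask GPT-4 to write a grading rubric for a problem and then score a candidate answer against it. The prompts and function names below are placeholders we introduce for the sketch.

```python
# Hypothetical sketch of rubric-based grading with GPT-4; prompts and
# function names are illustrative, not the benchmark's actual pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_rubric(problem: str, reference_solution: str) -> str:
    """Ask GPT-4 to write a point-by-point grading rubric for one problem."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Write a grading rubric with point values for this problem.\n"
                f"Problem:\n{problem}\n\nReference solution:\n{reference_solution}"
            ),
        }],
    )
    return resp.choices[0].message.content

def grade_answer(problem: str, rubric: str, model_answer: str) -> str:
    """Ask GPT-4 to score a candidate answer against the generated rubric."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Problem:\n{problem}\n\nRubric:\n{rubric}\n\n"
                f"Candidate answer:\n{model_answer}\n\n"
                "Score the answer item by item and report a total."
            ),
        }],
    )
    return resp.choices[0].message.content
```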

Computer Activity Dataset

A dataset for training a multimodal computer agent that can perform a variety of tasks, from interacting with a web page to playing computer games. We collect screen-capture video, audio, keyboard and touchpad inputs (and even eye-tracking data) to better emulate how a human interacts with a computer.
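
A minimal sketch, assuming a simple time-aligned record layout, of how one sample combining these modalities might be organized; the field names and types are illustrative assumptions, not the dataset's real schema.

```python
# Illustrative schema for one time-aligned sample of computer activity;
# field names and types are assumptions, not the dataset's real format.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class InputEvent:
    timestamp_ms: int              # time relative to the start of the recording
    device: str                    # "keyboard" or "touchpad"
    action: str                    # e.g. key press, pointer move, click
    payload: dict = field(default_factory=dict)

@dataclass
class ActivitySample:
    screen_frames: list[str]       # paths to captured screen-video frames
    audio_path: str                # path to the synchronized audio track
    input_events: list[InputEvent] # keyboard and touchpad stream
    gaze_path: Optional[str] = None    # eye-tracking stream, when available
    task_label: Optional[str] = None   # e.g. "web browsing", "gaming"
```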

Minerva-OCR

An OCR model with LaTeX support whose quality is comparable to proprietary OCR solutions.

Multilingual LLM

Building higher-quality non-English datasets and tokenizers, and finetuning non-English GPTs from a pretrained English GPT. Our goal is to improve the performance of non-English language models and make them more accessible to a wider audience.
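
A minimal sketch, assuming the Hugging Face transformers API, of how an English GPT's vocabulary can be extended with non-English tokens before finetuning; the base model and token list are placeholders, not the project's actual setup.

```python
# Hedged sketch: extending an English GPT with non-English vocabulary before
# finetuning. "gpt2" and the token list are placeholders, not our actual setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "gpt2"  # placeholder pretrained English GPT
tokenizer = AutoTokenizer.from_pretrained(base)

# Tokens learned from a non-English corpus (assumed to be prepared elsewhere).
new_tokens = ["..."]  # placeholder list
tokenizer.add_tokens(new_tokens)

model = AutoModelForCausalLM.from_pretrained(base)
# Grow the embedding matrix so the new vocabulary entries get rows, then
# finetune on non-English text with the standard causal-LM objective.
model.resize_token_embeddings(len(tokenizer))
```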

Video Pretraining

Exploring efficient video pretraining methods to improve the downstream performance of generalist agents.