Our Work
Stay tuned for more updates!
Benchmark
A pioneering benchmark measuring the performance of Large Language Models (LLMs) on advanced academic problems, up to graduate level. We also explore rubric-based evaluation using GPT-4.
Computer Activity Dataset
A dataset for training a multimodal computer agent that can perform a variety of tasks, from interacting with a web page to playing computer games. We collect video screenshots, audio, keyboard and touchpad inputs, and even eye-tracking data to better emulate how a human interacts with a computer.
Minerva-OCR
An OCR model with LaTeX support whose quality is comparable to proprietary OCR solutions.
Multilingual LLM
Building higher-quality non-English datasets and tokenizers, and finetuning non-English GPTs from a pretrained English GPT. Our goal is to improve the performance of non-English language models and make them accessible to a wider audience.
Video Pretraining
Exploring efficient video pretraining methods to improve the downstream performance of generalist agents.