ZeroEval

Auto-optimizer for AI agents

Our mission is to build self-improving AI agents. We're here to close the last-mile reliability gap. Through calibrated LLM judges and automatic evaluations, we help companies optimize their agents 10x faster than manual experimentation.
Active Founders
Jonathan Chávez
Co-Founder
Co-Founder at ZeroEval. Previously, I was an early employee on the LLM Observability team at Datadog. I did undergrad research on Vision Transformers for particle physics and RL for robotics. First engineer at Atrato (YC W21) and early engineer at Bemmbo (acquired).
Sebastian Crossa
Co-Founder
Co-Founder @ ZeroEval. Previously founding engineer at Micro, building the future of email (backed by a16z), and founding engineer at Atrato Pago (YC W21). Built and scaled Minecraft servers in my spare time during high school.
Company Launches
ZeroEval - Build self-improving agents

Hey everyone, we're Sebastian and Jonathan - founders of ZeroEval.

TL;DR

ZeroEval is a tool that helps you build reliable AI agents through evaluations that learn from their mistakes and get better over time.

https://www.youtube.com/watch?v=hSkpdHE7mCs

The problem

Evaluating complex AI systems is hard and time-consuming, and it only gets harder as your agents grow more complex. This is especially the case when building:

  1. Long-running, multi-turn agents with dozens of intermediate tool calls
  2. Agents where you want to measure the quality of images, video, generated UI, audio, personality, taste, etc.

Current offline eval methods are high-friction: continuously curating labeled data and writing experiments and evaluators takes a lot of work.

On the other hand, current LLM judges are static and often perform poorly: they lack context on how they fail and on the nuances of the task at hand.

Your AI agents are only as good as your evals. Without them, reaching the quality threshold your product needs will feel like a never-ending task.

What we’re building

A way to create calibrated LLM judges that get better over time as they see more production data and as more incorrect samples are labeled. The more you teach a judge where it fails, the more reliable it becomes.

Once you have a judge that matches the human preference baseline, you can continue using it on production data or in offline experiments.
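
To make "matches the human preference baseline" concrete, here is a minimal sketch of how judge calibration might be measured, assuming pass/fail verdicts; the function names and toy data are illustrative, not ZeroEval's API:

```python
# Minimal sketch of judge calibration: compare an LLM judge's
# pass/fail verdicts against human labels. All names and data are
# illustrative assumptions, not ZeroEval's actual API.

def agreement(judge_verdicts: list[bool], human_labels: list[bool]) -> float:
    """Fraction of samples where the judge agrees with the human reviewer."""
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)

def disagreements(judge_verdicts: list[bool], human_labels: list[bool]) -> list[int]:
    """Indices where the judge got it wrong -- the samples worth labeling
    and feeding back into the judge's rubric or few-shot examples."""
    return [i for i, (j, h) in enumerate(zip(judge_verdicts, human_labels)) if j != h]

# Toy data: eight agent outputs scored by the judge and by a human.
judge = [True, True, False, True, False, True, True, False]
human = [True, False, False, True, False, True, True, True]

print(f"agreement: {agreement(judge, human):.0%}")              # 75%
print(f"samples to teach from: {disagreements(judge, human)}")  # [1, 7]
```

The disagreeing samples are exactly the ones to label and teach from, which is the loop described above.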

We’re also introducing Autotune, which automatically evaluates dozens of models and optimizes prompts based on a few human-labeled samples.
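
Conceptually, this kind of tuning is a search over model and prompt candidates scored against the human samples. The sketch below illustrates the general idea with stand-in model names and a trivial exact-match scorer; it is an assumption about the shape of the problem, not Autotune's implementation:

```python
# Generic sketch of automatic model/prompt search against a few
# human-labeled samples. Model names, run_agent, and score are
# stand-ins, not Autotune's real implementation.
from itertools import product

MODELS = ["model-a", "model-b", "model-c"]              # hypothetical models
PROMPTS = ["Answer tersely.", "Think step by step, then answer."]

def run_agent(model: str, prompt: str, task_input: str) -> str:
    """Stand-in for calling a model with a candidate system prompt."""
    return f"{model}|{prompt}|{task_input}"             # placeholder output

def score(output: str, expected: str) -> float:
    """Stand-in for a calibrated judge; here, trivial exact match."""
    return float(output == expected)

def autotune(samples: list[tuple[str, str]]) -> tuple[str, str]:
    """Return the (model, prompt) pair with the best mean score
    over (input, expected) sample pairs."""
    def mean_score(model: str, prompt: str) -> float:
        return sum(score(run_agent(model, prompt, x), y) for x, y in samples) / len(samples)
    return max(product(MODELS, PROMPTS), key=lambda mp: mean_score(*mp))
```

In practice the scorer would be the calibrated judge from the previous step, so judge quality directly bounds how well this search can work.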

We envision a future where AI software improves based on human feedback: developers define the evaluation criteria as a starting point, and errors propagate back to find the optimal implementation.

The team

We met during our first year of college in Mexico over 7 years ago. Since then, we've worked on side projects together, joined a leading fintech startup as its first engineers, and most recently built llm-stats.com, a leading LLM leaderboard that reached 60k MAU and ⅓ million unique users within a few months of launch.

  • Sebastian was founding engineer at Micro building the future of email (backed by a16z), as well as founding engineer at Atrato (YC W21).
  • Jonathan was an early employee on the LLM observability team at Datadog. He did undergrad research on vision transformers for particle physics and RL for robotics.

Foundation models have transformed the world. We're building the second line of offense to fill their capability gaps and create AI products that actually work. We are determined to build the engine behind self-improving software for the decades to come.

Our ask

If you have AI agents in production and are struggling to measure their quality and/or achieve the reliability needed for your product's success, we’d love to chat!

We don't just deliver a tool; we'll sit with you to understand your pain points and help you build high-quality evals.

Feel free to reach out at founders@zeroeval.com or book a demo.

ZeroEval
Founded: 2025
Batch: Summer 2025
Team Size: 2
Status: Active
Location: New York
Primary Partner: Jon Xu