Kashikoi - Simulation Engine for Benchmarking AI Agents

Autonomously interview your Agents!

Hey YC!
We are Tim and Aaksha, co-founders of Kashikoi. Kashikoi is a simulation engine for benchmarking GenAI agents: we generate CPU-friendly world models that autonomously interview agents and produce deep behavioral assessments.

The Problem

Building high-performing AI agents is becoming increasingly complex. Teams face many challenges:

  • Managing prompt bloat and keeping up with endless prompt tuning cycles.
  • Evaluating their agents (or competitors') meaningfully and efficiently.
  • Understanding agent performance in ways that reflect real-world values and expectations—not just public benchmarks.

Despite growing interest and investment, most solutions rely heavily on prompt engineering, public benchmarks, or surface-level observability. These approaches often mislead more than they inform, creating a false sense of progress.

The Solution

Today’s LLMs are adaptive systems: behind that tiny blinking “Thinking…” indicator, they run test-time adaptation loops. We are building world models, a scalable form of test-time adaptation and inference scaling that brings the power of these techniques to you.

Simply put, you can simulate highly customized benchmarks, generate diverse data, and keep your evaluations aligned over the long run, all without writing prompts! As fun side effects, our world models also unlock automatic prompt optimization and detection of stale regression tests.
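To make the "autonomous interview" idea concrete, here is a minimal, purely illustrative sketch, not Kashikoi's actual API: a toy interviewer that adaptively probes an agent and spends more turns on the topics where the agent scores poorly. Every name below (`interview`, `next_topic`, the `agent`/`grade` callables) is an assumption made up for this sketch.

```python
# Hypothetical sketch of an adaptive interview loop (NOT Kashikoi's API).
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class InterviewState:
    # topic -> list of scores recorded so far
    scores: dict = field(default_factory=dict)

def next_topic(state, topics):
    # Adaptive step: probe unexplored topics first, then drill into the
    # topic with the lowest mean score (the agent's weakest area).
    for t in topics:
        if t not in state.scores:
            return t
    return min(topics, key=lambda t: mean(state.scores[t]))

def interview(agent, topics, turns, grade):
    """Run a `turns`-long adaptive interview of `agent`.

    `agent(topic) -> answer` is the system under test;
    `grade(topic, answer) -> float` stands in for a world-model judge.
    """
    state = InterviewState()
    for _ in range(turns):
        topic = next_topic(state, topics)
        answer = agent(topic)
        state.scores.setdefault(topic, []).append(grade(topic, answer))
    return state.scores

# A toy agent strong at retrieval but weak at math: the interviewer
# quickly discovers this and spends most turns probing math.
scores = interview(
    agent=lambda t: "ok" if t == "retrieval" else "unsure",
    topics=["retrieval", "math"],
    turns=6,
    grade=lambda t, a: 1.0 if a == "ok" else 0.0,
)
```

The point of the sketch is only the feedback loop: scoring each answer and steering the next question toward observed weaknesses, rather than replaying a fixed benchmark.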

LLM-based systems are getting smarter, and with our world models, so can your evaluations!

Check out our simulation engine adaptively interviewing RAG agents, and see multi-turn evaluation in action here.

Why Us

  • Tim and Aaksha used similar world-model tech at Moveworks to massively reduce dev cycles, shipping 250+ customized, enterprise-ready agents.
  • Aaksha has done cutting-edge research on Transformers at CMU (long before OpenAI made them cool). She shipped edge speech models on 1bn+ iPhones; the innovation behind these models was published at Interspeech 2021 and nominated for a Best Paper Award that year.
  • Tim has found many high-impact security vulnerabilities throughout his career. One of his top discoveries was a bug in all Qualcomm GPS chips, enabling a 50-mile, 0-click exploit that had no mitigations. Tim has many public CVEs across Apple products, including Safari, macOS, iOS, tvOS, and iTunes.

Our Ask

Sign up for our waitlist at getkashikoi.com, or email us at founders@getkashikoi.com if you or someone you know wants:

  • Instant evals on your agent (or a competitor’s 😉); we’ll generate a report for you.
  • Advanced features, like automatic prompt optimization, world models, and inference scaling, made to work seamlessly for you.
  • Reliable, prompt-free evaluations aligned with your values and expectations (which we auto-encode in a special edge-friendly world model for you).
  • Don’t come to us if:
    • You love writing prompts
    • You trust public benchmarks
    • You think that good observability is enough to make your agent win
    • You aren’t ready to have an honest conversation about your agents’ performance
  • Jokes aside 😅, if you know enterprises building agents suffering from prompt bloat, please (pretty please 🥺) send them our way!