Most benchmarks measure momentary intelligence.
We measure how well reasoning holds up over time.
Models are tested on static questions. Real work isn't static.
Benchmarks measure peaks, not stability.
Performance spikes without context
Sustained, reliable performance
We design evaluations that unfold over time.
Our experiments introduce change, drift, and feedback.
Measuring how models maintain coherent reasoning patterns across hours and days of operation.
Tracking degradation or improvement in response quality over extended task sequences.
Identifying when and how model behavior shifts from expected patterns.
Novel evaluation frameworks that test model behavior across extended time periods and complex task sequences.
Infrastructure for running continuous, time-based assessments of model performance and reliability.
Transparent research outputs documenting our findings on long-term model behavior and evaluation methodologies.
The Lab Notes — ongoing insights, experiments, and discoveries from our research into model reliability.
Compendia Labs is led by Hitesh (AI systems, agent pipelines) and Tanya (biological grounding, constraint modeling).
Together, we study how intelligence holds up over time and under pressure.
Be the first to know when we release our benchmarks, initial findings, and insights on long-horizon model behavior.
No spam. Unsubscribe anytime.
If you care about model reliability over time — follow our research.