Independent Research Group

Studying long-horizon behavior in large language models

Most benchmarks measure momentary intelligence.

We measure how well reasoning holds up over time.

The Challenge

The Problem

Models are tested on static questions. Real work isn't static.

Benchmarks measure peaks, not stability.

Current Evaluations
Single-point performance measurement

Performance spikes without context

Long-Horizon Evaluations
Continuous performance tracking

Sustained, reliable performance

Our Approach

Research Focus

We design evaluations that unfold over time.

Our experiments introduce change, drift, and feedback.

Consistency Across Time
How models maintain reasoning quality over extended interactions

Temporal Consistency

Measuring how models maintain coherent reasoning patterns across hours and days of operation.

Output Quality Metrics

Tracking degradation or improvement in response quality over extended task sequences.

Behavioral Drift Detection

Identifying when and how model behavior shifts from expected patterns.
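
For illustration only, one minimal way to frame such a check is to compare a rolling window of per-response quality scores against a baseline window and flag large shifts. The sketch below assumes each response has already been reduced to a scalar quality score; the function names, thresholds, and synthetic data are hypothetical placeholders, not our actual tooling.

    # Hypothetical sketch of a rolling-window drift check.
    # Assumes each model response has already been reduced to a scalar
    # quality score; how that score is produced is out of scope here.
    from collections import deque
    from statistics import mean, stdev

    def detect_drift(scores, baseline_size=50, window_size=20, z_threshold=2.0):
        """Yield (step, z) whenever the rolling-window mean of `scores`
        sits more than `z_threshold` baseline standard deviations away
        from the baseline mean."""
        baseline = scores[:baseline_size]
        if len(baseline) < baseline_size:
            raise ValueError("not enough scores to form a baseline")
        mu, sigma = mean(baseline), stdev(baseline)
        window = deque(maxlen=window_size)
        for step, score in enumerate(scores[baseline_size:], start=baseline_size):
            window.append(score)
            if len(window) == window_size and sigma > 0:
                z = (mean(window) - mu) / sigma
                if abs(z) > z_threshold:
                    yield step, z

    if __name__ == "__main__":
        import random
        random.seed(0)
        # Synthetic demo: stable scores for 100 steps, then a slow decline.
        scores = [random.gauss(0.8, 0.05) for _ in range(100)]
        scores += [random.gauss(0.8 - 0.002 * i, 0.05) for i in range(200)]
        first = next(detect_drift(scores), None)
        if first:
            step, z = first
            print(f"drift first flagged at step {step} ({z:+.2f} sd from baseline)")

In practice the score source, baseline length, and threshold are all study-specific design choices rather than fixed values.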

Our Work

What We're Building

  • Long-horizon benchmarks for LLMs

    In Development

    Novel evaluation frameworks that test model behavior across extended time periods and complex task sequences.

  • Scalable evaluation environments

    In Development

    Infrastructure for running continuous, time-based assessments of model performance and reliability.

  • Open reports and publications

    Coming Soon

    Transparent research outputs documenting our findings on long-term model behavior and evaluation methodologies.

  • Public research blog

    Coming Soon

    The Lab Notes — ongoing insights, experiments, and discoveries from our research into model reliability.

Team

People

Compendia Labs is led by Hitesh (AI systems, agent pipelines) and Tanya (biological grounding, constraint modeling).

Together, we study how intelligence holds up under time and pressure.

FAQ

Frequently Asked Questions

Stay Updated

Be the first to know when we release our benchmarks, initial findings, and insights on long-horizon model behavior.

No spam. Unsubscribe anytime.

If you care about model reliability over time — follow our research.