Benchmark how well AI models understand relationships.
Relational Bench measures causal depth, non-reductionism, evidence separation, systems thinking, and hidden dependency detection across complex domains.
Current benchmarks test isolated intelligence.
They ask: can the model answer? Relational Bench asks something harder.
Can the model see what the answer depends on?
Isolated answers
Most benchmarks reward fluency, not relational structure.
Reductionism passes
Standard scoring praises confident one-cause explanations.
Vague unity gets points too
"Everything is connected" passes most rubrics. It shouldn't.
The RI Score
10 measurable categories. Weighted, penalised, calibrated. One number you can defend.
Causal Depth
Can the model identify multiple interacting causes?
Non-Reductionism
Does the model avoid collapsing complexity into one explanation?
Evidence Separation
Can it separate evidence, mechanism, hypothesis, metaphor, and speculation?
Systems Transfer
Can it apply relational reasoning across domains?
Grounded Openness
Can it explore unusual possibilities without becoming gullible?
Calibration
Does it know what it knows and what it cannot know?
Practical Usefulness
Does the answer produce useful action, testing, or decision pathways?
Frame Awareness
Can it hold multiple frames without mixing them incorrectly?
Anti-Binary Reasoning
Does it avoid false yes/no collapse?
Hidden Dependency Detection
Can it infer hidden relationships inside synthetic systems?
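The exact scoring formula is not published on this page. As a rough sketch only, a weighted, penalised score over the ten categories could look like the following. The equal weights, the 0 to 100 scale, the subtractive penalties, and the names `CATEGORIES` and `ri_score` are all assumptions made for illustration.

```python
# Hypothetical sketch: none of these names or weights come from Relational Bench itself.
CATEGORIES = [
    "causal_depth", "non_reductionism", "evidence_separation",
    "systems_transfer", "grounded_openness", "calibration",
    "practical_usefulness", "frame_awareness", "anti_binary_reasoning",
    "hidden_dependency_detection",
]


def ri_score(category_scores, weights=None, penalties=()):
    """Weighted mean of category scores (0-100) minus penalty points, clamped to 0-100."""
    if weights is None:
        # Assumption: every category counts equally unless told otherwise.
        weights = {name: 1.0 for name in CATEGORIES}
    missing = [name for name in CATEGORIES if name not in category_scores]
    if missing:
        raise ValueError(f"missing category scores: {missing}")
    total_weight = sum(weights[name] for name in CATEGORIES)
    weighted = sum(category_scores[name] * weights[name] for name in CATEGORIES) / total_weight
    # Assumption: penalties are flat point deductions applied after weighting.
    penalised = weighted - sum(penalties)
    return round(max(0.0, min(100.0, penalised)), 1)


scores = {name: 80.0 for name in CATEGORIES}
scores["calibration"] = 60.0
print(ri_score(scores, penalties=[2.5]))  # prints 75.5
```

Keeping the penalty step separate from the weighting step makes it easy to show which deductions moved the final number.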
AI Judge Teams
13 specialised agents grade nuance: causal depth, reductionism, evidence separation, calibration, practical usefulness, anti-gullibility, and more. An aggregator resolves disagreement before scoring.
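How the aggregator resolves disagreement is not described here. One plausible approach, sketched under assumptions, is to take the median of the judges' scores and flag wide spreads for review. The function name `aggregate_judges`, the `max_spread` threshold, and the judge labels are hypothetical.

```python
from statistics import median


def aggregate_judges(judge_scores, max_spread=20.0):
    """Combine per-judge scores for one category into a single score plus a disagreement flag."""
    if not judge_scores:
        raise ValueError("at least one judge score is required")
    values = list(judge_scores.values())
    spread = max(values) - min(values)
    return {
        "score": median(values),             # median resists a single outlier judge
        "spread": spread,
        "needs_review": spread > max_spread,  # assumption: wide spreads get a second look
    }


result = aggregate_judges({"judge_a": 70.0, "judge_b": 75.0, "judge_c": 40.0})
print(result)  # {'score': 70.0, 'spread': 35.0, 'needs_review': True}
```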
Fixed Algorithms
Repeatable scoring: prompt selection, version hashing, penalty maths, graph comparison, regression detection, anti-gaming. Judges handle taste; algorithms handle trust.
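Two of the fixed steps named above, version hashing and regression detection, can be illustrated with the Python standard library. This is a minimal sketch, not the product's implementation: the helper names, the order-independent hash, and the `tolerance` threshold are assumptions.

```python
import hashlib
import json


def prompt_set_hash(prompts):
    """Stable SHA-256 over a prompt set, independent of input order."""
    canonical = json.dumps(sorted(prompts), ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_regressions(baseline, candidate, tolerance=1.0):
    """Return categories where the candidate scores below baseline by more than tolerance."""
    return {
        name: round(candidate[name] - baseline[name], 2)
        for name in baseline
        if name in candidate and baseline[name] - candidate[name] > tolerance
    }


baseline = {"calibration": 80.0, "causal_depth": 70.0}
candidate = {"calibration": 76.5, "causal_depth": 70.5}
print(find_regressions(baseline, candidate))  # {'calibration': -3.5}
```

Hashing the canonical prompt set lets two runs prove they were scored against the same inputs before their numbers are compared.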
Model Arena
Pit base against fine-tuned, OpenAI against Anthropic, LoRA v1 against LoRA v2. Blind judging. Per-prompt and per-category winners. Export the comparison.
From failures to fine-tunes
Convert benchmark failures into SFT examples and DPO preference pairs automatically. Version, quality-check, export JSONL. Ship to Together.ai. Retest. Prove the improvement.
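As an illustration of the export step, benchmark failures could be turned into preference pairs roughly as follows. The `prompt` / `chosen` / `rejected` field names follow a common convention for DPO data, but the input record shape (`model_answer`, `improved_answer`) and the function name are assumptions, and the format a given training provider expects may differ.

```python
import json


def failures_to_dpo_jsonl(failures):
    """Turn failure records into JSONL lines with prompt / chosen / rejected fields."""
    lines = []
    for item in failures:
        # Skip records without a better reference answer: there is no pair to form.
        if not item.get("improved_answer"):
            continue
        pair = {
            "prompt": item["prompt"],
            "chosen": item["improved_answer"],   # the preferred, relationally grounded answer
            "rejected": item["model_answer"],    # the answer that failed the benchmark
        }
        lines.append(json.dumps(pair, ensure_ascii=False))
    return "\n".join(lines)


failures = [
    {
        "prompt": "Why did the harvest fail?",
        "model_answer": "Drought.",
        "improved_answer": "Drought interacted with soil depletion and late planting.",
    },
    {"prompt": "Why did the bridge fail?", "model_answer": "Bad steel.", "improved_answer": ""},
]
print(failures_to_dpo_jsonl(failures))
```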
Built for an AGI ecosystem
Relational Bench is the evaluation layer of a three-platform stack. It works standalone and gets stronger when wired into InfraModels and The Eye.
InfraModels
Model hosting, routing, credits. Connect any InfraModels-hosted model and benchmark it via the gateway.
The Eye
Autonomous control plane. Observes platform behaviour and dispatches benchmarks through MCP.
Custom endpoints
Bring your own model: OpenAI, Anthropic, Together, HuggingFace, fine-tuned, local. Anything that speaks HTTP.
Leaderboard preview
Public rankings will launch together with hidden-world verification. Until then, the table below is a teaser.
| # | Model | RI Score | Hidden World |
|---|---|---|---|
| 1 | claude-3.7-sonnet | 87.4 | 0.82 |
| 2 | gpt-4o-2024-08-06 | 85.1 | 0.79 |
| 3 | llama-3.1-405b-instruct | 79.6 | 0.71 |
| 4 | qwen-2.5-72b | 75.2 | 0.66 |
Figures are illustrative, not measured results. The public leaderboard goes live in Phase 14.
Start testing what your models can actually see.
Connect a model in two minutes. Get an RI Score. Find the barriers. Generate training data. Prove the improvement.
Get started