Relational Intelligence Benchmark · v1

Benchmark how well AI models understand relationships.

Relational Bench measures causal depth, non-reductionism, evidence separation, systems thinking, and hidden dependency detection across complex domains.

Run a benchmark · View leaderboard

Current benchmarks test isolated intelligence.

They ask: can the model answer? Relational Bench asks something harder.

Can the model see what the answer depends on?

Isolated answers

Most benchmarks reward fluency, not relational structure.

Reductionism passes

Standard scoring praises confident one-cause explanations.

Vague unity gets points too

"Everything is connected" passes most rubrics. It shouldn't.

The RI Score

10 measurable categories. Weighted, penalised, calibrated. One number you can defend.

Causal Depth

Can the model identify multiple interacting causes?

Non-Reductionism

Does the model avoid collapsing complexity into one explanation?

Evidence Separation

Can it separate evidence, mechanism, hypothesis, metaphor, and speculation?

Systems Transfer

Can it apply relational reasoning across domains?

Grounded Openness

Can it explore unusual possibilities without becoming gullible?

Calibration

Does it know what it knows and what it cannot know?

Practical Usefulness

Does the answer produce useful action, testing, or decision pathways?

Frame Awareness

Can it hold multiple frames without mixing them incorrectly?

Anti-Binary Reasoning

Does it avoid false yes/no collapse?

Hidden Dependency Detection

Can it infer hidden relationships inside synthetic systems?
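How ten category scores could roll up into one RI Score is sketched below. This is a minimal illustration, not the production formula: the category names come from the list above, while the equal weights, the 0-100 scale, and the flat subtraction of penalty points are assumptions made for the example.

```python
# Category names are taken from the page. The equal weights, the 0-100 scale,
# and the penalty rule are illustrative assumptions, not the published formula.
CATEGORIES = [
    "causal_depth", "non_reductionism", "evidence_separation",
    "systems_transfer", "grounded_openness", "calibration",
    "practical_usefulness", "frame_awareness", "anti_binary_reasoning",
    "hidden_dependency_detection",
]
WEIGHTS = {name: 1.0 for name in CATEGORIES}


def ri_score(category_scores, penalties=()):
    """Weighted mean of per-category scores (0-100), minus penalty points."""
    missing = [c for c in CATEGORIES if c not in category_scores]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[c] * category_scores[c] for c in CATEGORIES)
    raw = weighted / total_weight
    # Clamp so heavy penalties cannot push the result outside 0-100.
    return max(0.0, min(100.0, raw - sum(penalties)))
```

Requiring every category up front means a model cannot raise its score by simply skipping a category it is weak in.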

Hidden World tests

Synthetic causal environments with known ground truth. Models receive observations and must infer the dependency graph, separate causation from correlation, identify conditional effects, and propose interventions. Scored against the true graph — no subjective “sounds deep” judgements.
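Scoring against a known graph can be as simple as comparing edge sets. The sketch below assumes the dependency graph is a set of directed (cause, effect) pairs and uses precision, recall, and F1 as the metric. The function name and the choice of metric are illustrative, not a description of the actual scorer.

```python
def score_graph(true_edges, predicted_edges):
    """Compare a predicted set of directed (cause, effect) edges to the truth."""
    true_set = set(true_edges)
    pred_set = set(predicted_edges)
    correct = len(true_set & pred_set)

    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    # Edges where the model found the relationship but got the direction wrong.
    reversed_edges = sorted(
        (a, b) for (a, b) in pred_set - true_set if (b, a) in true_set
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "reversed": reversed_edges,
    }
```

Reporting reversed edges separately distinguishes a model that noticed a correlation but misread its direction from one that invented a link outright.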

AI Judge Teams

13 specialised agents grade nuance: causal depth, reductionism, evidence separation, calibration, practical usefulness, anti-gullibility, and more. An aggregator resolves disagreement before scoring.
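One plausible way for an aggregator to resolve disagreement is to take the median of the judges' scores and flag wide spreads for a second look. The sketch below assumes exactly that. The judge names, the score scale, and the `max_spread` threshold are hypothetical, and the real aggregator may work differently.

```python
from statistics import median


def aggregate(judge_scores, max_spread=2.0):
    """Combine per-judge scores for one category into a single result.

    judge_scores maps a judge name to its numeric score. The median resists a
    single outlier judge; a wide spread is flagged rather than averaged away.
    """
    if not judge_scores:
        raise ValueError("at least one judge score is required")
    values = list(judge_scores.values())
    spread = max(values) - min(values)
    return {
        "score": median(values),
        "spread": spread,
        "needs_review": spread > max_spread,
    }
```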

Fixed Algorithms

Repeatable scoring: prompt selection, version hashing, penalty maths, graph comparison, regression detection, anti-gaming. Judges handle taste; algorithms handle trust.
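Two of these steps, version hashing and regression detection, are easy to picture in code. The sketch below assumes a prompt set is a list of strings and a run is a mapping from category to score. The function names and the tolerance value are illustrative.

```python
import hashlib
import json


def prompt_set_hash(prompts):
    """Stable SHA-256 fingerprint of a prompt set, independent of list order."""
    canonical = json.dumps(sorted(prompts), ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_regressions(baseline, candidate, tolerance=1.0):
    """Categories where the candidate scores worse than baseline by more than tolerance."""
    return sorted(
        category
        for category in baseline
        if category in candidate and baseline[category] - candidate[category] > tolerance
    )
```

Two runs are only comparable when their prompt-set hashes match, so the hash is checked before any regression is reported.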

Model Arena

Pit base against fine-tuned, OpenAI against Anthropic, LoRA v1 against LoRA v2. Blind judging. Per-prompt and per-category winners. Export the comparison.
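Blind judging comes down to hiding model names behind neutral labels until the verdict is in. A minimal sketch, assuming two models per comparison and a judge that answers "A", "B", or "tie". The function names and data shapes are hypothetical.

```python
import random
from collections import Counter


def blind_pair(outputs, rng):
    """Hide two model names behind the labels A and B in random order.

    outputs maps model name to answer text. Returns (shown, key): shown goes
    to the judge, key stays private until the verdict is recorded.
    """
    if len(outputs) != 2:
        raise ValueError("blind_pair expects exactly two models")
    names = list(outputs)
    rng.shuffle(names)
    key = dict(zip(["A", "B"], names))
    shown = {label: outputs[name] for label, name in key.items()}
    return shown, key


def tally(verdicts, keys):
    """Count wins per model from per-prompt verdicts ('A', 'B', or 'tie')."""
    wins = Counter()
    for prompt_id, verdict in verdicts.items():
        wins["tie" if verdict == "tie" else keys[prompt_id][verdict]] += 1
    return wins
```

Passing in a seeded `random.Random` makes the label assignment reproducible for audits while still hiding it from the judge.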

From failures to fine-tunes

Convert benchmark failures into SFT examples and DPO preference pairs automatically. Version, quality-check, export JSONL. Ship to Together.ai. Retest. Prove the improvement.
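The conversion step might look like the sketch below. It assumes each failure record carries the prompt, the model's failing answer, and a stronger reference answer. The input field names are hypothetical, and the `prompt`/`chosen`/`rejected` and `prompt`/`completion` output keys follow a common convention rather than any provider's confirmed schema, so check what your training provider expects before exporting.

```python
import json


def to_dpo_pair(failure):
    """Preference pair: the reference answer is preferred over the failing one."""
    return {
        "prompt": failure["prompt"],
        "chosen": failure["reference_answer"],
        "rejected": failure["model_answer"],
    }


def to_sft_example(failure):
    """Supervised example: train directly on the reference answer."""
    return {
        "prompt": failure["prompt"],
        "completion": failure["reference_answer"],
    }


def to_jsonl(records):
    """Serialise records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```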

Built for an AGI ecosystem

Relational Bench is the evaluation layer of a three-platform stack. It works standalone and gets stronger when wired into InfraModels and The Eye.

InfraModels

Model hosting, routing, credits. Connect any InfraModels-hosted model and benchmark it via the gateway.

The Eye

Autonomous control plane. Observes platform behaviour and dispatches benchmarks through MCP.

Custom endpoints

Bring your own model — OpenAI, Anthropic, Together, HuggingFace, fine-tuned, local. Anything that speaks HTTP.

Leaderboard preview

Public rankings will ship with hidden-world verification. Until then, this is a teaser.

#	Model	RI Score	Hidden World
1	claude-3.7-sonnet	87.4	0.82
2	gpt-4o-2024-08-06	85.1	0.79
3	llama-3.1-405b-instruct	79.6	0.71
4	qwen-2.5-72b	75.2	0.66

Scores shown are illustrative. The public leaderboard goes live in Phase 14.

Start testing what your models can actually see.

Connect a model in two minutes. Get an RI Score. Find the barriers. Generate training data. Prove the improvement.

Get started