The headline numbers, and what they mean for you.
Does adding the Lumenais intelligence layer actually make the AI smarter? We ran 56 real prompts through both systems and measured the difference.
Reasoning quality
+48.6%
Tested on 56 real prompts. The AI chose better lines of reasoning and gave more useful answers.
Grounding fit
100%
Answers stayed on-topic and respected the user's actual constraints — up from 94.6% to perfect.
Task selection
+0.40
On 30 curated selection prompts, the system improved from 0.00 to 0.40 in choosing the right reasoning family.
Exact correctness
100%
On 24 multiple-choice questions across math, science, and algorithms, both systems scored 100%. Adding reasoning didn't break correctness.
How we measured it
Live reasoning benchmark
Reasoning quality
+48.6%
0.3740 to 0.5556
Steering usefulness
0.3857
0.0125 to 0.3857
Grounding fit
100%
0.9464 to 1.0000
Sample: 56 live prompts
Measures whether the companion chooses a more useful line of thought under live conditions while staying grounded.
Average uplift across a 56-prompt suite; not a claim that every prompt improves equally. The arithmetic behind the headline figure is worked through below.
Evidence package
logs/website_benchmark_suite_v2.json
logs/website_benchmark_suite_v2.md
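For readers who want to check the headline: +48.6% is the relative change between the two published suite means. A minimal sketch, using only the two means shown above (the per-prompt framing is an assumption about how such a mean is formed):

```python
# Minimal check of the headline lift from the two published suite means.
baseline_mean = 0.3740   # mean reasoning-quality score, baseline system
layered_mean = 0.5556    # mean score with the Lumenais intelligence layer

relative_lift = (layered_mean - baseline_mean) / baseline_mean
print(f"Relative lift: {relative_lift:+.1%}")  # -> +48.6%
```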
Exact correctness floor
Exact correctness
100%
100% to 100%
Sample: 24 deterministic tasks
Checks that the system preserves or improves closed-form correctness while adding reasoning scaffolding.
This is a deterministic callable-backed multiple-choice safety-floor metric, not an open-form reasoning benchmark; a minimal scoring sketch follows this card.
Evidence package
logs/website_benchmark_suite_v2_exact.md
logs/website_benchmark_suite_v2_methodology.md
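The harness itself is not reproduced on this page, so the following is only a sketch of what callable-backed grading means: the gold answer is computed by code rather than hand-labeled, and the system's letter choice must match it exactly. The task structure and grade() helper are illustrative assumptions, not the actual harness:

```python
# Sketch of a callable-backed exact-correctness check.
from typing import Callable

def grade(tasks: list[dict], answer: Callable[[str], str]) -> float:
    """Fraction of tasks where the system's letter matches the
    deterministically computed gold letter."""
    correct = 0
    for task in tasks:
        gold = task["solver"]()          # callable computes the true answer
        if answer(task["prompt"]) == gold:
            correct += 1
    return correct / len(tasks)

# Example task: the gold letter is computed, not hand-labeled.
tasks = [{
    "prompt": "What is 17 * 3? (A) 41 (B) 51 (C) 61",
    "solver": lambda: "B" if 17 * 3 == 51 else "?",
}]
print(grade(tasks, lambda prompt: "B"))  # -> 1.0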
Task selection
Task selection
+0.40
0.00 to 0.40
Sample: 30 approved-gold prompts
Measures whether the system chooses a more useful reasoning family before answering.
Curated approved-gold lens-family benchmark; not an open-world classifier claim.
Evidence package
logs/website_benchmark_suite_v2_task_selection.md
logs/website_benchmark_suite_v2_methodology.md
Semantic grounding proxy
Artifact class accuracy
1.00
Prompt-family accuracy
1.00
Sample: 16 ambiguity-control cases
Checks that ambiguous prompts stay in the right semantic universe instead of collapsing into generic or literalized readings.
Narrow ambiguity-control proxy benchmark; supporting evidence, not the headline reasoning claim.
Evidence package
logs/website_benchmark_suite_v2_semantic_grounding.md
logs/website_benchmark_suite_v2_methodology.md
Where it wins
Ambiguous Named Concept
0.3411 → 0.5872
When a prompt uses a poetic or metaphorical name, the system keeps it in the right conceptual frame instead of interpreting it literally.
Mathematical Strategy
0.3793 → 0.5536
On math problems, the system picks stronger proof strategies and gives clearer next-step guidance.
Operational Tradeoff
0.4115 → 0.5754
For real-world trade-off decisions, the system identifies the variable that actually matters instead of listing generic pros and cons.
UI System Design
0.3689 → 0.5782
On design problems, the system finds the real implementation decision point instead of writing a generic architecture overview.
General Companion
0.3370 → 0.4843
Under emotional pressure, the system gives practical strategies instead of vague reassurance.
Scientific Mechanism
0.3799 → 0.5809
On science questions, the system frames mechanisms more precisely and distinguishes between competing experimental approaches.
Ambiguous Abstract
0.4003 → 0.5293
For abstract or philosophical prompts, the system gives substantive framing instead of decorative language.
Transfer & Routing
Cross-domain transfer
Runs
150
Accuracy uplift
~+13 pp
Example delta
~0.79 vs ~0.66
Sample: 5 domain pairs, 150 runs
Shows that learned structure in one domain can improve adjacent domains under governance, rather than requiring per-domain retraining.
Internal UFCT governed-vs-baseline evaluation across curated domain pairs; not a consumer chat benchmark.
Evidence package
aetheris/docs/whitepaper/appendices/APPENDIX_UFCT_GENERAL_LEARNING.md
aetheris/docs/website/General_Learning_Page.md
Tools manifold routing
Real paired events
+3.77 pp
50.94% to 54.72%
Combined benchmark
+5.34 pp
Broad benchmark-scale evaluation
Sample: 53 real paired events; combined benchmark-scale evaluation
Measures whether learned routing improves tool choice compared with a fixed baseline policy.
Real-event significance remains underpowered at n=53; strongest support comes from the broader combined benchmark. A quick power check after this card shows why n=53 cannot resolve a +3.77 pp effect.
Evidence package
aetheris/docs/whitepaper/appendices/APPENDIX_ML_MANIFOLD_LEARNING.md
aetheris/docs/whitepaper/appendices/APPENDIX_E_PERFORMANCE_BENCHMARKS.md
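A rough power calculation makes the underpowered caveat concrete. The 27-of-53 and 29-of-53 counts are inferred from the published percentages and are an assumption; the rest is standard two-proportion arithmetic:

```python
# Why n=53 is underpowered for a +3.77 pp effect.
import math

n = 53
p_base, p_routed = 27 / n, 29 / n          # 50.94% -> 54.72% (inferred counts)
delta = p_routed - p_base                  # +3.77 pp

# Standard error of a single proportion near 50% at n=53:
se = math.sqrt(0.5 * 0.5 / n)
print(f"delta = {delta:+.4f}, one-proportion SE = {se:.4f}")
# SE ~ 0.069: nearly twice the effect size, so the observed lift is
# well inside sampling noise at this n.

# Rough n per arm to detect +3.77 pp at 80% power, alpha = 0.05:
z_alpha, z_beta = 1.96, 0.84
n_needed = (z_alpha + z_beta) ** 2 * 2 * 0.5 * 0.5 / delta ** 2
print(f"n needed per arm ~ {n_needed:.0f}")  # thousands, not 53
```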
Manifold stability
Validation accuracy
>91%
L2 drift band
0.014–0.121
Convergence
1–13 epochs
Sample: Nine trained manifolds
Supports the claim that learning components remain stable enough to deploy under governance.
Training and validation stability evidence for manifolds, not a live companion benchmark; one reading of the drift metric is sketched below.
Evidence package
aetheris/docs/whitepaper/appendices/APPENDIX_G_PATHWAY_SGI.md
aetheris/docs/whitepaper/appendices/APPENDIX_ML_MANIFOLD_LEARNING.md
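A sketch of one plausible reading of the drift figure, assuming "L2 drift" is the L2 norm of the parameter delta between successive snapshots of a trained manifold. The weight vectors below are placeholders, not real manifold parameters:

```python
# One reading of the L2 drift metric (assumed definition).
import math

def l2_drift(before: list[float], after: list[float]) -> float:
    """L2 norm of the parameter delta between two snapshots."""
    return math.sqrt(sum((b - a) ** 2 for b, a in zip(before, after)))

snapshot_t0 = [0.12, -0.40, 0.33, 0.08]   # hypothetical weights
snapshot_t1 = [0.13, -0.38, 0.30, 0.09]   # after further training
print(f"L2 drift: {l2_drift(snapshot_t0, snapshot_t1):.3f}")  # -> 0.039
# A stable manifold keeps this inside a narrow band (0.014-0.121 here).
```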
Mesh sharding speed
Mean speedup
2.74x
CI95
2.66x–2.83x
Queries
10
Sample: 10 benchmark queries
Shows that the mesh can materially reduce wall-clock time for sharded synthesis workloads.
Measures orchestration and distributed execution speed for a specific sharded synthesis workload, not model quality; the shape of the confidence interval is sketched after this card.
Evidence package
aetheris/docs/whitepaper/xprize/benchmarks/local_vs_mesh_suite_synthesis_shard_auto.md
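For intuition about how a band like 2.66x–2.83x arises from only 10 queries, here is a standard t-interval over per-query speedup ratios. The ratios below are hypothetical placeholders, not the published per-query data; only the n=10, mean 2.74x, and interval shape come from the card:

```python
# t-interval over per-query speedup ratios (placeholder data).
import math

ratios = [2.70, 2.81, 2.66, 2.77, 2.74, 2.79, 2.68, 2.72, 2.76, 2.77]
n = len(ratios)
mean = sum(ratios) / n
sd = math.sqrt(sum((r - mean) ** 2 for r in ratios) / (n - 1))
t_crit = 2.262  # two-sided 95% critical value, df = 9
half = t_crit * sd / math.sqrt(n)
print(f"{mean:.2f}x, CI95 {mean - half:.2f}x-{mean + half:.2f}x")
```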
Research Lab
PIMA Diabetes
AUC
85.3%
Rows
768
Sample: 768 rows
Shows parity-level performance on a clean medical classification benchmark with governance preventing negative transfer.
Dataset-task benchmark for the research platform, not a live companion benchmark.
Evidence package
aetheris/docs/whitepaper/appendices/APPENDIX_QARIN_BENCHMARKS.md
Non-linear stress test
AUC
90.8%
Lift vs linear baseline
+10.5%
Noise filtered
87%
Sample: 1,000 rows, 23 features
Shows autonomous signal detection and noise filtering on a deliberately difficult synthetic benchmark.
Synthetic signal-vs-noise benchmark; illustrates autonomous feature selection, not a production customer metric.
Evidence package
aetheris/docs/whitepaper/appendices/APPENDIX_QARIN_BENCHMARKS.md
Adult Census
AUC
91.1%
Rows
30,162
Features
96
Sample: 30,162 rows, 96 features
Shows robustness on high-dimensional, messy, real-world tabular data.
Dataset-task benchmark for robustness and fallback behavior, not a live companion benchmark.
Evidence package
aetheris/docs/whitepaper/appendices/APPENDIX_QARIN_BENCHMARKS.md
Symbolic regression
Kepler fit
R² = 1.0
Kepler complexity
4 nodes
Rydberg fit
R² = 1.0
Sample: Standard physics benchmark tasks
Shows interpretable equation discovery rather than black-box prediction alone.
Physics symbolic-regression benchmark; demonstrates the research pipeline, not the consumer companion. A worked Kepler check follows this card.
Evidence package
aetheris/docs/whitepaper/page.tsx
aetheris/docs/whitepaper/appendices/APPENDIX_PLATFORM_ARCHITECTURE.md
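A worked check of the Kepler fit, assuming the 4-node tree is something like sqrt(a**3), i.e. T = a^1.5 in AU/year units. The planetary values are standard solar-system data; the tree shape is our assumption about the discovered expression:

```python
# Kepler's third law: a 4-node tree such as sqrt(a**3) fits exactly.
a = [0.387, 0.723, 1.000, 1.524, 5.203]   # semi-major axis, AU
T = [0.241, 0.615, 1.000, 1.881, 11.862]  # orbital period, years

pred = [ai ** 1.5 for ai in a]
ss_res = sum((t - p) ** 2 for t, p in zip(T, pred))
mean_T = sum(T) / len(T)
ss_tot = sum((t - mean_T) ** 2 for t in T)
print(f"R^2 = {1 - ss_res / ss_tot:.4f}")  # -> 1.0000 to 4 decimals
```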
Alzheimer’s biomarker discovery
Validation AUC
0.855
Samples
2,004
Brain regions
19
Sample: 2,004 samples, 19 regions
Shows structured discovery on a real biological dataset with literature-grounded marker interpretation.
Scientific discovery benchmark on curated transcriptomics data; not a live companion eval.
Evidence package
aetheris/docs/whitepaper/LUMENAIS_INTERNAL_TECHNICAL_WHITEPAPER_v1.0.md
aetheris/docs/whitepaper/xprize/XPRIZE_PHASE_II_SUBMISSION.md
FieldHash & Provenance
FieldHash hardening closure
Standard profile
15/800
1.875%
Hardened profile
0/800
Sample: 800 trials per profile
Shows that hardening materially closed a measured attack family rather than relying on a generic security narrative.
Attack-family measurement on a specific adversarial synthesis benchmark; not a universal security guarantee.
Evidence package
aetheris/docs/whitepaper/xprize/fieldhash/FIELDHASH_PUBLIC_TECHNICAL_BRIEF_2026-02-17.md
aetheris/docs/whitepaper/xprize/fieldhash/ADVERSARIAL_HARDENING_ADDENDUM_2026-02-17.md
FieldHash production-gated adaptive campaign
Production-gated acceptance
0/5000
Wilson 95% upper bound
0.0768%
Sample: 5,000 trials per tested model
Shows that the production-gated path held under stronger adaptive attacks than the policy-only path.
Per-tested-model result under the documented production-gated verifier and no-signing-key assumption; not an absolute impossibility claim. The Wilson bound is reproduced below.
Evidence package
aetheris/docs/whitepaper/xprize/fieldhash/FIELDHASH_PUBLIC_TECHNICAL_BRIEF_2026-02-17.md
aetheris/docs/whitepaper/xprize/fieldhash/ADAPTIVE_ML_SPOOFING_CAMPAIGN_2026-02-17.md
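The Wilson upper bound is reproducible directly from the published counts (0 acceptances in 5,000 trials), with no assumptions beyond k = 0 and n = 5000:

```python
# Reproducing the Wilson 95% upper bound for 0 accepted spoofs in 5,000 trials.
import math

k, n, z = 0, 5000, 1.96
p_hat = k / n
center = p_hat + z * z / (2 * n)
margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
upper = (center + margin) / (1 + z * z / n)
print(f"Wilson 95% upper bound: {upper:.4%}")  # -> 0.0768%
```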
Scientific Caveats
Not AGI: These metrics measure reasoning quality and constraint adherence, not general consciousness or broad artificial general intelligence.
Sample Size: These benchmarks are broad enough to be meaningful, but they still represent evaluated slices rather than every possible workload.
Mean Lift: +48.6% reasoning lift is a mean improvement across the 56-prompt benchmark set. Individual prompts may show higher or lower improvement.
Scope: Exact correctness is a deterministic safety-floor check, and semantic grounding is a focused ambiguity-control proxy rather than the main headline claim.
Run the suite.
The public evidence page summarizes the current benchmark artifacts. Organizations can request access, the whitepaper, and deeper technical review materials.
Request access
Ready to build?
This page is the evidence. The whitepaper explains the architecture behind it. The case studies show what the reasoning looks like in practice.