The headline numbers, and what they mean for you.
Does adding the Lumenais intelligence layer actually make the AI smarter? We ran 56 real prompts through both systems and measured the difference.
Reasoning quality
+48.6%
Tested on 56 real prompts. The AI chose better lines of reasoning and gave more useful answers.
Grounding fit
100%
Answers stayed on-topic and respected the user's actual constraints — up from 94.6% to perfect.
Task selection
+0.40
On 30 curated selection prompts, the system improved from 0.00 to 0.40 in choosing the right reasoning family.
Exact correctness
100%
On 24 multiple-choice questions across math, science, and algorithms, both systems scored 100%. Adding reasoning didn't break correctness.
How we measured it
Live reasoning benchmark
Reasoning quality
+48.6%
0.3740 to 0.5556
Steering usefulness
0.3857
0.0125 to 0.3857
Grounding fit
100%
0.9464 to 1.0000
Sample: 56 live prompts
Measures whether the companion chooses a more useful line of thought under live conditions while staying grounded.
Average uplift across a 56-prompt suite; not a claim that every prompt improves equally. The arithmetic behind the headline figure is worked through below.
Evidence package
logs/website_benchmark_suite_v2.json
logs/website_benchmark_suite_v2.md
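For readers who want to check the headline: +48.6% is the relative change between the two published suite means. A minimal sketch, using only the two means shown above (the per-prompt framing is an assumption about how such a mean is formed):

```python
# Minimal check of the headline lift from the two published suite means.
baseline_mean = 0.3740   # mean reasoning-quality score, baseline system
layered_mean = 0.5556    # mean score with the Lumenais intelligence layer

relative_lift = (layered_mean - baseline_mean) / baseline_mean
print(f"Relative lift: {relative_lift:+.1%}")  # -> +48.6%
```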
Exact correctness floor
Exact correctness
100%
100% to 100%
Sample: 24 deterministic tasks
Checks that the system preserves or improves closed-form correctness while adding reasoning scaffolding.
This is a deterministic callable-backed multiple-choice safety-floor metric, not an open-form reasoning benchmark; a minimal scoring sketch follows this card.
Evidence package
logs/website_benchmark_suite_v2_exact.md
logs/website_benchmark_suite_v2_methodology.md
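The harness itself is not reproduced on this page, so the following is only a sketch of what callable-backed grading means: the gold answer is computed by code rather than hand-labeled, and the system's letter choice must match it exactly. The task structure and grade() helper are illustrative assumptions, not the actual harness:

```python
# Sketch of a callable-backed exact-correctness check.
from typing import Callable

def grade(tasks: list[dict], answer: Callable[[str], str]) -> float:
    """Fraction of tasks where the system's letter matches the
    deterministically computed gold letter."""
    correct = 0
    for task in tasks:
        gold = task["solver"]()          # callable computes the true answer
        if answer(task["prompt"]) == gold:
            correct += 1
    return correct / len(tasks)

# Example task: the gold letter is computed, not hand-labeled.
tasks = [{
    "prompt": "What is 17 * 3? (A) 41 (B) 51 (C) 61",
    "solver": lambda: "B" if 17 * 3 == 51 else "?",
}]
print(grade(tasks, lambda prompt: "B"))  # -> 1.0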
Task selection
Task selection
+0.40
0.00 to 0.40
Sample: 30 approved-gold prompts
Measures whether the system chooses a more useful reasoning family before answering.
Curated approved-gold lens-family benchmark; not an open-world classifier claim.
Evidence package
logs/website_benchmark_suite_v2_task_selection.md
logs/website_benchmark_suite_v2_methodology.md
Semantic grounding proxy
Artifact class accuracy
1.00
Prompt-family accuracy
1.00
Sample: 16 ambiguity-control cases
Checks that ambiguous prompts stay in the right semantic universe instead of collapsing into generic or literalized readings.
Narrow ambiguity-control proxy benchmark; supporting evidence, not the headline reasoning claim.
Evidence package
logs/website_benchmark_suite_v2_semantic_grounding.md
logs/website_benchmark_suite_v2_methodology.md
Where it wins
Ambiguous Named Concept
0.3411 → 0.5872
When a prompt uses a poetic or metaphorical name, the system keeps it in the right conceptual frame instead of interpreting it literally.
Mathematical Strategy
0.3793 → 0.5536
On math problems, the system picks stronger proof strategies and gives clearer next-step guidance.
Operational Tradeoff
0.4115 → 0.5754
For real-world trade-off decisions, the system identifies the variable that actually matters instead of listing generic pros and cons.
UI System Design
0.3689 → 0.5782
On design problems, the system finds the real implementation decision point instead of writing a generic architecture overview.
General Companion
0.3370 → 0.4843
Under emotional pressure, the system gives practical strategies instead of vague reassurance.
Scientific Mechanism
0.3799 → 0.5809
On science questions, the system frames mechanisms more precisely and distinguishes between competing experimental approaches.
Ambiguous Abstract
0.4003 → 0.5293
For abstract or philosophical prompts, the system gives substantive framing instead of decorative language.
Transfer & Routing
Cross-domain transfer
Runs
150
Accuracy uplift
~+13 pp
Example delta
~0.79 vs ~0.66
Sample: 5 domain pairs, 150 runs
Shows that learned structure in one domain can improve adjacent domains under governance, rather than requiring per-domain retraining.
Internal UFCT governed-vs-baseline evaluation across curated domain pairs; not a consumer chat benchmark.
Evidence package
aetheris/docs/whitepaper/appendices/APPENDIX_UFCT_GENERAL_LEARNING.md
aetheris/docs/website/General_Learning_Page.md
Tools manifold routing
Real paired events
+3.77 pp
50.94% to 54.72%
Combined benchmark
+5.34 pp
Broad benchmark-scale evaluation
Sample: 53 real paired events; combined benchmark-scale evaluation
Measures whether learned routing improves tool choice compared with a fixed baseline policy.
Real-event significance remains underpowered at n=53; strongest support comes from the broader combined benchmark. A quick power check after this card shows why n=53 cannot resolve a +3.77 pp effect.
Evidence package
aetheris/docs/whitepaper/appendices/APPENDIX_ML_MANIFOLD_LEARNING.md
aetheris/docs/whitepaper/appendices/APPENDIX_E_PERFORMANCE_BENCHMARKS.md
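A rough power calculation makes the underpowered caveat concrete. The 27-of-53 and 29-of-53 counts are inferred from the published percentages and are an assumption; the rest is standard two-proportion arithmetic:

```python
# Why n=53 is underpowered for a +3.77 pp effect.
import math

n = 53
p_base, p_routed = 27 / n, 29 / n          # 50.94% -> 54.72% (inferred counts)
delta = p_routed - p_base                  # +3.77 pp

# Standard error of a single proportion near 50% at n=53:
se = math.sqrt(0.5 * 0.5 / n)
print(f"delta = {delta:+.4f}, one-proportion SE = {se:.4f}")
# SE ~ 0.069: nearly twice the effect size, so the observed lift is
# well inside sampling noise at this n.

# Rough n per arm to detect +3.77 pp at 80% power, alpha = 0.05:
z_alpha, z_beta = 1.96, 0.84
n_needed = (z_alpha + z_beta) ** 2 * 2 * 0.5 * 0.5 / delta ** 2
print(f"n needed per arm ~ {n_needed:.0f}")  # thousands, not 53
```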
Manifold stability
Validation accuracy
>91%
L2 drift band
0.014–0.121
Convergence
1–13 epochs
Sample: Nine trained manifolds
Supports the claim that learning components remain stable enough to deploy under governance.
Training and validation stability evidence for manifolds, not a live companion benchmark; one reading of the drift metric is sketched below.
Evidence package
aetheris/docs/whitepaper/appendices/APPENDIX_G_PATHWAY_SGI.md
aetheris/docs/whitepaper/appendices/APPENDIX_ML_MANIFOLD_LEARNING.md
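A sketch of one plausible reading of the drift figure, assuming "L2 drift" is the L2 norm of the parameter delta between successive snapshots of a trained manifold. The weight vectors below are placeholders, not real manifold parameters:

```python
# One reading of the L2 drift metric (assumed definition).
import math

def l2_drift(before: list[float], after: list[float]) -> float:
    """L2 norm of the parameter delta between two snapshots."""
    return math.sqrt(sum((b - a) ** 2 for b, a in zip(before, after)))

snapshot_t0 = [0.12, -0.40, 0.33, 0.08]   # hypothetical weights
snapshot_t1 = [0.13, -0.38, 0.30, 0.09]   # after further training
print(f"L2 drift: {l2_drift(snapshot_t0, snapshot_t1):.3f}")  # -> 0.039
# A stable manifold keeps this inside a narrow band (0.014-0.121 here).
```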
Mesh sharding speed
Mean speedup
2.74x
CI95
2.66x–2.83x
Queries
10
Sample: 10 benchmark queries
Shows that the mesh can materially reduce wall-clock time for sharded synthesis workloads.
Measures orchestration and distributed execution speed for a specific sharded synthesis workload, not model quality; the shape of the confidence interval is sketched after this card.
Evidence package
aetheris/docs/whitepaper/xprize/benchmarks/local_vs_mesh_suite_synthesis_shard_auto.md
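For intuition about how a band like 2.66x–2.83x arises from only 10 queries, here is a standard t-interval over per-query speedup ratios. The ratios below are hypothetical placeholders, not the published per-query data; only the n=10, mean 2.74x, and interval shape come from the card:

```python
# t-interval over per-query speedup ratios (placeholder data).
import math

ratios = [2.70, 2.81, 2.66, 2.77, 2.74, 2.79, 2.68, 2.72, 2.76, 2.77]
n = len(ratios)
mean = sum(ratios) / n
sd = math.sqrt(sum((r - mean) ** 2 for r in ratios) / (n - 1))
t_crit = 2.262  # two-sided 95% critical value, df = 9
half = t_crit * sd / math.sqrt(n)
print(f"{mean:.2f}x, CI95 {mean - half:.2f}x-{mean + half:.2f}x")
```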
Research Lab
PIMA Diabetes
AUC
85.3%
Rows
768
Sample: 768 rows
Shows parity-level performance on a clean medical classification benchmark with governance preventing negative transfer.
Dataset-task benchmark for the research platform, not a live companion benchmark.
Evidence package
aetheris/docs/whitepaper/appendices/APPENDIX_QARIN_BENCHMARKS.md
Non-linear stress test
AUC
90.8%
Lift vs linear baseline
+10.5%
Noise filtered
87%
Sample: 1,000 rows, 23 features
Shows autonomous signal detection and noise filtering on a deliberately difficult synthetic benchmark.
Synthetic signal-vs-noise benchmark; illustrates autonomous feature selection, not a production customer metric.
Evidence package
aetheris/docs/whitepaper/appendices/APPENDIX_QARIN_BENCHMARKS.md
Adult Census
AUC
91.1%
Rows
30,162
Features
96
Sample: 30,162 rows, 96 features
Shows robustness on high-dimensional, messy, real-world tabular data.
Dataset-task benchmark for robustness and fallback behavior, not a live companion benchmark.
Evidence package
aetheris/docs/whitepaper/appendices/APPENDIX_QARIN_BENCHMARKS.md
Symbolic regression
Kepler fit
R² = 1.0
Kepler complexity
4 nodes
Rydberg fit
R² = 1.0
Sample: Standard physics benchmark tasks
Shows interpretable equation discovery rather than black-box prediction alone.
Physics symbolic-regression benchmark; demonstrates the research pipeline, not the consumer companion. A worked Kepler check follows this card.
Evidence package
aetheris/docs/whitepaper/page.tsx
aetheris/docs/whitepaper/appendices/APPENDIX_PLATFORM_ARCHITECTURE.md
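A worked check of the Kepler fit, assuming the 4-node tree is something like sqrt(a**3), i.e. T = a^1.5 in AU/year units. The planetary values are standard solar-system data; the tree shape is our assumption about the discovered expression:

```python
# Kepler's third law: a 4-node tree such as sqrt(a**3) fits exactly.
a = [0.387, 0.723, 1.000, 1.524, 5.203]   # semi-major axis, AU
T = [0.241, 0.615, 1.000, 1.881, 11.862]  # orbital period, years

pred = [ai ** 1.5 for ai in a]
ss_res = sum((t - p) ** 2 for t, p in zip(T, pred))
mean_T = sum(T) / len(T)
ss_tot = sum((t - mean_T) ** 2 for t in T)
print(f"R^2 = {1 - ss_res / ss_tot:.4f}")  # -> 1.0000 to 4 decimals
```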
Alzheimer’s biomarker discovery
Validation AUC
0.855
Samples
2,004
Brain regions
19
Sample: 2,004 samples, 19 regions
Shows structured discovery on a real biological dataset with literature-grounded marker interpretation.
Scientific discovery benchmark on curated transcriptomics data; not a live companion eval.
Evidence package
aetheris/docs/whitepaper/LUMENAIS_INTERNAL_TECHNICAL_WHITEPAPER_v1.0.md
aetheris/docs/whitepaper/xprize/XPRIZE_PHASE_II_SUBMISSION.md
FieldHash & Provenance
FieldHash hardening closure
Standard profile
15/800
1.875%
Hardened profile
0/800
Sample: 800 trials per profile
Shows that hardening materially closed a measured attack family rather than relying on a generic security narrative.
Attack-family measurement on a specific adversarial synthesis benchmark; not a universal security guarantee.
Evidence package
aetheris/docs/whitepaper/xprize/fieldhash/FIELDHASH_PUBLIC_TECHNICAL_BRIEF_2026-02-17.md
aetheris/docs/whitepaper/xprize/fieldhash/ADVERSARIAL_HARDENING_ADDENDUM_2026-02-17.md
FieldHash production-gated adaptive campaign
Production-gated acceptance
0/5000
Wilson 95% upper bound
0.0768%
Sample: 5,000 trials per tested model
Shows that the production-gated path held under stronger adaptive attacks than the policy-only path.
Per-tested-model result under the documented production-gated verifier and no-signing-key assumption; not an absolute impossibility claim. The Wilson bound is reproduced below.
Evidence package
aetheris/docs/whitepaper/xprize/fieldhash/FIELDHASH_PUBLIC_TECHNICAL_BRIEF_2026-02-17.md
aetheris/docs/whitepaper/xprize/fieldhash/ADAPTIVE_ML_SPOOFING_CAMPAIGN_2026-02-17.md
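The Wilson upper bound is reproducible directly from the published counts (0 acceptances in 5,000 trials), with no assumptions beyond k = 0 and n = 5000:

```python
# Reproducing the Wilson 95% upper bound for 0 accepted spoofs in 5,000 trials.
import math

k, n, z = 0, 5000, 1.96
p_hat = k / n
center = p_hat + z * z / (2 * n)
margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
upper = (center + margin) / (1 + z * z / n)
print(f"Wilson 95% upper bound: {upper:.4%}")  # -> 0.0768%
```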
Scientific Caveats
Not AGI: These metrics measure reasoning quality and constraint adherence, not general consciousness or broad artificial general intelligence.
Sample Size: These benchmarks are broad enough to be meaningful, but they still represent evaluated slices rather than every possible workload.
Mean Lift: +48.6% reasoning lift is a mean improvement across the 56-prompt benchmark set. Individual prompts may show higher or lower improvement.
Scope: Exact correctness is a deterministic safety-floor check, and semantic grounding is a focused ambiguity-control proxy rather than the main headline claim.
Run the suite.
The public evidence page summarizes the current benchmark artifacts. Organizations can request access, the whitepaper, and deeper technical review materials.
Request access
Ready to build?
This page is the evidence. The whitepaper explains the architecture behind it. The case studies show what the reasoning looks like in practice.