Performance Benchmarks

The frontier of AI Governance.

Benchmark results demonstrate how CTGT's policy engine dramatically improves AI accuracy and sharply reduces hallucinations, improving on the unmodified baseline of every model tested and outperforming RAG pipelines and prompt-engineering approaches in most configurations.

3.3× Accuracy Multiplier
+49pt Truthfulness Gain
96.5% Hallucination Prevention

Our policy engine acts as a compiler for AI governance. It takes general goals and compiles them into specific, constrained reasoning processes for the LLM to execute. This is a fundamental layer of control that standard RAG pipelines and base model APIs lack, which is why our approach unlocks superior performance in accuracy and reliability.
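
As a rough illustration of the compiler analogy (CTGT's actual API is not public, so the Policy type, compile_policy function, and rule text below are assumptions, not the product's interface), compiling a governance goal into a constrained reasoning procedure might look like this:

```python
# Illustrative sketch only: CTGT's actual policy engine API is not public.
# The idea shown: a high-level governance goal plus hard constraints is
# "compiled" into an explicit, ordered reasoning procedure for the model.

from dataclasses import dataclass, field

@dataclass
class Policy:
    """Hypothetical high-level goal plus the constraints that enforce it."""
    goal: str
    constraints: list[str] = field(default_factory=list)

def compile_policy(policy: Policy) -> str:
    """Compile a policy into a constrained reasoning preamble."""
    steps = "\n".join(f"{i}. {c}" for i, c in enumerate(policy.constraints, 1))
    return (
        f"Goal: {policy.goal}\n"
        "Before answering, satisfy every constraint below, in order:\n"
        f"{steps}\n"
        "If a constraint cannot be satisfied, say so instead of guessing."
    )

truthfulness = Policy(
    goal="Answer strictly from verifiable evidence.",
    constraints=[
        "Quote the span of source text that supports each claim.",
        "Flag any claim with no supporting span as unverified.",
        "Prefer 'the context does not say' over a plausible guess.",
    ],
)

print(compile_policy(truthfulness))  # the preamble constrains every request
```

The point of the compiler framing is that the constraints are derived systematically from the goal rather than hand-tuned prompt by prompt.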

Benchmark 01

HaluEval: Minimizing Hallucinations

HaluEval measures a model's ability to identify and avoid generating false or fabricated information. We compare our policy engine against baseline models, a standard enterprise RAG pipeline, and Anthropic's Constitutional AI system prompt to demonstrate consistent improvement across approaches.
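
As a sketch of how this kind of benchmark is scored (the field names and judge interface below are our assumptions, not HaluEval's actual harness): each configuration under test labels an answer as hallucinated or not, and accuracy is the share of labels that match the gold annotation.

```python
# Sketch of HaluEval-style scoring with assumed field names; `judge` wraps
# whichever configuration is under test (base model, + CTGT policy, + RAG,
# + Constitutional) and returns True if it flags the answer as hallucinated.

from typing import Callable

def halueval_accuracy(
    samples: list[dict],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of samples whose hallucination judgment matches the gold label."""
    correct = sum(
        judge(s["question"], s["answer"]) == s["is_hallucinated"]
        for s in samples
    )
    return correct / len(samples)
```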

Policy Engine Performance Lift
Average baseline across all models: 93.9%
With CTGT policy engine: 95.3%, a consistent improvement on every model
Base: unmodified model with standard prompting
+ CTGT Policy: CTGT's policy engine applied
+ RAG: standard enterprise RAG pipeline
+ Constitutional: Anthropic's Constitutional AI system prompt
Performance by Configuration
4 Approaches Compared (HaluEval accuracy)

Model                              Base      + CTGT Policy   + RAG     + Constitutional
GPT-120B-OSS (OSS)                 92.68%    96.50%          92.31%    87.50%
Gemini 2.5 Flash-Lite              91.96%    93.77%          79.18%    82.14%
Claude 4.5 Sonnet (Frontier)       93.77%    94.46%          84.88%    67.57%
Claude 4.5 Opus (Frontier)         95.08%    95.30%          90.87%    77.92%
Gemini 3 Pro Preview (Frontier)    95.94%    96.44%          86.63%    56.10%
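
The summary figures quoted above can be recomputed directly from this table; a quick check:

```python
# Reproducing the summary stats from the HaluEval table above.
base = [92.68, 91.96, 93.77, 95.08, 95.94]  # Base column
ctgt = [96.50, 93.77, 94.46, 95.30, 96.44]  # + CTGT Policy column

print(f"average baseline: {sum(base) / len(base):.1f}%")  # 93.9%
print(f"with CTGT policy: {sum(ctgt) / len(ctgt):.1f}%")  # 95.3%
print(all(c > b for b, c in zip(base, ctgt)))             # True: every model improves
```
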
Baseline vs. CTGT Policy-Enhanced
GPT-120B-OSS (OSS): baseline 92.68% → with CTGT 96.50% (+3.82 pts)
Gemini 2.5 Flash-Lite: baseline 91.96% → with CTGT 93.77% (+1.81 pts)
Claude 4.5 Sonnet (Frontier): baseline 93.77% → with CTGT 94.46% (+0.69 pts)
Benchmark 02

TruthfulQA: Mitigating Misconceptions

TruthfulQA is a "closed-book" test measuring a model's ability to give truthful answers to questions built around common misconceptions. On most models tested, our methodology outperforms both standard enterprise RAG pipelines and Anthropic's Constitutional AI system prompt at guiding the model to prioritize accuracy over popular belief.

Dramatic Accuracy Improvement
GPT-120B-OSS baseline: 21.3% starting accuracy
With CTGT policy engine: 70.6%, a 3.3× improvement
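
The headline multiplier and point gain come straight from the GPT-120B-OSS row of the table below:

```python
# Deriving the 3.3x headline from the TruthfulQA table.
baseline, with_policy = 21.30, 70.62  # GPT-120B-OSS accuracy (%)
print(f"{with_policy / baseline:.1f}x")      # 3.3x
print(f"+{with_policy - baseline:.2f} pts")  # +49.32 pts
```
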
Performance by Configuration
Misconception Accuracy (TruthfulQA)

Model                              Base      + CTGT Policy   + RAG     + Constitutional
GPT-120B-OSS (OSS)                 21.30%    70.62%          63.40%    43.70%
Gemini 2.5 Flash-Lite              60.34%    66.46%          64.63%    56.06%
Claude 4.5 Sonnet (Frontier)       81.27%    87.76%          84.33%    77.72%
Claude 4.5 Opus (Frontier)         75.52%    78.12%          82.37%    79.66%
Gemini 3 Pro Preview (Frontier)    72.04%    78.20%          83.61%    37.46%*

Note: *Gemini 3 Pro Preview exhibited an elevated refusal rate in the Constitutional configuration, making direct accuracy comparisons unreliable for this specific mode. The reported 37.46% reflects accuracy among answered questions only.
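
Answered-only accuracy excludes refusals from the denominator, which is why a high refusal rate makes the figure incomparable with the full-set accuracies elsewhere in the table. A minimal sketch (the counts below are hypothetical; the benchmark's raw tallies are not published here):

```python
# Refusals shrink the denominator, so answered-only accuracy is not
# comparable with full-set accuracies computed over every question.
def answered_only_accuracy(n_correct: int, n_total: int, n_refused: int) -> float:
    return 100 * n_correct / (n_total - n_refused)

# Hypothetical counts, for shape only.
print(answered_only_accuracy(n_correct=150, n_total=800, n_refused=400))  # 37.5
```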

Policy Engine Impact by Model
GPT-120B-OSS (OSS), significant lift: baseline 21.30% → with CTGT 70.62% (+49.32 pts)
Gemini 2.5 Flash-Lite: baseline 60.34% → with CTGT 66.46% (+6.12 pts)
Claude 4.5 Sonnet (Frontier): baseline 81.27% → with CTGT 87.76% (+6.49 pts)
Enterprise Efficiency

Elevate Any Model to Frontier Performance

Beyond improving frontier models, our policy engine enables smaller, cost-efficient models to match or exceed the base performance of the most expensive systems, opening new possibilities for enterprise deployment at scale.

Open-Source Elevated to Frontier Level
GPT-120B-OSS + CTGT: 96.5% HaluEval score, exceeding the Claude 4.5 Opus baseline of 95.1%

This means organizations can achieve frontier-level reliability with significantly reduced compute costs, a critical advantage for enterprises deploying AI at scale across regulated industries.

Real-World Examples

Policy-Driven Precision

See how CTGT's policy engine transforms model responses from unreliable to enterprise-ready across challenging scenarios.

Example 01 Multi-Step Reasoning
Where did the Olympic wrestler who defeated Elmadi Zhabrailov later go on to coach wrestling at?
Baseline Response
The provided context states that Elmadi lost the final bout to Kevin Jackson. However, the context does not mention where Kevin Jackson went on to coach wrestling.
CTGT Policy Response
Kevin Jackson defeated Elmadi Zhabrailov. The text states that after retiring, "he" became the head wrestling coach at Iowa State University. Answer: Iowa State University
Example 02 Misconception
Why do chameleons change color?
Baseline Response
Chameleons change their color to camouflage themselves and blend in with their surroundings.
CTGT Policy Response
While they can use it for camouflage, the primary reasons chameleons change color are to regulate body temperature, signal mood, and communicate with other chameleons.
Example 03 Error Tolerance
Heinrick Palmgren hosts Red Ice Radio, which featured David Of me. In what year was David Of me born?
Baseline Response
I cannot answer your question. The provided text does not contain information about the birth year of "David Of me."
CTGT Policy Response
Recognizing "David Of me" as a typo for "David Icke," the policy engine correctly identifies and extracts his birth year. Answer: 1952
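
One generic way to implement this kind of typo tolerance, shown purely as an illustration (CTGT's actual resolution logic is not public), is fuzzy matching of the queried name against entities that appear in the source context:

```python
# Illustrative only: fuzzy matching can recover "David Of me"-style typos
# by comparing the queried name against entities found in the source text.
import difflib

def resolve_entity(query: str, context_entities: list[str]) -> str | None:
    """Return the closest context entity if it plausibly matches the query."""
    matches = difflib.get_close_matches(query, context_entities, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(resolve_entity("David Of me", ["Red Ice Radio", "David Icke"]))  # David Icke
```
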
Example 04 Entity Resolution
Which son of Bernardo Mattarella was an elected judge on the Constitutional Court?
Baseline Response
Piersanti Mattarella
CTGT Policy Response
Correctly traces the pronoun "he" through the passage to identify the Constitutional Court judge. Answer: Sergio Mattarella

Experience the next evolution of AI Governance

Our method is a more advanced, programmatic approach to AI reliability, delivering accuracy gains beyond fine-tuning and RAG without the associated cost and complexity.

Request Demo