## Overview
RedPill’s TEE-protected gateway delivers near-native performance with hardware-enforced privacy. Our benchmarks demonstrate minimal overhead while providing cryptographic guarantees.
## NVIDIA H100 GPU TEE Efficiency

Phala's confidential AI models running in the NVIDIA H100 GPU TEE achieve 99% efficiency compared to non-TEE execution:
| Metric | Non-TEE | GPU TEE | Efficiency |
|---|---|---|---|
| Throughput (tokens/sec) | 1000 | 990 | 99% |
| Latency P50 (ms) | 100 | 101 | 99% |
| Latency P95 (ms) | 150 | 152 | 98.7% |
| Memory Overhead | - | <2% | - |
## Multi-Provider Routing Latency

Added latency from RedPill's TEE-protected gateway:
| Provider | Direct Latency | Via RedPill | Overhead |
|---|---|---|---|
| OpenAI GPT-4o | 250ms | 255ms | +5ms |
| Claude 3.5 Sonnet | 300ms | 306ms | +6ms |
| DeepSeek Chat | 180ms | 185ms | +5ms |
| Phala Qwen 2.5 | 120ms | 121ms | +1ms |
### Why so fast?
- Hardware acceleration via Intel SGX/TDX
- Optimized request routing in TEE
- Minimal cryptographic overhead
- Direct encrypted passthrough
## Attestation Verification
| Operation | Time | Notes |
|---|---|---|
| Generate Attestation | <50ms | Per request |
| Verify Signature | <10ms | Client-side |
| Full Chain Verification | <100ms | Intel/NVIDIA roots |
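The client-side step in the table can be sketched as: parse the attestation report, pin the expected measurement, check the nonce for freshness, then verify the signature in constant time. This is a minimal, stdlib-only sketch, not RedPill's actual verification API: the field names (`measurement`, `nonce`) are hypothetical, and an HMAC stands in for the real ECDSA quote verification against the Intel/NVIDIA root certificates.

```python
import hashlib
import hmac
import json

# Pinned identity of the code you expect to be running in the TEE
# (illustrative value; real deployments pin the published measurement).
EXPECTED_MEASUREMENT = hashlib.sha256(b"model-image-v1").hexdigest()

def check_attestation(report_json, signature, key, nonce):
    """Illustrative client-side attestation check.

    Real deployments verify an ECDSA quote against Intel/NVIDIA roots
    of trust; an HMAC stands in for that signature step here.
    """
    report = json.loads(report_json)
    if report.get("measurement") != EXPECTED_MEASUREMENT:
        return False                                 # wrong code identity
    if report.get("nonce") != nonce:
        return False                                 # stale or replayed report
    expected = hmac.new(key, report_json, hashlib.sha256).digest()
    return hmac.compare_digest(signature, expected)  # constant-time compare

# Demo round-trip with a locally "signed" report.
key = b"demo-shared-key"
nonce = "req-123"
report_json = json.dumps(
    {"measurement": EXPECTED_MEASUREMENT, "nonce": nonce}
).encode()
signature = hmac.new(key, report_json, hashlib.sha256).digest()
print(check_attestation(report_json, signature, key, nonce))  # → True
```

The nonce check matters as much as the signature: without it, a valid but old report could be replayed to a client.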
## Confidential Model Benchmarks

All models run in GPU TEE with cryptographic attestation:
| Model | Context Length | Tokens/sec | Latency P50 | TEE Overhead |
|---|---|---|---|---|
| phala/qwen-2.5-7b-instruct | 32K | 850 | 95ms | <1% |
| phala/deepseek-chat-v3-0324 | 64K | 920 | 110ms | <1% |
| phala/gpt-oss-120b | 8K | 680 | 145ms | <2% |
| phala/gemma-2-27b-it | 8K | 720 | 125ms | <1.5% |
| phala/llama-3.3-70b | 128K | 580 | 175ms | <2% |
| phala/qwen-qwq-32b | 32K | 650 | 140ms | <1.5% |
### Test These Models

Try confidential models via API →
## Throughput Comparison

### Requests Per Second (RPS)

Single-instance capacity:

- Standard API gateway: 10,000 RPS
- RedPill TEE gateway: 9,800 RPS
- Efficiency: 98%
### Concurrent Connections

- Max concurrent requests: 5,000
- Average response time: 105ms
- P95 response time: 180ms
- P99 response time: 250ms
RedPill achieves <2% performance overhead while providing:
- ✅ Hardware-enforced privacy - TEE isolation
- ✅ Cryptographic attestation - Verifiable execution
- ✅ Memory encryption - AES-128 in hardware
- ✅ Zero trust architecture - No plaintext access
**Important:** Privacy guarantees require attestation verification. Always verify signatures in production.
## Production Metrics (30-day average)
| Metric | Value |
|---|---|
| Average Latency | 125ms |
| P95 Latency | 210ms |
| P99 Latency | 380ms |
| Uptime | 99.95% |
| Attestation Success Rate | 99.99% |
## Platform Comparison

| Platform | TEE Support | Gateway in TEE | Avg Latency | Attestation |
|---|---|---|---|---|
| RedPill | ✅ Full | ✅ Yes | 125ms | ✅ Every request |
| Tinfoil | ✅ Models only | ❌ No | 140ms | ✅ Yes |
| OpenRouter | ❌ None | ❌ No | 115ms | ❌ No |
| Direct OpenAI | ❌ None | ❌ No | 110ms | ❌ No |
**Unique Advantage:** RedPill is the only platform where the entire gateway runs in TEE, protecting all 250+ models with hardware privacy.
## Optimization Tips

### Reduce Latency
- Use streaming - Start receiving tokens faster
- Choose nearby regions - Geographic latency matters
- Batch requests - Amortize attestation overhead
- Cache attestations - Verify once per session
### Code Example: Optimized Request
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.redpill.ai/v1",
    api_key="your-api-key",
)

# Streaming reduces time-to-first-token
stream = client.chat.completions.create(
    model="phala/qwen-2.5-7b-instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,  # ⚡ Faster perceived latency
    max_tokens=500,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
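The "cache attestations" tip above can be sketched as a session-level cache: pay the full verification cost once, then reuse the result until a TTL expires. The class below is an illustrative pattern, not a RedPill SDK feature; the TTL value and `verify_fn` signature are assumptions.

```python
import time

class AttestationCache:
    """Verify an attestation once per session, then reuse the result.

    `verify_fn(report)` performs the full check (signature, measurement,
    nonce) and returns True/False; `ttl_s` bounds how long a successful
    verification is reused. Failures are never cached.
    """

    def __init__(self, verify_fn, ttl_s=300.0):
        self.verify_fn = verify_fn
        self.ttl_s = ttl_s
        self._cache = {}  # key -> monotonic expiry of a successful check

    def is_verified(self, key, report):
        now = time.monotonic()
        expiry = self._cache.get(key)
        if expiry is not None and now < expiry:
            return True                      # cache hit: skip re-verification
        if self.verify_fn(report):           # cache miss: full verification
            self._cache[key] = now + self.ttl_s
            return True
        return False                         # never cache failures

# Demo with a counting stand-in verifier (real code would check the
# attestation signature against the Intel/NVIDIA roots of trust).
calls = {"n": 0}
def fake_verify(report):
    calls["n"] += 1
    return report == b"good-report"

cache = AttestationCache(fake_verify, ttl_s=60.0)
print(cache.is_verified("gateway", b"good-report"))  # → True (full check)
print(cache.is_verified("gateway", b"good-report"))  # → True (cache hit)
print("verify calls:", calls["n"])  # → 1
```

Keep the TTL short enough that a redeployed gateway (with a new measurement) is re-verified promptly.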
## Testing Methodology

All benchmarks were measured using:
- Geographic location: US-West (Oregon)
- Network: 1Gbps dedicated
- Test duration: 7 days continuous
- Retry policy: exponential backoff
- Payload size: 500-2000 tokens average
- Verification: Full attestation chain checked
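Reproducing these numbers reduces to collecting per-request latencies and reporting percentiles. A minimal, stdlib-only sketch of that reduction, with synthetic samples standing in for real timed API calls:

```python
import statistics
import time

def time_call_ms(fn, *args, **kwargs):
    """Time a single call and return its duration in milliseconds."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return (time.perf_counter() - start) * 1000.0

def summarize(samples_ms):
    """P50/P95/P99 from raw latency samples (milliseconds)."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# In a real run you would collect time_call_ms(client.chat.completions.create, ...)
# over many requests; synthetic samples keep the sketch runnable anywhere.
samples = [float(x) for x in range(1, 1001)]
print(summarize(samples))
```

`method="inclusive"` interpolates between sorted samples, so the percentiles are deterministic for a fixed sample set.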
### Verify Yourself

Run your own performance tests →
## Next Steps