Available Models

RedPill offers six confidential AI models from Phala Network, all running entirely inside GPU TEEs with FP8 quantization for near-native performance.
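
All Python examples on this page use the OpenAI SDK pointed at RedPill's OpenAI-compatible endpoint. A minimal setup sketch, assuming the https://api.redpill.ai/v1 base URL used by the attestation example further down:

from openai import OpenAI

# RedPill exposes an OpenAI-compatible API, so the standard SDK works unchanged.
client = OpenAI(
    base_url="https://api.redpill.ai/v1",
    api_key="YOUR_API_KEY",
)

Later examples reuse this client object.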

Model Comparison

| Model | Parameters | Context | Modality | Price (Prompt / Completion, per token) |
| --- | --- | --- | --- | --- |
| DeepSeek V3 | 685B (MoE) | 164K | Text | $0.00049 / $0.00114 |
| GPT-OSS 120B | 117B (MoE) | 131K | Text | $0.0001 / $0.00049 |
| GPT-OSS 20B | 21B (MoE) | 131K | Text | $0.0001 / $0.0004 |
| Qwen2.5 VL 72B | 72B | 128K | Vision + Text | $0.00059 / $0.00059 |
| Qwen 2.5 7B | 7B | 33K | Text | $0.00004 / $0.0001 |
| Gemma 3 27B | 27B | 54K | Text | $0.00011 / $0.0004 |
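
Model availability can change over time. Assuming RedPill also exposes the standard OpenAI-compatible model listing endpoint (an assumption, not confirmed on this page), you can enumerate the current catalog programmatically:

# List models visible to your API key (assumes the standard
# OpenAI-compatible GET /v1/models endpoint).
for model in client.models.list():
    if model.id.startswith("phala/"):
        print(model.id)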

Model Details

phala/deepseek-chat-v3-0324

Best Overall Quality

Flagship model for complex reasoning and analysis
Specifications:
  • Parameters: 685 billion (Mixture-of-Experts)
  • Context Length: 163,840 tokens (~123K words)
  • Quantization: FP8
  • Modality: Text → Text
Description: DeepSeek V3 is a 685B-parameter mixture-of-experts model, the flagship of the DeepSeek family. It excels at:
  • Complex reasoning and analysis
  • Mathematical problem solving
  • Code generation and debugging
  • Long-form content creation
  • Multi-turn conversations
Use Cases:
  • Financial analysis and modeling
  • Legal document review
  • Medical diagnosis support
  • Research paper analysis
  • Advanced code generation
Example:
response = client.chat.completions.create(
    model="phala/deepseek-chat-v3-0324",
    messages=[{
        "role": "user",
        "content": "Analyze the legal implications of this contract clause: ..."
    }]
)

phala/gpt-oss-120b

OpenAI Architecture

OpenAI’s open-weight model with familiar behavior
Specifications:
  • Parameters: 117 billion (MoE, 5.1B active)
  • Context Length: 131,072 tokens
  • Quantization: FP8
  • Modality: Text → Text
Description: GPT-OSS-120B is OpenAI’s open-weight model designed for high-reasoning and agentic use cases. It is optimized to run on a single H100 GPU and offers:
  • Configurable reasoning depth
  • Full chain-of-thought access
  • Native function calling
  • Structured output generation
Use Cases:
  • AI agents and automation
  • Complex task planning
  • Tool use and API integration
  • Production workloads requiring reasoning
Example:
response = client.chat.completions.create(
    model="phala/gpt-oss-120b",
    messages=[{
        "role": "user",
        "content": "Create a step-by-step plan to migrate our infrastructure to TEE"
    }],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_infrastructure_status",
            "description": "Get current infrastructure state",
            "parameters": {"type": "object", "properties": {}}
        }
    }]
)
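
If the model decides to invoke the declared function, the reply carries tool_calls instead of plain text. A minimal handling sketch using the standard OpenAI response shape (dispatch to your own implementation of the hypothetical get_infrastructure_status):

import json

message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        # Arguments arrive as a JSON string chosen by the model.
        args = json.loads(call.function.arguments or "{}")
        print(f"Model requested {call.function.name} with {args}")
else:
    print(message.content)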

phala/gpt-oss-20b

Efficient & Fast

Smaller model for low-latency applications
Specifications:
  • Parameters: 21 billion (MoE, 3.6B active)
  • Context Length: 131,072 tokens
  • Quantization: FP8
  • Modality: Text → Text
Description: GPT-OSS-20B is optimized for lower-latency inference and consumer/single-GPU deployment. Features:
  • OpenAI Harmony response format
  • Reasoning level configuration
  • Function calling and tool use
  • Structured outputs
  • Apache 2.0 license
Use Cases:
  • Real-time chatbots
  • Edge deployment
  • Cost-sensitive applications
  • High-throughput workloads
Example:
response = client.chat.completions.create(
    model="phala/gpt-oss-20b",
    messages=[{
        "role": "user",
        "content": "Summarize this customer support ticket"
    }],
    max_tokens=150
)

phala/qwen2.5-vl-72b-instruct

Vision + Language

Multimodal model for image understanding
Specifications:
  • Parameters: 72 billion
  • Context Length: 128,000 tokens
  • Quantization: FP8
  • Modality: Text + Image → Text
Description: Qwen2.5-VL is proficient in:
  • Recognizing common objects (flowers, birds, fish, insects)
  • Analyzing texts, charts, icons, graphics
  • Understanding layouts within images
  • Document understanding
  • Visual reasoning
Use Cases:
  • Medical image analysis
  • Document OCR and understanding
  • Chart and graph analysis
  • Visual quality inspection
  • Satellite imagery analysis
Example:
response = client.chat.completions.create(
    model="phala/qwen2.5-vl-72b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Analyze this medical X-ray for potential issues"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/xray.jpg"
                }
            }
        ]
    }]
)
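
Remote URLs are not required; the OpenAI-compatible message format also accepts images inlined as base64 data URLs, which avoids hosting the file anywhere. A sketch for a local file:

import base64

# Embed a local image as a base64 data URL instead of a public link.
with open("xray.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="phala/qwen2.5-vl-72b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)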

phala/qwen-2.5-7b-instruct

Budget-Friendly

Most cost-effective confidential model
Specifications:
  • Parameters: 7 billion
  • Context Length: 32,768 tokens
  • Quantization: FP8
  • Modality: Text → Text
Description: Qwen 2.5 7B brings significant improvements over its Qwen 2 predecessor:
  • Enhanced coding and mathematics capabilities
  • Better instruction following
  • Improved long text generation (8K+ tokens)
  • Structured data understanding (tables, JSON)
  • Multilingual support (29+ languages)
Use Cases:
  • High-volume applications
  • Multilingual support
  • Simple chatbots
  • Text classification
  • Data extraction
Supported Languages: Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.
Example:
response = client.chat.completions.create(
    model="phala/qwen-2.5-7b-instruct",
    messages=[{
        "role": "user",
        "content": "Extract key information from this invoice (JSON format)"
    }],
    response_format={"type": "json_object"}
)
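
Because response_format={"type": "json_object"} constrains the completion to valid JSON, the content field parses directly. A short follow-up to the request above:

import json

# The model was constrained to emit valid JSON, so this parse should succeed.
invoice = json.loads(response.choices[0].message.content)
print(invoice)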

phala/gemma-3-27b-it

Google's Latest

Strong multilingual support from Google’s latest open model family (text-only in this deployment)
Specifications:
  • Parameters: 27 billion
  • Context Length: 53,920 tokens
  • Quantization: FP8
  • Modality: Text → Text
Description: The Gemma 3 model family introduces:
  • Multimodality support (this confidential deployment is currently text-only)
  • Context windows up to 128K tokens (configured here at 53,920)
  • 140+ language understanding
  • Improved math and reasoning
  • Structured outputs
  • Function calling
Use Cases:
  • Multilingual applications (140+ languages)
  • Math and reasoning tasks
  • Structured data generation
  • Function calling workflows
  • Chat applications
Example:
response = client.chat.completions.create(
    model="phala/gemma-3-27b-it",
    messages=[{
        "role": "user",
        "content": "Solve this calculus problem step by step"
    }]
)

Feature Comparison

| Feature | DeepSeek V3 | GPT-OSS 120B | GPT-OSS 20B | Qwen2.5 VL | Qwen 2.5 7B | Gemma 3 27B |
| --- | --- | --- | --- | --- | --- | --- |
| TEE Protected | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Function Calling | — | ✅ | ✅ | — | — | ✅ |
| Vision | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| Structured Output | — | ✅ | ✅ | — | ✅ | ✅ |
| Streaming | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multilingual | — | — | — | — | ✅ (29+) | ✅ (140+) |

(✅ supported · ❌ not supported · — not stated on this page)
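
Streaming works the same as on any OpenAI-compatible endpoint: pass stream=True and consume incremental chunks. A minimal sketch:

stream = client.chat.completions.create(
    model="phala/gpt-oss-20b",
    messages=[{"role": "user", "content": "Explain TEE in one paragraph"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the completion.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)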

Selection Guide

By Quality Requirements

Highest Quality:
  1. phala/deepseek-chat-v3-0324 (685B) - Best overall
  2. phala/gpt-oss-120b (117B) - OpenAI architecture
  3. phala/qwen2.5-vl-72b-instruct (72B) - Vision tasks
Balanced:
  1. phala/gemma-3-27b-it (27B) - Good quality, reasonable cost
  2. phala/gpt-oss-20b (21B) - Fast and efficient
Budget:
  1. phala/qwen-2.5-7b-instruct (7B) - Most economical

By Use Case

Complex Reasoning:
  • phala/deepseek-chat-v3-0324 - Best for complex analysis
  • phala/gpt-oss-120b - OpenAI-style reasoning
Vision Tasks:
  • phala/qwen2.5-vl-72b-instruct - Only vision model
Multilingual:
  • phala/gemma-3-27b-it - 140+ languages
  • phala/qwen-2.5-7b-instruct - 29+ languages
High Volume:
  • phala/qwen-2.5-7b-instruct - Lowest cost
  • phala/gpt-oss-20b - Fast inference
Function Calling:
  • phala/gpt-oss-120b - Best for agents
  • phala/gemma-3-27b-it - Good function support
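
The guide above condenses into a small lookup you can keep in application code. A sketch (the mapping is one reasonable reading of this page, not an official default):

# One reasonable reading of the selection guide above; not an official mapping.
MODEL_BY_TASK = {
    "complex_reasoning": "phala/deepseek-chat-v3-0324",
    "agents": "phala/gpt-oss-120b",
    "low_latency": "phala/gpt-oss-20b",
    "vision": "phala/qwen2.5-vl-72b-instruct",
    "high_volume": "phala/qwen-2.5-7b-instruct",
    "multilingual": "phala/gemma-3-27b-it",
}

def pick_model(task: str) -> str:
    """Fall back to the budget tier when a task has no dedicated entry."""
    return MODEL_BY_TASK.get(task, "phala/qwen-2.5-7b-instruct")

print(pick_model("vision"))  # phala/qwen2.5-vl-72b-instruct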

Performance Benchmarks

All models run at ~99% of native performance in TEE mode:
| Model | Native Speed | TEE Speed | Overhead |
| --- | --- | --- | --- |
| DeepSeek V3 | 85 tok/s | 84 tok/s | ~1% |
| GPT-OSS 120B | 95 tok/s | 94 tok/s | ~1% |
| GPT-OSS 20B | 120 tok/s | 118 tok/s | ~2% |
| Qwen2.5 VL 72B | 75 tok/s | 74 tok/s | ~1% |
| Qwen 2.5 7B | 150 tok/s | 148 tok/s | ~1% |
| Gemma 3 27B | 100 tok/s | 99 tok/s | ~1% |
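
Client-side numbers will come in lower than the server-side figures above because they include network latency. A rough way to measure throughput yourself (completion_tokens is part of the standard usage field in the response):

import time

start = time.monotonic()
response = client.chat.completions.create(
    model="phala/qwen-2.5-7b-instruct",
    messages=[{"role": "user", "content": "Write a 200-word summary of TEEs"}],
)
elapsed = time.monotonic() - start

# usage.completion_tokens is part of the standard OpenAI-compatible response.
tok_per_s = response.usage.completion_tokens / elapsed
print(f"{tok_per_s:.1f} tok/s including network overhead")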

Attestation Support

All models provide cryptographic attestation:
# Get attestation for any Phala model
curl "https://api.redpill.ai/v1/attestation/report?model=phala/deepseek-chat-v3-0324" \
  -H "Authorization: Bearer YOUR_API_KEY"

Attestation Guide

Learn how to verify TEE execution →

Pricing Comparison

| Model | Cost per 1M Prompt Tokens | Quality/$ Ratio |
| --- | --- | --- |
| Qwen 2.5 7B | $40 | ⭐⭐⭐⭐ Excellent |
| GPT-OSS 20B | $100 | ⭐⭐⭐⭐ Excellent |
| GPT-OSS 120B | $100 | ⭐⭐⭐⭐⭐ Excellent |
| Gemma 3 27B | $110 | ⭐⭐⭐ Good |
| DeepSeek V3 | $490 | ⭐⭐⭐⭐⭐ Best |
| Qwen2.5 VL 72B | $590 | ⭐⭐⭐⭐ Vision |
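
Billing is linear in tokens, so estimating a request's cost is simple arithmetic over the per-token prices in the comparison table above. A sketch (prices copied from this page; check live pricing before relying on them):

# Per-token (prompt, completion) prices from the comparison table above (subset).
PRICES = {
    "phala/deepseek-chat-v3-0324": (0.00049, 0.00114),
    "phala/qwen-2.5-7b-instruct": (0.00004, 0.0001),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    prompt_price, completion_price = PRICES[model]
    return prompt_tokens * prompt_price + completion_tokens * completion_price

# e.g. 1,200 prompt + 400 completion tokens on DeepSeek V3:
print(f"${estimate_cost('phala/deepseek-chat-v3-0324', 1200, 400):.4f}")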

Migration Guide

From Regular Models to Phala

Simply change the model name:
# Before (regular model)
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[...]
)

# After (Phala confidential model)
response = client.chat.completions.create(
    model="phala/gpt-oss-120b",  # Similar to GPT-4
    messages=[...]  # Same API!
)
No other code changes required!
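
Since only the model string changes, a common pattern is to read it from configuration so you can switch between regular and confidential inference without touching code. A sketch (CHAT_MODEL is a hypothetical environment variable):

import os

# Flip between regular and confidential models via an environment variable.
MODEL = os.environ.get("CHAT_MODEL", "phala/gpt-oss-120b")

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Hello"}],
)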

FAQs

Which model is most similar to GPT-4?
phala/gpt-oss-120b - It’s OpenAI’s architecture and has similar capabilities.

Which model is the fastest?
phala/qwen-2.5-7b-instruct (150 tok/s) - Smallest and fastest.

Which model supports images?
phala/qwen2.5-vl-72b-instruct - The only vision model currently.

How do these models compare to GPT-4?
phala/deepseek-chat-v3-0324 matches or exceeds GPT-4 on many benchmarks, with full TEE protection.

Can I fine-tune these models?
Enterprise customers can fine-tune models in TEE. Contact sales@redpill.ai.

Does FP8 quantization reduce quality?
FP8 reduces model size and increases speed with minimal quality loss (~1%), enabling efficient TEE inference.

Next Steps