Available Models

RedPill offers 15 confidential AI models that run entirely inside GPU TEEs (Trusted Execution Environments), spread across three TEE providers: Phala Network, Tinfoil, and Near AI.

  • Phala Network: 8 models with FP8 quantization
  • Tinfoil: 4 models including DeepSeek R1
  • Near AI: 3 models including DeepSeek V3.1

Phala TEE Models

Models powered by Phala Network’s GPU TEE infrastructure with FP8 quantization:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| deepseek/deepseek-v3.2 | 671B (MoE) | 128K | Text | $0.28 / $1.14 per M |
| deepseek/deepseek-chat-v3-0324 | 685B (MoE) | 163K | Text | $0.28 / $1.14 per M |
| openai/gpt-oss-120b | 117B (MoE) | 131K | Text | $0.10 / $0.49 per M |
| openai/gpt-oss-20b | 21B (MoE) | 131K | Text | $0.04 / $0.15 per M |
| qwen/qwen2.5-vl-72b-instruct | 72B | 128K | Vision + Text | $0.59 / $0.59 per M |
| qwen/qwen-2.5-7b-instruct | 7B | 32K | Text | $0.04 / $0.10 per M |
| google/gemma-3-27b-it | 27B | 53K | Text | $0.11 / $0.40 per M |
| sentence-transformers/all-minilm-l6-v2 | 22M | 512 | Embeddings | $0.005 / - per M |
New: DeepSeek V3.2 is the latest DeepSeek model now available with full GPU TEE protection on Phala Network.

Tinfoil TEE Models

Models powered by Tinfoil’s confidential computing infrastructure:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| deepseek/deepseek-r1-0528 | 685B (MoE) | 128K | Text | $0.55 / $2.19 per M |
| qwen/qwen3-coder-480b-a35b-instruct | 480B (MoE) | 131K | Text | $0.30 / $1.49 per M |
| qwen/qwen3-vl-30b-a3b-instruct | 30B (MoE) | 131K | Vision + Text | $0.15 / $0.60 per M |
| meta-llama/llama-3.3-70b-instruct | 70B | 128K | Text | $0.10 / $0.40 per M |
New: DeepSeek R1 is a reasoning model with chain-of-thought capabilities, optimized for complex multi-step tasks.

Near AI TEE Models

Models powered by Near AI’s decentralized TEE infrastructure:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| deepseek/deepseek-chat-v3.1 | 671B (MoE) | 128K | Text | $0.28 / $1.14 per M |
| qwen/qwen3-30b-a3b-instruct-2507 | 30B (MoE) | 131K | Text | $0.15 / $0.60 per M |
| z-ai/glm-4.6 | 130B | 128K | Text | $0.50 / $2.00 per M |
New: DeepSeek V3.1 is the latest hybrid reasoning model supporting both thinking and non-thinking modes. GLM-4.6 offers strong Chinese and English bilingual capabilities.

Identifying TEE Models

TEE models can be identified via the service API by checking the providers field:
curl https://service.redpill.ai/api/models | \
  jq '.data[] | select(.providers[] | test("phala|tinfoil|nearai")) | {id, providers}'
TEE providers: phala, tinfoil, nearai
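The same filter can be sketched in Python; the endpoint and field names mirror the jq query above:
import requests

# List TEE-backed models by checking each model's providers field
TEE_PROVIDERS = {"phala", "tinfoil", "nearai"}

models = requests.get("https://service.redpill.ai/api/models").json()["data"]
for model in models:
    if TEE_PROVIDERS.intersection(model["providers"]):
        print(model["id"], model["providers"])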

Model Details
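The Python examples below use an OpenAI-compatible client. A minimal setup sketch; the base URL here is an assumption based on the api.redpill.ai host used in the attestation example further down, so verify it against your dashboard:
from openai import OpenAI

# Assumed OpenAI-compatible endpoint; all examples below reuse this client
client = OpenAI(
    base_url="https://api.redpill.ai/v1",
    api_key="YOUR_API_KEY",
)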

phala/deepseek-chat-v3-0324

Best Overall Quality

Flagship model for complex reasoning and analysis
Specifications:
  • Parameters: 685 billion (Mixture-of-Experts)
  • Context Length: 163,840 tokens (163K)
  • Quantization: FP8
  • Modality: Text → Text
Description: DeepSeek V3 is a 685B-parameter mixture-of-experts model, the flagship of the DeepSeek family. It excels at:
  • Complex reasoning and analysis
  • Mathematical problem solving
  • Code generation and debugging
  • Long-form content creation
  • Multi-turn conversations
Use Cases:
  • Financial analysis and modeling
  • Legal document review
  • Medical diagnosis support
  • Research paper analysis
  • Advanced code generation
Example:
response = client.chat.completions.create(
    model="phala/deepseek-chat-v3-0324",
    messages=[{
        "role": "user",
        "content": "Analyze the legal implications of this contract clause: ..."
    }]
)

phala/gpt-oss-120b

OpenAI Architecture

OpenAI’s open-weight model with familiar behavior
Specifications:
  • Parameters: 117 billion (MoE, 5.1B active)
  • Context Length: 131,072 tokens
  • Quantization: FP8
  • Modality: Text → Text
Description: GPT-OSS-120B is OpenAI's open-weight model designed for high-reasoning and agentic use cases. It is optimized to run on a single H100 GPU and offers:
  • Configurable reasoning depth
  • Full chain-of-thought access
  • Native function calling
  • Structured output generation
Use Cases:
  • AI agents and automation
  • Complex task planning
  • Tool use and API integration
  • Production workloads requiring reasoning
Example:
response = client.chat.completions.create(
    model="phala/gpt-oss-120b",
    messages=[{
        "role": "user",
        "content": "Create a step-by-step plan to migrate our infrastructure to TEE"
    }],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_infrastructure_status",
            "description": "Get current infrastructure state"
        }
    }]
)
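When the model decides to call the declared function, the reply carries tool calls instead of text. A minimal handling sketch, assuming the standard OpenAI tool-calling flow (get_infrastructure_status is the hypothetical function declared above):
import json

message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        if call.function.name == "get_infrastructure_status":
            # Dispatch to your own implementation and send the result back
            args = json.loads(call.function.arguments or "{}")
            print("Model requested infrastructure status:", args)
else:
    print(message.content)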

phala/gpt-oss-20b

Efficient & Fast

Smaller model for low-latency applications
Specifications:
  • Parameters: 21 billion (MoE, 3.6B active)
  • Context Length: 131,072 tokens
  • Quantization: FP8
  • Modality: Text → Text
Description: GPT-OSS-20B is optimized for lower-latency inference and consumer/single-GPU deployment. Features:
  • OpenAI Harmony response format
  • Reasoning level configuration
  • Function calling and tool use
  • Structured outputs
  • Apache 2.0 license
Use Cases:
  • Real-time chatbots
  • Edge deployment
  • Cost-sensitive applications
  • High-throughput workloads
Example:
response = client.chat.completions.create(
    model="phala/gpt-oss-20b",
    messages=[{
        "role": "user",
        "content": "Summarize this customer support ticket"
    }],
    max_tokens=150
)
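For the real-time chatbot use case, responses can also be streamed token by token. A sketch using the standard OpenAI streaming interface:
response = client.chat.completions.create(
    model="phala/gpt-oss-20b",
    messages=[{
        "role": "user",
        "content": "Summarize this customer support ticket"
    }],
    stream=True
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)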

phala/qwen2.5-vl-72b-instruct

Vision + Language

Multimodal model for image understanding
Specifications:
  • Parameters: 72 billion
  • Context Length: 128,000 tokens
  • Quantization: FP8
  • Modality: Text + Image → Text
Description: Qwen2.5-VL is proficient in:
  • Recognizing common objects (flowers, birds, fish, insects)
  • Analyzing texts, charts, icons, graphics
  • Understanding layouts within images
  • Document understanding
  • Visual reasoning
Use Cases:
  • Medical image analysis
  • Document OCR and understanding
  • Chart and graph analysis
  • Visual quality inspection
  • Satellite imagery analysis
Example:
response = client.chat.completions.create(
    model="phala/qwen2.5-vl-72b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Analyze this medical X-ray for potential issues"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/xray.jpg"
                }
            }
        ]
    }]
)
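For confidential images such as medical scans, the image can be sent inline as a base64 data URL instead of a public link. A sketch assuming a local file xray.jpg:
import base64

with open("xray.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phala/qwen2.5-vl-72b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this medical X-ray for potential issues"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}}
        ]
    }]
)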

phala/qwen-2.5-7b-instruct

Budget-Friendly

Most cost-effective confidential model
Specifications:
  • Parameters: 7 billion
  • Context Length: 32,768 tokens (32K)
  • Quantization: FP8
  • Modality: Text → Text
Description: Qwen 2.5 7B brings significant improvements:
  • Enhanced coding and mathematics capabilities
  • Better instruction following
  • Improved long text generation (8K+ tokens)
  • Structured data understanding (tables, JSON)
  • Multilingual support (29+ languages)
Use Cases:
  • High-volume applications
  • Multilingual support
  • Simple chatbots
  • Text classification
  • Data extraction
Supported Languages: Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.
Example:
response = client.chat.completions.create(
    model="phala/qwen-2.5-7b-instruct",
    messages=[{
        "role": "user",
        "content": "Extract key information from this invoice (JSON format)"
    }],
    response_format={"type": "json_object"}
)
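Because response_format is set to json_object, the completion should be valid JSON and can be parsed directly:
import json

invoice = json.loads(response.choices[0].message.content)
print(invoice)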

phala/gemma-3-27b-it

Google's Latest

Multimodal capabilities with strong multilingual support
Specifications:
  • Parameters: 27 billion
  • Context Length: 53,920 tokens (53K)
  • Quantization: FP8
  • Modality: Text → Text
Description: Gemma 3 introduces:
  • Multimodality support
  • Context windows up to 128K tokens
  • 140+ language understanding
  • Improved math and reasoning
  • Structured outputs
  • Function calling
Use Cases:
  • Multilingual applications (140+ languages)
  • Math and reasoning tasks
  • Structured data generation
  • Function calling workflows
  • Chat applications
Example:
response = client.chat.completions.create(
    model="phala/gemma-3-27b-it",
    messages=[{
        "role": "user",
        "content": "Solve this calculus problem step by step"
    }]
)

Feature Comparison

| Feature | DeepSeek V3 | GPT-OSS 120B | GPT-OSS 20B | Qwen2.5 VL | Qwen3 VL | Qwen 2.5 7B | Gemma 3 27B |
|---|---|---|---|---|---|---|---|
| TEE Protected | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Function Calling | - | ✅ | ✅ | - | - | - | ✅ |
| Vision | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ |
| Structured Output | - | ✅ | ✅ | - | - | ✅ | ✅ |
| Streaming | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multilingual | - | - | - | - | - | ✅ (29+) | ✅ (140+) |

Selection Guide

By Quality Requirements

Highest Quality:
  1. phala/deepseek-chat-v3-0324 (685B) - Best overall
  2. phala/qwen3-vl-235b-a22b-instruct (235B) - Advanced vision
  3. phala/gpt-oss-120b (117B) - OpenAI architecture
Vision + Language:
  1. phala/qwen3-vl-235b-a22b-instruct (235B) - Advanced vision
  2. phala/qwen2.5-vl-72b-instruct (72B) - Standard vision
Balanced:
  1. phala/gemma-3-27b-it (27B) - Good quality, reasonable cost
  2. phala/gpt-oss-20b (21B) - Fast and efficient
Budget:
  1. phala/qwen-2.5-7b-instruct (7B) - Most economical

By Use Case

Complex Reasoning:
  • phala/deepseek-chat-v3-0324 - Best for complex analysis
  • phala/gpt-oss-120b - OpenAI-style reasoning
Advanced Vision Tasks:
  • phala/qwen3-vl-235b-a22b-instruct - Scientific/technical documents
  • phala/qwen2.5-vl-72b-instruct - General vision tasks
Multilingual:
  • phala/gemma-3-27b-it - 140+ languages
  • phala/qwen-2.5-7b-instruct - 29+ languages
High Volume:
  • phala/qwen-2.5-7b-instruct - Lowest cost
  • phala/gpt-oss-20b - Fast inference
Function Calling:
  • phala/gpt-oss-120b - Best for agents
  • phala/gemma-3-27b-it - Good function support

Performance Benchmarks

All models run at ~99% of native performance in TEE mode:
| Model | Native Speed | TEE Speed | Overhead |
|---|---|---|---|
| DeepSeek V3 | 85 tok/s | 84 tok/s | ~1% |
| GPT-OSS 120B | 95 tok/s | 94 tok/s | ~1% |
| GPT-OSS 20B | 120 tok/s | 118 tok/s | ~2% |
| Qwen3 VL 235B | 45 tok/s | 44 tok/s | ~1% |
| Qwen2.5 VL 72B | 75 tok/s | 74 tok/s | ~1% |
| Qwen 2.5 7B | 150 tok/s | 148 tok/s | ~1% |
| Gemma 3 27B | 100 tok/s | 99 tok/s | ~1% |
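The overhead column follows directly from the two throughput figures; a quick arithmetic check:
# Overhead = (native - tee) / native, using the table's numbers
benchmarks = {
    "DeepSeek V3": (85, 84),
    "GPT-OSS 120B": (95, 94),
    "GPT-OSS 20B": (120, 118),
    "Qwen3 VL 235B": (45, 44),
    "Qwen2.5 VL 72B": (75, 74),
    "Qwen 2.5 7B": (150, 148),
    "Gemma 3 27B": (100, 99),
}
for model, (native, tee) in benchmarks.items():
    print(f"{model}: {(native - tee) / native:.1%} overhead")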

Attestation Support

All models provide cryptographic attestation:
# Get attestation for any Phala model
curl "https://api.redpill.ai/v1/attestation/report?model=phala/deepseek-chat-v3-0324" \
  -H "Authorization: Bearer YOUR_API_KEY"

Attestation Guide

Learn how to verify TEE execution →

Pricing Comparison

| Model | Prompt Cost per M Tokens | Quality/$ Ratio |
|---|---|---|
| Qwen 2.5 7B | $0.04 | ⭐⭐⭐⭐ Excellent |
| GPT-OSS 20B | $0.04 | ⭐⭐⭐⭐ Excellent |
| GPT-OSS 120B | $0.10 | ⭐⭐⭐⭐⭐ Best |
| Gemma 3 27B | $0.11 | ⭐⭐⭐ Good |
| DeepSeek V3 | $0.28 | ⭐⭐⭐⭐⭐ Best |
| Qwen3 VL 235B | $0.30 | ⭐⭐⭐⭐⭐ Advanced Vision |
| Qwen2.5 VL 72B | $0.59 | ⭐⭐⭐⭐ Vision |
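To compare models on a concrete workload, a per-request cost can be estimated from the prices above. A sketch with hypothetical token counts; prices are (prompt, completion) in dollars per million tokens:
PRICES = {
    "phala/qwen-2.5-7b-instruct": (0.04, 0.10),
    "phala/gpt-oss-120b": (0.10, 0.49),
    "phala/deepseek-chat-v3-0324": (0.28, 1.14),
}

def estimate_cost(model, prompt_tokens, completion_tokens):
    prompt_price, completion_price = PRICES[model]
    return (prompt_tokens * prompt_price + completion_tokens * completion_price) / 1_000_000

# e.g. a 2,000-token prompt with a 500-token completion
print(f"${estimate_cost('phala/deepseek-chat-v3-0324', 2000, 500):.4f}")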

Migration Guide

From Regular Models to Phala

Simply change the model name:
# Before (regular model)
response = client.chat.completions.create(
    model="openai/gpt-5",
    messages=[...]
)

# After (Phala confidential model)
response = client.chat.completions.create(
    model="phala/gpt-oss-120b",  # Similar to GPT-4
    messages=[...]  # Same API!
)
No other code changes required!

FAQs

Which model is closest to OpenAI's GPT models?
phala/gpt-oss-120b. It uses OpenAI's architecture and offers similar capabilities.

Which model is the fastest?
phala/qwen-2.5-7b-instruct (150 tok/s). It is the smallest and fastest model.

Which models support vision?
  • phala/qwen3-vl-235b-a22b-instruct - advanced vision (235B)
  • phala/qwen2.5-vl-72b-instruct - standard vision (72B)

How does DeepSeek V3 compare to GPT-4?
phala/deepseek-chat-v3-0324 matches or exceeds GPT-4 on many benchmarks, with full TEE protection.

Can I fine-tune models?
Enterprise customers can fine-tune models in TEE. Contact [email protected].

Does FP8 quantization affect quality?
FP8 reduces model size and increases inference speed with minimal quality loss (~1%), enabling efficient TEE inference.

Next Steps