All Confidential AI Models

RedPill offers 17 confidential AI models running entirely in GPU TEE (Trusted Execution Environment) across 4 TEE providers: Phala Network, Tinfoil, Near AI, and Chutes.

Phala Network

8 models with FP8 quantization

Tinfoil

4 models including Kimi K2 Thinking

Near AI

3 models including DeepSeek V3.1

Chutes

2 models including DeepSeek V3.2

Phala TEE Models

Models powered by Phala Network’s GPU TEE infrastructure with FP8 quantization:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| phala/glm-5 | Large | 202K | Text | $1.20 / $3.50 per M |
| phala/gpt-oss-120b | 117B (MoE) | 131K | Text | $0.10 / $0.49 per M |
| phala/qwen3-vl-30b-a3b-instruct | 30B (MoE) | 128K | Vision + Text | $0.20 / $0.70 per M |
| phala/gemma-3-27b-it | 27B | 53K | Vision + Text | $0.11 / $0.40 per M |
| phala/uncensored-24b | 24B | 32K | Text | $0.20 / $0.90 per M |
| phala/gpt-oss-20b | 21B (MoE) | 131K | Text | $0.04 / $0.15 per M |
| phala/glm-4.7-flash | ~30B | 202K | Text | $0.10 / $0.43 per M |
| phala/qwen-2.5-7b-instruct | 7B | 32K | Text | $0.04 / $0.10 per M |
New: GLM-5 and GLM 4.7 Flash are the latest ZhipuAI models now available with full GPU TEE protection. Venice Uncensored 24B offers an alignment-free model for advanced use cases.
The model ID phala/qwen2.5-vl-72b-instruct is a legacy alias that now routes to phala/qwen3-vl-30b-a3b-instruct.

Tinfoil TEE Models

Models powered by Tinfoil’s confidential computing infrastructure:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| deepseek/deepseek-r1-0528 | 685B (MoE) | 163K | Text | $2.00 / $2.00 per M |
| qwen/qwen3-coder-480b-a35b-instruct | 480B (MoE) | 262K | Text | $2.00 / $2.00 per M |
| moonshotai/kimi-k2-thinking | 1T (MoE, 32B active) | 262K | Text | $2.00 / $2.00 per M |
| meta-llama/llama-3.3-70b-instruct | 70B | 131K | Text | $2.00 / $2.00 per M |
New: Kimi K2 Thinking is Moonshot AI’s most advanced open reasoning model, optimized for persistent step-by-step thought and dynamic tool invocation across hundreds of turns. Tinfoil models use flat-rate pricing at $2/M tokens.

Near AI TEE Models

Models powered by Near AI’s decentralized TEE infrastructure:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| deepseek/deepseek-chat-v3.1 | 671B (MoE) | 163K | Text | $1.00 / $2.50 per M |
| qwen/qwen3-30b-a3b-instruct-2507 | 30B (MoE) | 262K | Text | $0.15 / $0.45 per M |
| z-ai/glm-4.7 | 130B | 131K | Text | $0.85 / $3.30 per M |
GLM-4.7 is a ZhipuAI flagship model with enhanced programming capabilities and more stable multi-step reasoning. DeepSeek V3.1 supports both thinking and non-thinking modes.

Chutes TEE Models

Models powered by Chutes’ confidential computing infrastructure:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| deepseek/deepseek-v3.2 | 685B (MoE) | 163K | Text | $0.27 / $0.40 per M |
| moonshotai/kimi-k2.5 | Large (MoE) | 262K | Text + Image | $0.60 / $3.00 per M |
New: Chutes offers competitive pricing on DeepSeek V3.2 and Kimi K2.5. Kimi K2.5 is a native multimodal model with state-of-the-art visual coding and agentic capabilities.

Identifying TEE Models

TEE models can be identified in three ways:

1. Use the dedicated endpoint

# Get all models available through Phala
curl https://api.redpill.ai/v1/models/phala \
  -H "Authorization: Bearer YOUR_API_KEY"

2. Check the providers field

Every model in the API response includes a providers array. Filter for TEE provider names:
curl https://api.redpill.ai/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY" | \
  jq '.data[] | select(.providers[] | test("phala|tinfoil|near-ai|chutes")) | {id, providers}'
TEE providers: phala, tinfoil, near-ai, chutes
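The same filter can be written in Python with only the standard library. This is a sketch: it assumes the `/v1/models` response shape described above, where each entry carries an `id` and a `providers` array.

```python
import json
import urllib.request

TEE_PROVIDERS = {"phala", "tinfoil", "near-ai", "chutes"}

def tee_models(models):
    """Keep only models whose providers list includes a TEE provider."""
    return [
        {"id": m["id"], "providers": m["providers"]}
        for m in models
        if TEE_PROVIDERS & set(m.get("providers", []))
    ]

def fetch_tee_models(api_key):
    """Fetch /v1/models and filter it down to TEE-backed entries."""
    req = urllib.request.Request(
        "https://api.redpill.ai/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return tee_models(json.load(resp)["data"])
```

`tee_models` is pure, so it can also post-process a response you already have on hand.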

3. Use the phala/ model ID prefix

Any model can be accessed via the phala/ prefix to explicitly route through Phala’s TEE:
# These are equivalent - both route through Phala TEE
response = client.chat.completions.create(model="openai/gpt-oss-120b", ...)
response = client.chat.completions.create(model="phala/gpt-oss-120b", ...)

Model Details

phala/glm-5

Best Overall Quality

Flagship model for complex systems engineering and agent workflows
Specifications:
  • Context Length: 202,752 tokens (202K)
  • Quantization: FP8
  • Modality: Text -> Text
Description: GLM-5 is an open-source foundation model built for complex systems engineering and long-horizon agent workflows. It delivers:
  • Production-grade productivity for large-scale programming tasks
  • Performance aligned to top closed-source models
  • Expert-level system design capabilities
Use Cases:
  • Complex systems engineering
  • Long-horizon agent workflows
  • Large-scale programming tasks
  • Advanced code generation
Example:
response = client.chat.completions.create(
    model="phala/glm-5",
    messages=[{
        "role": "user",
        "content": "Design a distributed system architecture for real-time event processing"
    }]
)

phala/gpt-oss-120b

OpenAI Architecture

OpenAI’s open-weight model with familiar behavior
Specifications:
  • Parameters: 117 billion (MoE, 5.1B active)
  • Context Length: 131,072 tokens
  • Quantization: FP8
  • Modality: Text -> Text
Description: GPT-OSS-120B is OpenAI’s open-weight model designed for high-reasoning and agentic use cases, optimized to run on a single H100 GPU with:
  • Configurable reasoning depth
  • Full chain-of-thought access
  • Native function calling
  • Structured output generation
Use Cases:
  • AI agents and automation
  • Complex task planning
  • Tool use and API integration
  • Production workloads requiring reasoning
Example:
response = client.chat.completions.create(
    model="phala/gpt-oss-120b",
    messages=[{
        "role": "user",
        "content": "Create a step-by-step plan to migrate our infrastructure to TEE"
    }],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_infrastructure_status",
            "description": "Get current infrastructure state"
        }
    }]
)

phala/gpt-oss-20b

Efficient & Fast

Smaller model for low-latency applications
Specifications:
  • Parameters: 21 billion (MoE, 3.6B active)
  • Context Length: 131,072 tokens
  • Quantization: FP8
  • Modality: Text -> Text
Description: GPT-OSS-20B is optimized for lower-latency inference and consumer/single-GPU deployment. Features:
  • OpenAI Harmony response format
  • Reasoning level configuration
  • Function calling and tool use
  • Structured outputs
  • Apache 2.0 license
Use Cases:
  • Real-time chatbots
  • Edge deployment
  • Cost-sensitive applications
  • High-throughput workloads
Example:
response = client.chat.completions.create(
    model="phala/gpt-oss-20b",
    messages=[{
        "role": "user",
        "content": "Summarize this customer support ticket"
    }],
    max_tokens=150
)

phala/qwen3-vl-30b-a3b-instruct

Vision + Language

Multimodal model for image and video understanding
Specifications:
  • Parameters: 30 billion (MoE, 3B active)
  • Context Length: 128,000 tokens
  • Quantization: FP8
  • Modality: Text + Image -> Text
Description: Qwen3-VL-30B unifies strong text generation with visual understanding for images and videos:
  • Real-world and synthetic object perception
  • 2D/3D spatial grounding
  • Long-form visual comprehension
  • GUI automation and visual coding
  • Document AI and OCR
Use Cases:
  • Document OCR and understanding
  • Chart and graph analysis
  • Visual quality inspection
  • UI/UX automation
  • Video timeline analysis
Example:
response = client.chat.completions.create(
    model="phala/qwen3-vl-30b-a3b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Analyze this document and extract the key data"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/document.jpg"
                }
            }
        ]
    }]
)

phala/gemma-3-27b-it

Google's Latest

Multimodal capabilities with strong multilingual support
Specifications:
  • Parameters: 27 billion
  • Context Length: 53,920 tokens (53K)
  • Quantization: FP8
  • Modality: Text + Image -> Text
Description: Gemma 3 introduces:
  • Multimodality support
  • Context windows up to 128K tokens in the base model (served here with a 53K context)
  • 140+ language understanding
  • Improved math and reasoning
  • Structured outputs
  • Function calling
Use Cases:
  • Multilingual applications (140+ languages)
  • Math and reasoning tasks
  • Structured data generation
  • Function calling workflows
  • Chat applications
Example:
response = client.chat.completions.create(
    model="phala/gemma-3-27b-it",
    messages=[{
        "role": "user",
        "content": "Solve this calculus problem step by step"
    }]
)

phala/uncensored-24b

Uncensored

Alignment-free model for unrestricted use cases
Specifications:
  • Parameters: 24 billion
  • Context Length: 32,768 tokens (32K)
  • Quantization: FP8
  • Modality: Text -> Text
Description: Venice Uncensored Dolphin Mistral 24B is a fine-tuned variant of Mistral-Small-24B, designed as an uncensored instruct-tuned LLM:
  • Full user control over alignment and behavior
  • Steerability and transparent behavior
  • No default safety layers
  • Advanced and unrestricted use cases
Use Cases:
  • Creative writing without content filters
  • Research requiring unrestricted outputs
  • Custom alignment experimentation
  • Red-teaming and safety research
Example:
response = client.chat.completions.create(
    model="phala/uncensored-24b",
    messages=[{
        "role": "user",
        "content": "Write a detailed analysis of this controversial topic"
    }]
)

phala/glm-4.7-flash

Fast & Efficient

30B-class model optimized for agentic coding
Specifications:
  • Parameters: ~30B
  • Context Length: 202,752 tokens (202K)
  • Quantization: FP8
  • Modality: Text -> Text
Description: GLM 4.7 Flash balances performance and efficiency with:
  • Optimized agentic coding capabilities
  • Strengthened long-horizon task planning
  • Tool collaboration
  • Leading open-source performance at its size class
Use Cases:
  • Agentic coding workflows
  • Long-horizon task planning
  • Tool-assisted development
  • Cost-effective general purpose
Example:
response = client.chat.completions.create(
    model="phala/glm-4.7-flash",
    messages=[{
        "role": "user",
        "content": "Implement a REST API endpoint with error handling and tests"
    }]
)

phala/qwen-2.5-7b-instruct

Budget-Friendly

Most cost-effective confidential model
Specifications:
  • Parameters: 7 billion
  • Context Length: 32,768 tokens (32K)
  • Quantization: FP8
  • Modality: Text -> Text
Description: Qwen 2.5 7B brings significant improvements:
  • Enhanced coding and mathematics capabilities
  • Better instruction following
  • Improved long text generation (8K+ tokens)
  • Structured data understanding (tables, JSON)
  • Multilingual support (29+ languages)
Use Cases:
  • High-volume applications
  • Multilingual support
  • Simple chatbots
  • Text classification
  • Data extraction
Supported Languages: Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.
Example:
response = client.chat.completions.create(
    model="phala/qwen-2.5-7b-instruct",
    messages=[{
        "role": "user",
        "content": "Extract key information from this invoice (JSON format)"
    }],
    response_format={"type": "json_object"}
)

Feature Comparison

| Feature | GLM 5 | GPT-OSS 120B | GPT-OSS 20B | Qwen3 VL 30B | Gemma 3 27B | Uncensored 24B | GLM 4.7 Flash | Qwen 2.5 7B |
|---|---|---|---|---|---|---|---|---|
| TEE Protected | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Function Calling | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Vision | No | No | No | Yes | Yes | No | No | No |
| Structured Output | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Streaming | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Multilingual | Yes | Yes | Yes | Yes | Yes (140+) | Yes | Yes | Yes (29+) |

Selection Guide

By Quality Requirements

Highest Quality:
  1. phala/glm-5 - Best overall for systems engineering
  2. phala/gpt-oss-120b (117B) - OpenAI architecture
Vision + Language:
  1. phala/qwen3-vl-30b-a3b-instruct (30B) - Vision + text
  2. phala/gemma-3-27b-it (27B) - Multimodal with 140+ languages
Balanced:
  1. phala/glm-4.7-flash (~30B) - Fast with long context (202K)
  2. phala/gemma-3-27b-it (27B) - Good quality, reasonable cost
  3. phala/gpt-oss-20b (21B) - Fast and efficient
Budget:
  1. phala/qwen-2.5-7b-instruct (7B) - Most economical
  2. phala/gpt-oss-20b (21B) - Great value

By Use Case

Complex Reasoning:
  • phala/glm-5 - Systems engineering and agent workflows
  • phala/gpt-oss-120b - OpenAI-style reasoning
Vision Tasks:
  • phala/qwen3-vl-30b-a3b-instruct - Document AI, OCR, visual coding
  • phala/gemma-3-27b-it - General multimodal tasks
Agentic Coding:
  • phala/glm-4.7-flash - Optimized for agentic coding
  • phala/gpt-oss-120b - Best for tool use agents
Long Context:
  • phala/glm-5 (202K) - Longest context
  • phala/glm-4.7-flash (202K) - Fast with long context
Multilingual:
  • phala/gemma-3-27b-it - 140+ languages
  • phala/qwen-2.5-7b-instruct - 29+ languages
High Volume:
  • phala/qwen-2.5-7b-instruct - Lowest cost
  • phala/gpt-oss-20b - Fast inference
Unrestricted:
  • phala/uncensored-24b - No alignment filters
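The guide above condenses naturally into a small lookup helper. The use-case keys below are illustrative names, not API parameters; the model IDs and first choices come straight from the selection guide.

```python
# First-choice model per use case, taken from the selection guide above.
RECOMMENDED = {
    "complex-reasoning": "phala/glm-5",
    "vision": "phala/qwen3-vl-30b-a3b-instruct",
    "agentic-coding": "phala/glm-4.7-flash",
    "long-context": "phala/glm-5",
    "multilingual": "phala/gemma-3-27b-it",
    "high-volume": "phala/qwen-2.5-7b-instruct",
    "unrestricted": "phala/uncensored-24b",
}

def recommend(use_case, default="phala/gpt-oss-120b"):
    """Return the first-choice model for a use case, else a strong default."""
    return RECOMMENDED.get(use_case, default)
```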

Attestation Support

All Phala models provide cryptographic attestation. Models with an appid in their metadata support TEE verification:
# Get attestation for any Phala model
curl "https://api.redpill.ai/v1/attestation/report" \
  -H "Authorization: Bearer YOUR_API_KEY"

Attestation Guide

Learn how to verify TEE execution ->

Pricing Comparison

Phala Models

| Model | Prompt per M | Completion per M |
|---|---|---|
| Qwen 2.5 7B | $0.04 | $0.10 |
| GPT-OSS 20B | $0.04 | $0.15 |
| GLM 4.7 Flash | $0.10 | $0.43 |
| GPT-OSS 120B | $0.10 | $0.49 |
| Gemma 3 27B | $0.11 | $0.40 |
| Qwen3 VL 30B | $0.20 | $0.70 |
| Uncensored 24B | $0.20 | $0.90 |
| GLM 5 | $1.20 | $3.50 |

Tinfoil Models

All Tinfoil models use flat-rate pricing: $2/M tokens for both prompt and completion.
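As a sanity check on the tables, per-request cost is simply tokens multiplied by the per-million rate. A minimal estimator (prices in USD per million tokens, as listed above):

```python
def request_cost(prompt_tokens, completion_tokens, prompt_per_m, completion_per_m):
    """Cost in USD for one request, given per-million-token prices."""
    return (prompt_tokens * prompt_per_m
            + completion_tokens * completion_per_m) / 1_000_000

# Example: 1M prompt + 0.5M completion tokens on a flat-rate $2/M Tinfoil model.
tinfoil_cost = request_cost(1_000_000, 500_000, 2.00, 2.00)  # → 3.0
```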

Near AI Models

| Model | Prompt per M | Completion per M |
|---|---|---|
| Qwen3 30B | $0.15 | $0.45 |
| GLM 4.7 | $0.85 | $3.30 |
| DeepSeek V3.1 | $1.00 | $2.50 |

Chutes Models

| Model | Prompt per M | Completion per M |
|---|---|---|
| DeepSeek V3.2 | $0.27 | $0.40 |
| Kimi K2.5 | $0.60 | $3.00 |

Migration Guide

From Regular Models to Phala

Simply change the model name:
# Before (regular model)
response = client.chat.completions.create(
    model="openai/gpt-5",
    messages=[...]
)

# After (Phala confidential model)
response = client.chat.completions.create(
    model="phala/gpt-oss-120b",  # OpenAI's open-weight model, similar behavior
    messages=[...]  # Same API!
)
No other code changes required!
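Since any model can be routed through Phala's TEE by swapping its vendor prefix for phala/ (as shown in the identification section), the rename can be captured in a one-line helper. This is an illustrative convenience, not part of any SDK:

```python
def to_phala(model_id):
    """Route a model through Phala TEE by swapping the vendor prefix."""
    return "phala/" + model_id.split("/", 1)[-1]
```

Pass the result as the `model` argument; the rest of the request is unchanged.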

FAQs

How can I tell whether a model runs in a TEE?
Check the providers field in the API response. Models with "phala" in their providers array run on Phala’s GPU TEE infrastructure. You can also use the phala/ prefix in the model ID, or call the dedicated /v1/models/phala endpoint.

Which model is closest to OpenAI’s models?
phala/gpt-oss-120b - it uses OpenAI’s architecture and has similar capabilities.

Which model is fastest?
phala/qwen-2.5-7b-instruct is the smallest and fastest for simple tasks. phala/glm-4.7-flash offers the best speed-to-quality ratio.

Which models support vision?
  • phala/qwen3-vl-30b-a3b-instruct - Vision + text (30B)
  • phala/gemma-3-27b-it - Vision + text (27B)

What happened to phala/qwen2.5-vl-72b-instruct?
This model ID is a legacy alias that now routes to phala/qwen3-vl-30b-a3b-instruct. Both IDs work, but we recommend using the qwen3 variant.

Can I fine-tune models in a TEE?
Enterprise customers can fine-tune models in TEE. Contact sales@redpill.ai.

Why FP8 quantization?
FP8 reduces model size and increases inference speed with minimal quality loss (~1%), enabling efficient TEE inference.

Next Steps

Start Using Models

Make your first request

Verify Attestation

Cryptographic proof of TEE

API Reference

Complete API documentation

Pricing Details

Compare model costs