All Confidential AI Models

RedPill offers 17 confidential AI models running entirely in GPU TEE (Trusted Execution Environment) across 4 TEE providers: Phala Network, Tinfoil, Near AI, and Chutes.

Phala Network

8 models with FP8 quantization

Tinfoil

4 models including Kimi K2 Thinking

Near AI

3 models including DeepSeek V3.1

Chutes

2 models including DeepSeek V3.2

Phala TEE Models

Models powered by Phala Network’s GPU TEE infrastructure with FP8 quantization:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| phala/glm-5 | Large | 202K | Text | $1.20 / $3.50 per M |
| phala/gpt-oss-120b | 117B (MoE) | 131K | Text | $0.10 / $0.49 per M |
| phala/qwen3-vl-30b-a3b-instruct | 30B (MoE) | 128K | Vision + Text | $0.20 / $0.70 per M |
| phala/gemma-3-27b-it | 27B | 53K | Vision + Text | $0.11 / $0.40 per M |
| phala/uncensored-24b | 24B | 32K | Text | $0.20 / $0.90 per M |
| phala/gpt-oss-20b | 21B (MoE) | 131K | Text | $0.04 / $0.15 per M |
| phala/glm-4.7-flash | ~30B | 202K | Text | $0.10 / $0.43 per M |
| phala/qwen-2.5-7b-instruct | 7B | 32K | Text | $0.04 / $0.10 per M |
New: GLM-5 and GLM 4.7 Flash are the latest ZhipuAI models now available with full GPU TEE protection. Venice Uncensored 24B offers an alignment-free model for advanced use cases.
The model ID phala/qwen2.5-vl-72b-instruct is a legacy alias that now routes to phala/qwen3-vl-30b-a3b-instruct.

Tinfoil TEE Models

Models powered by Tinfoil’s confidential computing infrastructure:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| deepseek/deepseek-r1-0528 | 685B (MoE) | 163K | Text | $2.00 / $2.00 per M |
| qwen/qwen3-coder-480b-a35b-instruct | 480B (MoE) | 262K | Text | $2.00 / $2.00 per M |
| moonshotai/kimi-k2-thinking | 1T (MoE, 32B active) | 262K | Text | $2.00 / $2.00 per M |
| meta-llama/llama-3.3-70b-instruct | 70B | 131K | Text | $2.00 / $2.00 per M |
New: Kimi K2 Thinking is Moonshot AI’s most advanced open reasoning model, optimized for persistent step-by-step thought and dynamic tool invocation across hundreds of turns. Tinfoil models use flat-rate pricing at $2/M tokens.

Near AI TEE Models

Models powered by Near AI’s decentralized TEE infrastructure:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| deepseek/deepseek-chat-v3.1 | 671B (MoE) | 163K | Text | $1.00 / $2.50 per M |
| qwen/qwen3-30b-a3b-instruct-2507 | 30B (MoE) | 262K | Text | $0.15 / $0.45 per M |
| z-ai/glm-4.7 | 130B | 131K | Text | $0.85 / $3.30 per M |
GLM-4.7 is a ZhipuAI flagship model with enhanced programming capabilities and more stable multi-step reasoning. DeepSeek V3.1 supports both thinking and non-thinking modes.

Chutes TEE Models

Models powered by Chutes’ confidential computing infrastructure:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| deepseek/deepseek-v3.2 | 685B (MoE) | 163K | Text | $0.27 / $0.40 per M |
| moonshotai/kimi-k2.5 | Large (MoE) | 262K | Text + Image | $0.60 / $3.00 per M |
New: Chutes offers competitive pricing on DeepSeek V3.2 and Kimi K2.5. Kimi K2.5 is a native multimodal model with state-of-the-art visual coding and agentic capabilities.

Identifying TEE Models

TEE models can be identified in three ways:

1. Use the dedicated endpoint

# Get all models available through Phala
curl https://api.redpill.ai/v1/models/phala \
  -H "Authorization: Bearer YOUR_API_KEY"

2. Check the providers field

Every model in the API response includes a providers array. Filter for TEE provider names:
curl https://api.redpill.ai/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY" | \
  jq '.data[] | select(.providers[] | test("phala|tinfoil|near-ai|chutes")) | {id, providers}'
TEE providers: phala, tinfoil, near-ai, chutes
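The same filter can be written in Python with only the standard library. This is a sketch: it assumes the `/v1/models` response shape described above, where each entry carries an `id` and a `providers` array.

```python
import json
import urllib.request

TEE_PROVIDERS = {"phala", "tinfoil", "near-ai", "chutes"}

def tee_models(models):
    """Keep only models whose providers list includes a TEE provider."""
    return [
        {"id": m["id"], "providers": m["providers"]}
        for m in models
        if TEE_PROVIDERS & set(m.get("providers", []))
    ]

def fetch_tee_models(api_key):
    """Fetch /v1/models and filter it down to TEE-backed entries."""
    req = urllib.request.Request(
        "https://api.redpill.ai/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return tee_models(json.load(resp)["data"])
```

`tee_models` is pure, so it can also post-process a response you already have on hand.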

3. Use the phala/ model ID prefix

Any model can be accessed via the phala/ prefix to explicitly route through Phala’s TEE:
# These are equivalent - both route through Phala TEE
response = client.chat.completions.create(model="openai/gpt-oss-120b", ...)
response = client.chat.completions.create(model="phala/gpt-oss-120b", ...)

Model Details

phala/glm-5

Best Overall Quality

Flagship model for complex systems engineering and agent workflows
Specifications:
  • Context Length: 202,752 tokens (202K)
  • Quantization: FP8
  • Modality: Text -> Text
Description: GLM-5 is an open-source foundation model built for complex systems engineering and long-horizon agent workflows. It delivers:
  • Production-grade productivity for large-scale programming tasks
  • Performance aligned to top closed-source models
  • Expert-level system design capabilities
Use Cases:
  • Complex systems engineering
  • Long-horizon agent workflows
  • Large-scale programming tasks
  • Advanced code generation
Example:
response = client.chat.completions.create(
    model="phala/glm-5",
    messages=[{
        "role": "user",
        "content": "Design a distributed system architecture for real-time event processing"
    }]
)

phala/gpt-oss-120b

OpenAI Architecture

OpenAI’s open-weight model with familiar behavior
Specifications:
  • Parameters: 117 billion (MoE, 5.1B active)
  • Context Length: 131,072 tokens
  • Quantization: FP8
  • Modality: Text -> Text
Description: GPT-OSS-120B is OpenAI’s open-weight model designed for high-reasoning and agentic use cases, optimized to run on a single H100 GPU with:
  • Configurable reasoning depth
  • Full chain-of-thought access
  • Native function calling
  • Structured output generation
Use Cases:
  • AI agents and automation
  • Complex task planning
  • Tool use and API integration
  • Production workloads requiring reasoning
Example:
response = client.chat.completions.create(
    model="phala/gpt-oss-120b",
    messages=[{
        "role": "user",
        "content": "Create a step-by-step plan to migrate our infrastructure to TEE"
    }],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_infrastructure_status",
            "description": "Get current infrastructure state"
        }
    }]
)

phala/gpt-oss-20b

Efficient & Fast

Smaller model for low-latency applications
Specifications:
  • Parameters: 21 billion (MoE, 3.6B active)
  • Context Length: 131,072 tokens
  • Quantization: FP8
  • Modality: Text -> Text
Description: GPT-OSS-20B is optimized for lower-latency inference and consumer/single-GPU deployment. Features:
  • OpenAI Harmony response format
  • Reasoning level configuration
  • Function calling and tool use
  • Structured outputs
  • Apache 2.0 license
Use Cases:
  • Real-time chatbots
  • Edge deployment
  • Cost-sensitive applications
  • High-throughput workloads
Example:
response = client.chat.completions.create(
    model="phala/gpt-oss-20b",
    messages=[{
        "role": "user",
        "content": "Summarize this customer support ticket"
    }],
    max_tokens=150
)

phala/qwen3-vl-30b-a3b-instruct

Vision + Language

Multimodal model for image and video understanding
Specifications:
  • Parameters: 30 billion (MoE, 3B active)
  • Context Length: 128,000 tokens
  • Quantization: FP8
  • Modality: Text + Image -> Text
Description: Qwen3-VL-30B unifies strong text generation with visual understanding for images and videos:
  • Real-world and synthetic object perception
  • 2D/3D spatial grounding
  • Long-form visual comprehension
  • GUI automation and visual coding
  • Document AI and OCR
Use Cases:
  • Document OCR and understanding
  • Chart and graph analysis
  • Visual quality inspection
  • UI/UX automation
  • Video timeline analysis
Example:
response = client.chat.completions.create(
    model="phala/qwen3-vl-30b-a3b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Analyze this document and extract the key data"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/document.jpg"
                }
            }
        ]
    }]
)

phala/gemma-3-27b-it

Google's Latest

Multimodal capabilities with strong multilingual support
Specifications:
  • Parameters: 27 billion
  • Context Length: 53,920 tokens (53K)
  • Quantization: FP8
  • Modality: Text + Image -> Text
Description: Gemma 3 introduces:
  • Multimodality support
  • Context windows up to 128K tokens in the base model (served here with a 53K context)
  • 140+ language understanding
  • Improved math and reasoning
  • Structured outputs
  • Function calling
Use Cases:
  • Multilingual applications (140+ languages)
  • Math and reasoning tasks
  • Structured data generation
  • Function calling workflows
  • Chat applications
Example:
response = client.chat.completions.create(
    model="phala/gemma-3-27b-it",
    messages=[{
        "role": "user",
        "content": "Solve this calculus problem step by step"
    }]
)

phala/uncensored-24b

Uncensored

Alignment-free model for unrestricted use cases
Specifications:
  • Parameters: 24 billion
  • Context Length: 32,768 tokens (32K)
  • Quantization: FP8
  • Modality: Text -> Text
Description: Venice Uncensored Dolphin Mistral 24B is a fine-tuned variant of Mistral-Small-24B, designed as an uncensored instruct-tuned LLM:
  • Full user control over alignment and behavior
  • Steerability and transparent behavior
  • No default safety layers
  • Advanced and unrestricted use cases
Use Cases:
  • Creative writing without content filters
  • Research requiring unrestricted outputs
  • Custom alignment experimentation
  • Red-teaming and safety research
Example:
response = client.chat.completions.create(
    model="phala/uncensored-24b",
    messages=[{
        "role": "user",
        "content": "Write a detailed analysis of this controversial topic"
    }]
)

phala/glm-4.7-flash

Fast & Efficient

30B-class model optimized for agentic coding
Specifications:
  • Parameters: ~30B
  • Context Length: 202,752 tokens (202K)
  • Quantization: FP8
  • Modality: Text -> Text
Description: GLM 4.7 Flash balances performance and efficiency with:
  • Optimized agentic coding capabilities
  • Strengthened long-horizon task planning
  • Tool collaboration
  • Leading open-source performance at its size class
Use Cases:
  • Agentic coding workflows
  • Long-horizon task planning
  • Tool-assisted development
  • Cost-effective general purpose
Example:
response = client.chat.completions.create(
    model="phala/glm-4.7-flash",
    messages=[{
        "role": "user",
        "content": "Implement a REST API endpoint with error handling and tests"
    }]
)

phala/qwen-2.5-7b-instruct

Budget-Friendly

Most cost-effective confidential model
Specifications:
  • Parameters: 7 billion
  • Context Length: 32,768 tokens (32K)
  • Quantization: FP8
  • Modality: Text -> Text
Description: Qwen 2.5 7B brings significant improvements:
  • Enhanced coding and mathematics capabilities
  • Better instruction following
  • Improved long text generation (8K+ tokens)
  • Structured data understanding (tables, JSON)
  • Multilingual support (29+ languages)
Use Cases:
  • High-volume applications
  • Multilingual support
  • Simple chatbots
  • Text classification
  • Data extraction
Supported Languages: Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.
Example:
response = client.chat.completions.create(
    model="phala/qwen-2.5-7b-instruct",
    messages=[{
        "role": "user",
        "content": "Extract key information from this invoice (JSON format)"
    }],
    response_format={"type": "json_object"}
)

Feature Comparison

| Feature | GLM 5 | GPT-OSS 120B | GPT-OSS 20B | Qwen3 VL 30B | Gemma 3 27B | Uncensored 24B | GLM 4.7 Flash | Qwen 2.5 7B |
|---|---|---|---|---|---|---|---|---|
| TEE Protected | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Function Calling | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Vision | No | No | No | Yes | Yes | No | No | No |
| Structured Output | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Streaming | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Multilingual | Yes | Yes | Yes | Yes | Yes (140+) | Yes | Yes | Yes (29+) |

Selection Guide

By Quality Requirements

Highest Quality:
  1. phala/glm-5 - Best overall for systems engineering
  2. phala/gpt-oss-120b (117B) - OpenAI architecture
Vision + Language:
  1. phala/qwen3-vl-30b-a3b-instruct (30B) - Vision + text
  2. phala/gemma-3-27b-it (27B) - Multimodal with 140+ languages
Balanced:
  1. phala/glm-4.7-flash (~30B) - Fast with long context (202K)
  2. phala/gemma-3-27b-it (27B) - Good quality, reasonable cost
  3. phala/gpt-oss-20b (21B) - Fast and efficient
Budget:
  1. phala/qwen-2.5-7b-instruct (7B) - Most economical
  2. phala/gpt-oss-20b (21B) - Great value

By Use Case

Complex Reasoning:
  • phala/glm-5 - Systems engineering and agent workflows
  • phala/gpt-oss-120b - OpenAI-style reasoning
Vision Tasks:
  • phala/qwen3-vl-30b-a3b-instruct - Document AI, OCR, visual coding
  • phala/gemma-3-27b-it - General multimodal tasks
Agentic Coding:
  • phala/glm-4.7-flash - Optimized for agentic coding
  • phala/gpt-oss-120b - Best for tool use agents
Long Context:
  • phala/glm-5 (202K) - Longest context
  • phala/glm-4.7-flash (202K) - Fast with long context
Multilingual:
  • phala/gemma-3-27b-it - 140+ languages
  • phala/qwen-2.5-7b-instruct - 29+ languages
High Volume:
  • phala/qwen-2.5-7b-instruct - Lowest cost
  • phala/gpt-oss-20b - Fast inference
Unrestricted:
  • phala/uncensored-24b - No alignment filters
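The guide above condenses naturally into a small lookup helper. The use-case keys below are illustrative names, not API parameters; the model IDs and first choices come straight from the selection guide.

```python
# First-choice model per use case, taken from the selection guide above.
RECOMMENDED = {
    "complex-reasoning": "phala/glm-5",
    "vision": "phala/qwen3-vl-30b-a3b-instruct",
    "agentic-coding": "phala/glm-4.7-flash",
    "long-context": "phala/glm-5",
    "multilingual": "phala/gemma-3-27b-it",
    "high-volume": "phala/qwen-2.5-7b-instruct",
    "unrestricted": "phala/uncensored-24b",
}

def recommend(use_case, default="phala/gpt-oss-120b"):
    """Return the first-choice model for a use case, else a strong default."""
    return RECOMMENDED.get(use_case, default)
```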

Attestation Support

All Phala models provide cryptographic attestation. Models with an appid in their metadata support TEE verification:
# Get attestation for any Phala model
curl "https://api.redpill.ai/v1/attestation/report" \
  -H "Authorization: Bearer YOUR_API_KEY"

Attestation Guide

Learn how to verify TEE execution ->

Pricing Comparison

Phala Models

| Model | Prompt per M | Completion per M |
|---|---|---|
| Qwen 2.5 7B | $0.04 | $0.10 |
| GPT-OSS 20B | $0.04 | $0.15 |
| GLM 4.7 Flash | $0.10 | $0.43 |
| GPT-OSS 120B | $0.10 | $0.49 |
| Gemma 3 27B | $0.11 | $0.40 |
| Qwen3 VL 30B | $0.20 | $0.70 |
| Uncensored 24B | $0.20 | $0.90 |
| GLM 5 | $1.20 | $3.50 |

Tinfoil Models

All Tinfoil models use flat-rate pricing: $2/M tokens for both prompt and completion.
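As a sanity check on the tables, per-request cost is simply tokens multiplied by the per-million rate. A minimal estimator (prices in USD per million tokens, as listed above):

```python
def request_cost(prompt_tokens, completion_tokens, prompt_per_m, completion_per_m):
    """Cost in USD for one request, given per-million-token prices."""
    return (prompt_tokens * prompt_per_m
            + completion_tokens * completion_per_m) / 1_000_000

# Example: 1M prompt + 0.5M completion tokens on a flat-rate $2/M Tinfoil model.
tinfoil_cost = request_cost(1_000_000, 500_000, 2.00, 2.00)  # → 3.0
```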

Near AI Models

| Model | Prompt per M | Completion per M |
|---|---|---|
| Qwen3 30B | $0.15 | $0.45 |
| GLM 4.7 | $0.85 | $3.30 |
| DeepSeek V3.1 | $1.00 | $2.50 |

Chutes Models

| Model | Prompt per M | Completion per M |
|---|---|---|
| DeepSeek V3.2 | $0.27 | $0.40 |
| Kimi K2.5 | $0.60 | $3.00 |

Migration Guide

From Regular Models to Phala

Simply change the model name:
# Before (regular model)
response = client.chat.completions.create(
    model="openai/gpt-5",
    messages=[...]
)

# After (Phala confidential model)
response = client.chat.completions.create(
    model="phala/gpt-oss-120b",  # OpenAI's open-weight model, similar behavior
    messages=[...]  # Same API!
)
No other code changes required!
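Since any model can be routed through Phala's TEE by swapping its vendor prefix for phala/ (as shown in the identification section), the rename can be captured in a one-line helper. This is an illustrative convenience, not part of any SDK:

```python
def to_phala(model_id):
    """Route a model through Phala TEE by swapping the vendor prefix."""
    return "phala/" + model_id.split("/", 1)[-1]
```

Pass the result as the `model` argument; the rest of the request is unchanged.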

FAQs

How can I tell whether a model runs in a TEE?
Check the providers field in the API response. Models with "phala" in their providers array run on Phala’s GPU TEE infrastructure. You can also use the phala/ prefix in the model ID, or call the dedicated /v1/models/phala endpoint.

Which model is closest to OpenAI’s models?
phala/gpt-oss-120b - it uses OpenAI’s architecture and has similar capabilities.

Which model is fastest?
phala/qwen-2.5-7b-instruct is the smallest and fastest for simple tasks. phala/glm-4.7-flash offers the best speed-to-quality ratio.

Which models support vision?
  • phala/qwen3-vl-30b-a3b-instruct - Vision + text (30B)
  • phala/gemma-3-27b-it - Vision + text (27B)

What happened to phala/qwen2.5-vl-72b-instruct?
This model ID is a legacy alias that now routes to phala/qwen3-vl-30b-a3b-instruct. Both IDs work, but we recommend using the qwen3 variant.

Can I fine-tune models in a TEE?
Enterprise customers can fine-tune models in TEE. Contact sales@redpill.ai.

Why FP8 quantization?
FP8 reduces model size and increases inference speed with minimal quality loss (~1%), enabling efficient TEE inference.

Next Steps

Start Using Models

Make your first request

Verify Attestation

Cryptographic proof of TEE

API Reference

Complete API documentation

Pricing Details

Compare model costs