All Confidential AI Models

RedPill offers GPU TEE models that run entirely inside Trusted Execution Environments (TEEs), served by four TEE providers: Chutes, Near AI, Phala Network, and Tinfoil. Some legacy aliases are also accepted for compatibility; call GET /v1/models for the live catalog.

Chutes

DeepSeek, MiniMax, Kimi, GLM, Qwen, and MiMo

Near AI

DeepSeek V3.1, GLM, GPT-OSS, and Qwen

Phala Network

Qwen, Gemma, GPT-OSS, GLM, and embeddings

Tinfoil

DeepSeek R1, Qwen Coder, Kimi Thinking, and Llama

Chutes TEE Models

Models powered by Chutes’ confidential computing infrastructure:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
| --- | --- | --- | --- | --- |
| z-ai/glm-5.1 | Large | 203K | Text | $1.21 / $4.20 per M |
| moonshotai/kimi-k2.6 | Large (MoE) | 262K | Text + Image | $1.09 / $4.60 per M |
| qwen/qwen3.5-397b-a17b | 397B (MoE) | 262K | Text | $0.55 / $3.50 per M |
| qwen/qwen3-coder-next | Large | 262K | Text | $0.18 / $1.20 per M |
| minimax/minimax-m2.5 | Large | 197K | Text | $0.20 / $1.38 per M |
| xiaomi/mimo-v2-flash | Large | 262K | Text | $0.10 / $0.30 per M |
| deepseek/deepseek-v3.2 | 685B (MoE) | 164K | Text | $0.32 / $0.48 per M |
| moonshotai/kimi-k2.5 | Large (MoE) | 262K | Text + Image | $0.60 / $3.00 per M |
New: Chutes now includes GLM 5.1, Kimi K2.6, MiniMax M2.5, Qwen3 Coder Next, Qwen3.5 397B, and MiMo V2 Flash.

Near AI TEE Models

Models powered by Near AI’s decentralized TEE infrastructure:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
| --- | --- | --- | --- | --- |
| z-ai/glm-5 | Large | 203K | Text | $1.20 / $3.50 per M |
| deepseek/deepseek-chat-v3.1 | 671B (MoE) | 164K | Text | $1.05 / $3.10 per M |
| openai/gpt-oss-120b | 117B (MoE) | 131K | Text | $0.10 / $0.49 per M |
| qwen/qwen3-30b-a3b-instruct-2507 | 30B (MoE) | 262K | Text | $0.15 / $0.55 per M |
| z-ai/glm-4.7 | 130B | 131K | Text | $0.85 / $3.30 per M |
GLM-5 and GPT-OSS-120B are high-capacity models now running through Near AI’s TEE infrastructure. DeepSeek V3.1 supports both thinking and non-thinking modes.

Phala TEE Models

Models powered by Phala Network’s GPU TEE infrastructure with FP8 quantization:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
| --- | --- | --- | --- | --- |
| phala/qwen3.5-27b | 27B | 262K | Text | $0.30 / $2.40 per M |
| phala/qwen3-vl-30b-a3b-instruct | 30B (MoE) | 128K | Vision + Text | $0.20 / $0.70 per M |
| qwen/qwen3-embedding-8b | 8B | 32K | Embeddings | $0.01 / $0 per M |
| phala/gemma-3-27b-it | 27B | 53K | Vision + Text | $0.11 / $0.40 per M |
| phala/glm-4.7-flash | ~30B | 202K | Text | $0.10 / $0.43 per M |
| phala/gpt-oss-20b | 21B (MoE) | 131K | Text | $0.04 / $0.15 per M |
| phala/qwen-2.5-7b-instruct | 7B | 32K | Text | $0.04 / $0.10 per M |
| phala/qwen2.5-vl-72b-instruct | 72B | 128K | Vision + Text | $0.40 / $1.20 per M |
| phala/uncensored-24b | 24B | 32K | Text | $0.20 / $0.90 per M |
| sentence-transformers/all-minilm-l6-v2 | 22M | 512 | Embeddings | $0.005 / $0 per M |
New: Qwen3.5-27B and confidential embedding models are now available through Phala. Venice Uncensored 24B offers an alignment-free model for advanced use cases.
The model ID phala/qwen2.5-vl-72b-instruct is a legacy alias that now routes to phala/qwen3-vl-30b-a3b-instruct.

Tinfoil TEE Models

Models powered by Tinfoil’s confidential computing infrastructure:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
| --- | --- | --- | --- | --- |
| qwen/qwen3-coder-480b-a35b-instruct | 480B (MoE) | 262K | Text | $2 / $2 per M |
| moonshotai/kimi-k2-thinking | 1T (MoE, 32B active) | 262K | Text | $2 / $2 per M |
| deepseek/deepseek-r1-0528 | 685B (MoE) | 163K | Text | $2 / $2 per M |
| meta-llama/llama-3.3-70b-instruct | 70B | 131K | Text | $2 / $2 per M |
New: Kimi K2 Thinking is Moonshot AI’s most advanced open reasoning model, optimized for persistent step-by-step thought and dynamic tool invocation across hundreds of turns. Tinfoil models use flat-rate pricing at $2/M tokens.

Identifying TEE Models

TEE models can be identified in three ways:

1. Use the dedicated endpoint

# Get all models available through Phala
curl https://api.redpill.ai/v1/models/phala \
  -H "Authorization: Bearer YOUR_API_KEY"

2. Check the providers field

Every model in the API response includes a providers array. Filter for TEE provider names:
curl https://api.redpill.ai/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY" | \
  jq '.data[] | select(any(.providers[]?; test("phala|tinfoil|near-ai|chutes"))) | {id, providers}'
TEE providers: phala, tinfoil, near-ai, chutes
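For scripts that already hold the parsed /v1/models response, the same filter can be sketched in Python. The helper name and the sample payload are illustrative assumptions; only the data, id, and providers fields and the four provider names come from this page. Unlike the jq filter, which matches substrings, this sketch matches provider names exactly.

```python
# Sketch: select TEE-backed models from a parsed /v1/models response.
TEE_PROVIDERS = {"phala", "tinfoil", "near-ai", "chutes"}


def filter_tee_models(models_response: dict) -> list:
    """Return {id, providers} for every model served by at least one TEE provider."""
    selected = []
    for model in models_response.get("data", []):
        providers = model.get("providers") or []  # field may be absent or null
        if TEE_PROVIDERS.intersection(providers):
            selected.append({"id": model["id"], "providers": providers})
    return selected


# Hypothetical payload shaped like the fields used in the jq filter above.
sample = {
    "data": [
        {"id": "openai/gpt-oss-120b", "providers": ["near-ai"]},
        {"id": "example/non-tee-model", "providers": ["other"]},
        {"id": "example/no-providers"},
    ]
}
tee_models = filter_tee_models(sample)
print(tee_models)
```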

3. Check model aliases

Some confidential models expose compatibility aliases in addition to their canonical model ID. Prefer the canonical ID from /v1/models in new integrations:
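# Assumes the OpenAI Python SDK, pointed at the same base URL as the curl examples
from openai import OpenAI
client = OpenAI(base_url="https://api.redpill.ai/v1", api_key="YOUR_API_KEY")

# Canonical model ID from /v1/models
response = client.chat.completions.create(model="openai/gpt-oss-120b", ...)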

# Legacy aliases may continue to work for compatibility
response = client.chat.completions.create(model="phala/gpt-oss-120b", ...)
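One way to follow the "prefer the canonical ID" advice in existing code is to normalize model IDs before sending a request. The sketch below is a hypothetical helper, not part of the RedPill API; the mapping is assembled from the aliases mentioned on this page and is likely incomplete, so treat /v1/models as the source of truth.

```python
# Sketch: map legacy aliases to canonical model IDs before making a request.
# Assumption: this mapping only covers aliases named on this page.
LEGACY_ALIASES = {
    "phala/gpt-oss-120b": "openai/gpt-oss-120b",
    "phala/qwen2.5-vl-72b-instruct": "phala/qwen3-vl-30b-a3b-instruct",
}


def canonical_model_id(model_id: str) -> str:
    """Return the canonical ID for a known alias, or the ID unchanged."""
    return LEGACY_ALIASES.get(model_id, model_id)


print(canonical_model_id("phala/gpt-oss-120b"))
print(canonical_model_id("phala/gpt-oss-20b"))
```

IDs that are not known aliases pass through unchanged, so the helper is safe to apply to every request.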

Model Details

phala/glm-5

Best Overall Quality

Flagship model for complex systems engineering and agent workflows
Specifications:
  • Context Length: 202,752 tokens (202K)
  • Quantization: FP8
  • Modality: Text -> Text
Description: GLM-5 is an open-source foundation model built for complex systems engineering and long-horizon agent workflows. It delivers:
  • Production-grade productivity for large-scale programming tasks
  • Performance aligned to top closed-source models
  • Expert-level system design capabilities
Use Cases:
  • Complex systems engineering
  • Long-horizon agent workflows
  • Large-scale programming tasks
  • Advanced code generation
Example:
response = client.chat.completions.create(
    model="phala/glm-5",
    messages=[{
        "role": "user",
        "content": "Design a distributed system architecture for real-time event processing"
    }]
)

phala/gpt-oss-120b

OpenAI Architecture

OpenAI’s open-weight model with familiar behavior
Specifications:
  • Parameters: 117 billion (MoE, 5.1B active)
  • Context Length: 131,072 tokens
  • Quantization: FP8
  • Modality: Text -> Text
Description: GPT-OSS-120B is OpenAI’s open-weight model designed for high-reasoning and agentic use cases. It is optimized to run on a single H100 GPU and offers:
  • Configurable reasoning depth
  • Full chain-of-thought access
  • Native function calling
  • Structured output generation
Use Cases:
  • AI agents and automation
  • Complex task planning
  • Tool use and API integration
  • Production workloads requiring reasoning
Example:
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # canonical ID; phala/gpt-oss-120b is a legacy alias
    messages=[{
        "role": "user",
        "content": "Create a step-by-step plan to migrate our infrastructure to TEE"
    }],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_infrastructure_status",
            "description": "Get current infrastructure state"
        }
    }]
)

phala/gpt-oss-20b

Efficient & Fast

Smaller model for low-latency applications
Specifications:
  • Parameters: 21 billion (MoE, 3.6B active)
  • Context Length: 131,072 tokens
  • Quantization: FP8
  • Modality: Text -> Text
Description: GPT-OSS-20B is optimized for lower-latency inference and consumer/single-GPU deployment. Features:
  • OpenAI Harmony response format
  • Reasoning level configuration
  • Function calling and tool use
  • Structured outputs
  • Apache 2.0 license
Use Cases:
  • Real-time chatbots
  • Edge deployment
  • Cost-sensitive applications
  • High-throughput workloads
Example:
response = client.chat.completions.create(
    model="phala/gpt-oss-20b",
    messages=[{
        "role": "user",
        "content": "Summarize this customer support ticket"
    }],
    max_tokens=150
)

phala/qwen3-vl-30b-a3b-instruct

Vision + Language

Multimodal model for image and video understanding
Specifications:
  • Parameters: 30 billion (MoE, 3B active)
  • Context Length: 128,000 tokens
  • Quantization: FP8
  • Modality: Text + Image -> Text
Description: Qwen3-VL-30B unifies strong text generation with visual understanding for images and videos:
  • Real-world and synthetic object perception
  • 2D/3D spatial grounding
  • Long-form visual comprehension
  • GUI automation and visual coding
  • Document AI and OCR
Use Cases:
  • Document OCR and understanding
  • Chart and graph analysis
  • Visual quality inspection
  • UI/UX automation
  • Video timeline analysis
Example:
response = client.chat.completions.create(
    model="phala/qwen3-vl-30b-a3b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Analyze this document and extract the key data"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/document.jpg"
                }
            }
        ]
    }]
)

phala/gemma-3-27b-it

Google's Latest

Multimodal capabilities with strong multilingual support
Specifications:
  • Parameters: 27 billion
  • Context Length: 53,920 tokens (53K)
  • Quantization: FP8
  • Modality: Text + Image -> Text
Description: Gemma 3 introduces:
  • Multimodality support
  • Context windows up to 128K tokens in the base model (this deployment is limited to 53,920 tokens)
  • 140+ language understanding
  • Improved math and reasoning
  • Structured outputs
  • Function calling
Use Cases:
  • Multilingual applications (140+ languages)
  • Math and reasoning tasks
  • Structured data generation
  • Function calling workflows
  • Chat applications
Example:
response = client.chat.completions.create(
    model="phala/gemma-3-27b-it",
    messages=[{
        "role": "user",
        "content": "Solve this calculus problem step by step"
    }]
)

phala/uncensored-24b

Uncensored

Alignment-free model for unrestricted use cases
Specifications:
  • Parameters: 24 billion
  • Context Length: 32,768 tokens (32K)
  • Quantization: FP8
  • Modality: Text -> Text
Description: Venice Uncensored Dolphin Mistral 24B is a fine-tuned variant of Mistral-Small-24B, designed as an uncensored instruct-tuned LLM:
  • Full user control over alignment and behavior
  • Steerability and transparent behavior
  • No default safety layers
  • Advanced and unrestricted use cases
Use Cases:
  • Creative writing without content filters
  • Research requiring unrestricted outputs
  • Custom alignment experimentation
  • Red-teaming and safety research
Example:
response = client.chat.completions.create(
    model="phala/uncensored-24b",
    messages=[{
        "role": "user",
        "content": "Write a detailed analysis of this controversial topic"
    }]
)

phala/glm-4.7-flash

Fast & Efficient

30B-class model optimized for agentic coding
Specifications:
  • Parameters: ~30B
  • Context Length: 202,752 tokens (202K)
  • Quantization: FP8
  • Modality: Text -> Text
Description: GLM 4.7 Flash balances performance and efficiency with:
  • Optimized agentic coding capabilities
  • Strengthened long-horizon task planning
  • Tool collaboration
  • Leading open-source performance at its size class
Use Cases:
  • Agentic coding workflows
  • Long-horizon task planning
  • Tool-assisted development
  • Cost-effective general purpose
Example:
response = client.chat.completions.create(
    model="phala/glm-4.7-flash",
    messages=[{
        "role": "user",
        "content": "Implement a REST API endpoint with error handling and tests"
    }]
)

phala/qwen-2.5-7b-instruct

Budget-Friendly

Most cost-effective confidential model
Specifications:
  • Parameters: 7 billion
  • Context Length: 32,768 tokens (32K)
  • Quantization: FP8
  • Modality: Text -> Text
Description: Qwen 2.5 7B brings significant improvements:
  • Enhanced coding and mathematics capabilities
  • Better instruction following
  • Improved long text generation (8K+ tokens)
  • Structured data understanding (tables, JSON)
  • Multilingual support (29+ languages)
Use Cases:
  • High-volume applications
  • Multilingual support
  • Simple chatbots
  • Text classification
  • Data extraction
Supported Languages: Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.
Example:
response = client.chat.completions.create(
    model="phala/qwen-2.5-7b-instruct",
    messages=[{
        "role": "user",
        "content": "Extract key information from this invoice (JSON format)"
    }],
    response_format={"type": "json_object"}
)

Feature Comparison

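| Feature | GLM 5 | GPT-OSS 120B | GPT-OSS 20B | Qwen3 VL 30B | Gemma 3 27B | Uncensored 24B | GLM 4.7 Flash | Qwen 2.5 7B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TEE Protected | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Function Calling | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Vision | No | No | No | Yes | Yes | No | No | No |
| Structured Output | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Streaming | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Multilingual | Yes | Yes | Yes | Yes | Yes (140+) | Yes | Yes | Yes (29+) |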

Selection Guide

By Quality Requirements

Highest Quality:
  1. z-ai/glm-5 - Best overall for systems engineering
  2. openai/gpt-oss-120b (117B) - OpenAI architecture
Vision + Language:
  1. phala/qwen3-vl-30b-a3b-instruct (30B) - Vision + text
  2. phala/gemma-3-27b-it (27B) - Multimodal with 140+ languages
Balanced:
  1. phala/glm-4.7-flash (~30B) - Fast with long context (202K)
  2. phala/gemma-3-27b-it (27B) - Good quality, reasonable cost
  3. phala/gpt-oss-20b (21B) - Fast and efficient
Budget:
  1. phala/qwen-2.5-7b-instruct (7B) - Most economical
  2. phala/gpt-oss-20b (21B) - Great value

By Use Case

Complex Reasoning:
  • z-ai/glm-5 - Systems engineering and agent workflows
  • openai/gpt-oss-120b - OpenAI-style reasoning
Vision Tasks:
  • phala/qwen3-vl-30b-a3b-instruct - Document AI, OCR, visual coding
  • phala/gemma-3-27b-it - General multimodal tasks
Agentic Coding:
  • phala/glm-4.7-flash - Optimized for agentic coding
  • openai/gpt-oss-120b - Best for tool use agents
Long Context:
  • z-ai/glm-5 (203K) - Long systems engineering context
  • phala/glm-4.7-flash (202K) - Fast with long context
Multilingual:
  • phala/gemma-3-27b-it - 140+ languages
  • phala/qwen-2.5-7b-instruct - 29+ languages
High Volume:
  • phala/qwen-2.5-7b-instruct - Lowest cost
  • phala/gpt-oss-20b - Fast inference
Unrestricted:
  • phala/uncensored-24b - No alignment filters
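The context and budget criteria above can be combined mechanically. The following is a minimal sketch, assuming you want the cheapest Phala chat model whose context window fits a given request. The function name is hypothetical; the context lengths come from the Model Details section and the prices from the Phala table, so they may drift from the live catalog.

```python
# Sketch: pick the cheapest Phala chat model whose context window is large enough.
# Values copied from this page: (context tokens, prompt $/M, completion $/M).
PHALA_CHAT_MODELS = {
    "phala/gpt-oss-20b": (131_072, 0.04, 0.15),
    "phala/qwen3-vl-30b-a3b-instruct": (128_000, 0.20, 0.70),
    "phala/gemma-3-27b-it": (53_920, 0.11, 0.40),
    "phala/uncensored-24b": (32_768, 0.20, 0.90),
    "phala/glm-4.7-flash": (202_752, 0.10, 0.43),
    "phala/qwen-2.5-7b-instruct": (32_768, 0.04, 0.10),
}


def cheapest_model_for_context(needed_tokens: int):
    """Return the model ID with the lowest (prompt, completion) price that fits, or None."""
    candidates = [
        (prompt_price, completion_price, model_id)
        for model_id, (context, prompt_price, completion_price) in PHALA_CHAT_MODELS.items()
        if context >= needed_tokens
    ]
    if not candidates:
        return None
    return min(candidates)[2]


print(cheapest_model_for_context(100_000))
```

Price is only one axis; the quality and capability notes in the lists above still apply when several models fit.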

Attestation Support

Models with an appid in their metadata support RedPill’s TEE verification endpoints. Use /v1/models to check support before calling attestation in production:
curl https://api.redpill.ai/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY" | \
  jq '.data[] | select(.metadata.appid != null) | {id, providers, appid: .metadata.appid}'
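The same check can be sketched in Python for code that has already parsed the /v1/models response. The helper name and the sample appid value are hypothetical; the metadata.appid field is the one used by the jq filter above.

```python
# Sketch: list models whose metadata carries an appid (attestation-capable).
def models_with_attestation(models_response: dict) -> list:
    """Return {id, providers, appid} for every model with a non-null metadata.appid."""
    supported = []
    for model in models_response.get("data", []):
        metadata = model.get("metadata") or {}  # metadata may be absent or null
        appid = metadata.get("appid")
        if appid is not None:
            supported.append(
                {"id": model["id"], "providers": model.get("providers"), "appid": appid}
            )
    return supported


# Hypothetical payload; "example-appid" is a placeholder, not a real value.
sample = {
    "data": [
        {"id": "phala/gpt-oss-20b", "providers": ["phala"], "metadata": {"appid": "example-appid"}},
        {"id": "example/no-appid", "providers": ["other"], "metadata": {}},
        {"id": "example/no-metadata", "providers": ["other"]},
    ]
}
attestable = models_with_attestation(sample)
print(attestable)
```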

Attestation Guide

Learn how to verify TEE execution ->

Pricing Comparison

Chutes Models

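| Model | Prompt per M | Completion per M |
| --- | --- | --- |
| GLM 5.1 | $1.21 | $4.20 |
| Kimi K2.6 | $1.09 | $4.60 |
| Qwen3.5 397B | $0.55 | $3.50 |
| Qwen3 Coder Next | $0.18 | $1.20 |
| MiniMax M2.5 | $0.20 | $1.38 |
| MiMo V2 Flash | $0.10 | $0.30 |
| DeepSeek V3.2 | $0.32 | $0.48 |
| Kimi K2.5 | $0.60 | $3.00 |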

Near AI Models

| Model | Prompt per M | Completion per M |
| --- | --- | --- |
| GLM 5 | $1.20 | $3.50 |
| DeepSeek V3.1 | $1.05 | $3.10 |
| GPT-OSS 120B | $0.10 | $0.49 |
| Qwen3 30B | $0.15 | $0.55 |
| GLM 4.7 | $0.85 | $3.30 |

Phala Models

| Model | Prompt per M | Completion per M |
| --- | --- | --- |
| Qwen3.5 27B | $0.30 | $2.40 |
| Qwen3 VL 30B | $0.20 | $0.70 |
| Qwen3 Embedding 8B | $0.01 | $0.00 |
| Gemma 3 27B | $0.11 | $0.40 |
| GLM 4.7 Flash | $0.10 | $0.43 |
| GPT-OSS 20B | $0.04 | $0.15 |
| Qwen 2.5 7B | $0.04 | $0.10 |
| Uncensored 24B | $0.20 | $0.90 |
| all-MiniLM-L6-v2 | $0.005 | $0.00 |

Tinfoil Models

All Tinfoil models use flat-rate pricing: $2/M tokens for both prompt and completion.
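Because prices are quoted per million tokens, the cost of a workload is the token count divided by one million, multiplied by the listed rate, summed over prompt and completion. The sketch below works through that arithmetic for a few models from the tables above. The function name is hypothetical, and the prices are a snapshot of this page rather than a live lookup.

```python
# Sketch: estimate request cost from per-million-token prices listed on this page.
from decimal import Decimal

# (prompt $/M, completion $/M); a small sample of the tables above.
PRICES_PER_MILLION = {
    "deepseek/deepseek-v3.2": (Decimal("0.32"), Decimal("0.48")),
    "openai/gpt-oss-120b": (Decimal("0.10"), Decimal("0.49")),
    "phala/gpt-oss-20b": (Decimal("0.04"), Decimal("0.15")),
    "meta-llama/llama-3.3-70b-instruct": (Decimal("2"), Decimal("2")),  # Tinfoil flat rate
}

TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost(model_id: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Return the estimated cost in USD for the given token counts."""
    prompt_price, completion_price = PRICES_PER_MILLION[model_id]
    total = prompt_tokens * prompt_price + completion_tokens * completion_price
    return total / TOKENS_PER_MILLION


# 2M prompt tokens and 0.5M completion tokens on GPT-OSS 120B:
# 2 * $0.10 + 0.5 * $0.49
print(estimate_cost("openai/gpt-oss-120b", 2_000_000, 500_000))
```

Decimal is used instead of float so that small per-token amounts do not accumulate rounding error across many requests.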

Migration Guide

From Regular Models to GPU TEE

Simply change the model name:
# Before (regular model)
response = client.chat.completions.create(
    model="openai/gpt-5",
    messages=[...]
)

# After (GPU TEE confidential model)
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[...]  # Same API!
)
No other code changes required!
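To make the "only the model name changes" point concrete, the sketch below builds the two requests with the Python standard library and leaves sending them to the caller. The /chat/completions path is an assumption based on the OpenAI-style client calls used throughout this page; the base URL and bearer-token header come from the curl examples. The helper name is hypothetical.

```python
# Sketch: the before/after requests differ only in the "model" field.
import json
import urllib.request

BASE_URL = "https://api.redpill.ai/v1"  # taken from the curl examples on this page


def build_chat_request(api_key: str, model: str, messages: list) -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",  # assumed OpenAI-compatible path
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


messages = [{"role": "user", "content": "Hello"}]
before = build_chat_request("YOUR_API_KEY", "openai/gpt-5", messages)
after = build_chat_request("YOUR_API_KEY", "openai/gpt-oss-120b", messages)

# To send: urllib.request.urlopen(after) (requires a valid key and network access).
print(json.loads(after.data)["model"])
```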

FAQs

Check the providers field in the API response. Models with "phala" in their providers array run on Phala’s GPU TEE infrastructure. For all GPU TEE providers, filter for phala, near-ai, tinfoil, or chutes.
openai/gpt-oss-120b uses OpenAI’s open-weight architecture and currently runs through Near AI’s TEE provider.
phala/qwen-2.5-7b-instruct is the smallest and fastest option for simple tasks, while phala/glm-4.7-flash offers the best speed-to-quality ratio.
Models with vision support:
  • phala/qwen3-vl-30b-a3b-instruct - Vision + text (30B)
  • phala/gemma-3-27b-it - Vision + text (27B)
  • moonshotai/kimi-k2.5 and moonshotai/kimi-k2.6 - Vision + text through Chutes
The model ID phala/qwen2.5-vl-72b-instruct is a legacy alias that now routes to phala/qwen3-vl-30b-a3b-instruct. Both IDs work, but we recommend using the qwen3 variant.
Enterprise customers can fine-tune models in TEE. Contact sales@redpill.ai.
FP8 quantization stores weights in 8-bit floating point, which reduces model size and increases speed with minimal quality loss (~1%). This makes TEE inference more efficient.
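As a rough illustration of the size reduction, FP8 uses one byte per weight where FP16 uses two, so weight memory halves. The sketch below works through that arithmetic for a 27B-parameter model. It counts weights only, in decimal gigabytes, and ignores KV cache, activations, and runtime overhead, so it is not a statement about RedPill's actual deployment footprint.

```python
# Sketch: weight-only memory footprint at different precisions.
BYTES_PER_WEIGHT = {"fp16": 2, "fp8": 1}


def weight_memory_gb(parameters: int, dtype: str) -> float:
    """Return approximate weight memory in decimal GB (1 GB = 1e9 bytes)."""
    return parameters * BYTES_PER_WEIGHT[dtype] / 1e9


params_27b = 27_000_000_000
fp16_gb = weight_memory_gb(params_27b, "fp16")
fp8_gb = weight_memory_gb(params_27b, "fp8")
print(fp16_gb, fp8_gb)
```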

Next Steps

Start Using Models

Make your first request

Verify Attestation

Cryptographic proof of TEE

API Reference

Complete API documentation

Pricing Details

Compare model costs