All Confidential AI Models
RedPill offers GPU TEE model entries running entirely in Trusted Execution Environments across four TEE providers: Chutes, Near AI, Phala Network, and Tinfoil. Some legacy aliases are also accepted for compatibility; use GET /v1/models for the live catalog.
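A minimal sketch of reading the live catalog from GET /v1/models. The base URL, the bearer-token header, and the OpenAI-style `{"data": [{"id": ...}]}` response shape are assumptions not stated on this page; check them against the API reference before relying on them.

```python
import json
import urllib.request

# Assumption: the base URL and bearer-token auth are not given on this page.
BASE_URL = "https://api.redpill.ai/v1"


def fetch_catalog(api_key: str, base_url: str = BASE_URL) -> dict:
    """Call GET /v1/models and return the parsed JSON body."""
    request = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def list_model_ids(catalog: dict) -> list[str]:
    """Return sorted model IDs, assuming a {"data": [{"id": ...}]} body."""
    return sorted(entry["id"] for entry in catalog.get("data", []))
```

Calling `list_model_ids(fetch_catalog(api_key))` with your own key would then give the IDs currently served, which may differ from the tables on this page.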
Chutes
DeepSeek, MiniMax, Kimi, GLM, Qwen, and MiMo
Near AI
DeepSeek V3.1, GLM, GPT-OSS, and Qwen
Phala Network
Qwen, Gemma, GPT-OSS, GLM, and embeddings
Tinfoil
DeepSeek R1, Qwen Coder, Kimi Thinking, and Llama
Chutes TEE Models
Models powered by Chutes’ confidential computing infrastructure:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| z-ai/glm-5.1 | Large | 203K | Text | $1.21 / $4.20 per M |
| moonshotai/kimi-k2.6 | Large (MoE) | 262K | Text + Image | $1.09 / $4.60 per M |
| qwen/qwen3.5-397b-a17b | 397B (MoE) | 262K | Text | $0.55 / $3.50 per M |
| qwen/qwen3-coder-next | Large | 262K | Text | $0.18 / $1.20 per M |
| minimax/minimax-m2.5 | Large | 197K | Text | $0.20 / $1.38 per M |
| xiaomi/mimo-v2-flash | Large | 262K | Text | $0.10 / $0.30 per M |
| deepseek/deepseek-v3.2 | 685B (MoE) | 164K | Text | $0.32 / $0.48 per M |
| moonshotai/kimi-k2.5 | Large (MoE) | 262K | Text + Image | $0.60 / $3.00 per M |
New: Chutes now includes GLM 5.1, Kimi K2.6, MiniMax M2.5, Qwen3 Coder Next, Qwen3.5 397B, and MiMo V2 Flash.
Near AI TEE Models
Models powered by Near AI’s decentralized TEE infrastructure:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| z-ai/glm-5 | Large | 203K | Text | $1.20 / $3.50 per M |
| deepseek/deepseek-chat-v3.1 | 671B (MoE) | 164K | Text | $1.05 / $3.10 per M |
| openai/gpt-oss-120b | 117B (MoE) | 131K | Text | $0.10 / $0.49 per M |
| qwen/qwen3-30b-a3b-instruct-2507 | 30B (MoE) | 262K | Text | $0.15 / $0.55 per M |
| z-ai/glm-4.7 | 130B | 131K | Text | $0.85 / $3.30 per M |
GLM-5 and GPT-OSS-120B are high-capacity models now running through Near AI’s TEE infrastructure. DeepSeek V3.1 supports both thinking and non-thinking modes.
Phala TEE Models
Models powered by Phala Network’s GPU TEE infrastructure with FP8 quantization:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| phala/qwen3.5-27b | 27B | 262K | Text | $0.30 / $2.40 per M |
| phala/qwen3-vl-30b-a3b-instruct | 30B (MoE) | 128K | Vision + Text | $0.20 / $0.70 per M |
| qwen/qwen3-embedding-8b | 8B | 32K | Embeddings | $0.01 / $0.00 per M |
| phala/gemma-3-27b-it | 27B | 53K | Vision + Text | $0.11 / $0.40 per M |
| phala/glm-4.7-flash | ~30B | 202K | Text | $0.10 / $0.43 per M |
| phala/gpt-oss-20b | 21B (MoE) | 131K | Text | $0.04 / $0.15 per M |
| phala/qwen-2.5-7b-instruct | 7B | 32K | Text | $0.04 / $0.10 per M |
| phala/qwen2.5-vl-72b-instruct | 72B | 128K | Vision + Text | $1.20 per M |
| phala/uncensored-24b | 24B | 32K | Text | $0.20 / $0.90 per M |
| sentence-transformers/all-minilm-l6-v2 | 22M | 512 | Embeddings | $0.005 / $0.00 per M |
New: Qwen3.5-27B and confidential embedding models are now available through Phala. Venice Uncensored 24B offers an alignment-free model for advanced use cases.
The model ID phala/qwen2.5-vl-72b-instruct is a legacy alias that now routes to phala/qwen3-vl-30b-a3b-instruct.
Tinfoil TEE Models
Models powered by Tinfoil’s confidential computing infrastructure:
| Model | Parameters | Context | Modality | Price (Prompt / Completion) |
|---|---|---|---|---|
| qwen/qwen3-coder-480b-a35b-instruct | 480B (MoE) | 262K | Text | $2 / $2 per M |
| moonshotai/kimi-k2-thinking | 1T (MoE, 32B active) | 262K | Text | $2 / $2 per M |
| deepseek/deepseek-r1-0528 | 685B (MoE) | 163K | Text | $2 / $2 per M |
| meta-llama/llama-3.3-70b-instruct | 70B | 131K | Text | $2 / $2 per M |
New: Kimi K2 Thinking is Moonshot AI’s most advanced open reasoning model, optimized for persistent step-by-step thought and dynamic tool invocation across hundreds of turns. Tinfoil models use flat-rate pricing at $2/M tokens.
Identifying TEE Models
TEE models can be identified in three ways:
1. Use the dedicated endpoint
2. Check the providers field
Every model in the API response includes a providers array. Filter for TEE provider names:
phala, tinfoil, near-ai, chutes
3. Check model aliases
Some confidential models expose compatibility aliases in addition to their canonical model ID. Prefer the canonical ID from /v1/models in new integrations.
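The provider check and the alias check can be sketched together as follows. This assumes each catalog entry carries `id` and a `providers` list of plain strings inside an OpenAI-style `data` array; the alias map contains only the one legacy alias documented on this page, and the non-TEE entry in the usage example is hypothetical.

```python
# Provider names listed on this page as TEE providers.
TEE_PROVIDERS = {"phala", "tinfoil", "near-ai", "chutes"}

# The only legacy alias documented on this page; extend it from /v1/models.
LEGACY_ALIASES = {
    "phala/qwen2.5-vl-72b-instruct": "phala/qwen3-vl-30b-a3b-instruct",
}


def tee_providers_for(model: dict) -> set[str]:
    """Return the TEE providers that serve this catalog entry."""
    return TEE_PROVIDERS & set(model.get("providers", []))


def filter_tee_models(catalog: dict) -> list[str]:
    """Return IDs of entries served by at least one TEE provider."""
    return [m["id"] for m in catalog.get("data", []) if tee_providers_for(m)]


def canonical_id(model_id: str) -> str:
    """Map a known legacy alias to its canonical ID; pass others through."""
    return LEGACY_ALIASES.get(model_id, model_id)


# Usage with a hand-written payload (the second entry is hypothetical).
example_catalog = {
    "data": [
        {"id": "phala/gpt-oss-20b", "providers": ["phala"]},
        {"id": "example/non-tee-model", "providers": ["other"]},
    ]
}
tee_ids = filter_tee_models(example_catalog)
```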
Model Details
phala/glm-5
Best Overall Quality
Flagship model for complex systems engineering and agent workflows
- Context Length: 202,752 tokens (202K)
- Quantization: FP8
- Modality: Text -> Text
- Production-grade productivity for large-scale programming tasks
- Performance aligned to top closed-source models
- Expert-level system design capabilities
- Complex systems engineering
- Long-horizon agent workflows
- Large-scale programming tasks
- Advanced code generation
phala/gpt-oss-120b
OpenAI Architecture
OpenAI’s open-weight model with familiar behavior
- Parameters: 117 billion (MoE, 5.1B active)
- Context Length: 131,072 tokens
- Quantization: FP8
- Modality: Text -> Text
- Configurable reasoning depth
- Full chain-of-thought access
- Native function calling
- Structured output generation
- AI agents and automation
- Complex task planning
- Tool use and API integration
- Production workloads requiring reasoning
phala/gpt-oss-20b
Efficient & Fast
Smaller model for low-latency applications
- Parameters: 21 billion (MoE, 3.6B active)
- Context Length: 131,072 tokens
- Quantization: FP8
- Modality: Text -> Text
- OpenAI Harmony response format
- Reasoning level configuration
- Function calling and tool use
- Structured outputs
- Apache 2.0 license
- Real-time chatbots
- Edge deployment
- Cost-sensitive applications
- High-throughput workloads
phala/qwen3-vl-30b-a3b-instruct
Vision + Language
Multimodal model for image and video understanding
- Parameters: 30 billion (MoE, 3B active)
- Context Length: 128,000 tokens
- Quantization: FP8
- Modality: Text + Image -> Text
- Real-world and synthetic object perception
- 2D/3D spatial grounding
- Long-form visual comprehension
- GUI automation and visual coding
- Document AI and OCR
- Document OCR and understanding
- Chart and graph analysis
- Visual quality inspection
- UI/UX automation
- Video timeline analysis
phala/gemma-3-27b-it
Google's Latest
Multimodal capabilities with strong multilingual support
- Parameters: 27 billion
- Context Length: 53,920 tokens (53K)
- Quantization: FP8
- Modality: Text + Image -> Text
- Multimodality support
- Native context window up to 128K tokens (this deployment serves 53,920)
- 140+ language understanding
- Improved math and reasoning
- Structured outputs
- Function calling
- Multilingual applications (140+ languages)
- Math and reasoning tasks
- Structured data generation
- Function calling workflows
- Chat applications
phala/uncensored-24b
Uncensored
Alignment-free model for unrestricted use cases
- Parameters: 24 billion
- Context Length: 32,768 tokens (32K)
- Quantization: FP8
- Modality: Text -> Text
- Full user control over alignment and behavior
- Steerability and transparent behavior
- No default safety layers
- Advanced and unrestricted use cases
- Creative writing without content filters
- Research requiring unrestricted outputs
- Custom alignment experimentation
- Red-teaming and safety research
phala/glm-4.7-flash
Fast & Efficient
30B-class model optimized for agentic coding
- Parameters: ~30B
- Context Length: 202,752 tokens (202K)
- Quantization: FP8
- Modality: Text -> Text
- Optimized agentic coding capabilities
- Strengthened long-horizon task planning
- Tool collaboration
- Leading open-source performance at its size class
- Agentic coding workflows
- Long-horizon task planning
- Tool-assisted development
- Cost-effective general purpose
phala/qwen-2.5-7b-instruct
Budget-Friendly
Most cost-effective confidential model
- Parameters: 7 billion
- Context Length: 32,768 tokens (32K)
- Quantization: FP8
- Modality: Text -> Text
- Enhanced coding and mathematics capabilities
- Better instruction following
- Improved long text generation (8K+ tokens)
- Structured data understanding (tables, JSON)
- Multilingual support (29+ languages)
- High-volume applications
- Multilingual support
- Simple chatbots
- Text classification
- Data extraction
Feature Comparison
| Feature | GLM 5 | GPT-OSS 120B | GPT-OSS 20B | Qwen3 VL 30B | Gemma 3 27B | Uncensored 24B | GLM 4.7 Flash | Qwen 2.5 7B |
|---|---|---|---|---|---|---|---|---|
| TEE Protected | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Function Calling | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Vision | No | No | No | Yes | Yes | No | No | No |
| Structured Output | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Streaming | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Multilingual | Yes | Yes | Yes | Yes | Yes (140+) | Yes | Yes | Yes (29+) |
Selection Guide
By Quality Requirements
Highest Quality:
- z-ai/glm-5 - Best overall for systems engineering
- openai/gpt-oss-120b (117B) - OpenAI architecture

- phala/qwen3-vl-30b-a3b-instruct (30B) - Vision + text
- phala/gemma-3-27b-it (27B) - Multimodal with 140+ languages

- phala/glm-4.7-flash (~30B) - Fast with long context (202K)
- phala/gemma-3-27b-it (27B) - Good quality, reasonable cost
- phala/gpt-oss-20b (21B) - Fast and efficient

- phala/qwen-2.5-7b-instruct (7B) - Most economical
- phala/gpt-oss-20b (21B) - Great value
By Use Case
Complex Reasoning:
- z-ai/glm-5 - Systems engineering and agent workflows
- openai/gpt-oss-120b - OpenAI-style reasoning

- phala/qwen3-vl-30b-a3b-instruct - Document AI, OCR, visual coding
- phala/gemma-3-27b-it - General multimodal tasks

- phala/glm-4.7-flash - Optimized for agentic coding
- openai/gpt-oss-120b - Best for tool use agents

- z-ai/glm-5 (203K) - Long systems engineering context
- phala/glm-4.7-flash (202K) - Fast with long context

- phala/gemma-3-27b-it - 140+ languages
- phala/qwen-2.5-7b-instruct - 29+ languages

- phala/qwen-2.5-7b-instruct - Lowest cost
- phala/gpt-oss-20b - Fast inference

- phala/uncensored-24b - No alignment filters
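The selection criteria above can also be applied mechanically. The sketch below hard-codes context lengths and completion prices from this page for six Phala models and picks the cheapest one that meets a minimum context length and an optional vision requirement. Ranking by completion price alone is a simplifying assumption, and the `ModelInfo` and `pick_cheapest` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelInfo:
    model_id: str
    context_tokens: int
    vision: bool
    completion_price_per_m: float  # USD per million completion tokens


# Values copied from the Model Details and Pricing sections of this page.
PHALA_MODELS = [
    ModelInfo("phala/qwen3-vl-30b-a3b-instruct", 128_000, True, 0.70),
    ModelInfo("phala/gemma-3-27b-it", 53_920, True, 0.40),
    ModelInfo("phala/glm-4.7-flash", 202_752, False, 0.43),
    ModelInfo("phala/gpt-oss-20b", 131_072, False, 0.15),
    ModelInfo("phala/qwen-2.5-7b-instruct", 32_768, False, 0.10),
    ModelInfo("phala/uncensored-24b", 32_768, False, 0.90),
]


def pick_cheapest(min_context: int = 0, needs_vision: bool = False) -> Optional[ModelInfo]:
    """Return the cheapest model meeting the constraints, or None if none fits."""
    candidates = [
        m for m in PHALA_MODELS
        if m.context_tokens >= min_context and (m.vision or not needs_vision)
    ]
    return min(candidates, key=lambda m: m.completion_price_per_m, default=None)
```

For production use, the same filter would be better driven by the live /v1/models catalog than by a hard-coded list, since prices and context limits change.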
Attestation Support
Models with an appid in their metadata support RedPill’s TEE verification endpoints. Use /v1/models to check support before calling attestation in production.
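A minimal sketch of that pre-flight check. The page names the `appid` field but does not say where it sits in the /v1/models response, so this looks both at the top level of each entry and under a `metadata` object; both locations, and the placeholder value in the usage example, are assumptions.

```python
from typing import Optional


def get_appid(model: dict) -> Optional[str]:
    """Return the entry's appid, checking two assumed locations."""
    metadata = model.get("metadata") or {}
    return model.get("appid") or metadata.get("appid")


def attestable_model_ids(catalog: dict) -> list[str]:
    """Return IDs of catalog entries that carry a non-empty appid."""
    return [m["id"] for m in catalog.get("data", []) if get_appid(m)]


# Usage with a hand-written payload; "example-app-id" is a placeholder value.
example_catalog = {
    "data": [
        {"id": "phala/gpt-oss-20b", "metadata": {"appid": "example-app-id"}},
        {"id": "z-ai/glm-5", "metadata": {}},
    ]
}
attestable = attestable_model_ids(example_catalog)
```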
Attestation Guide
Learn how to verify TEE execution ->
Pricing Comparison
Chutes Models
| Model | Prompt per M | Completion per M |
|---|---|---|
| GLM 5.1 | $1.21 | $4.20 |
| Kimi K2.6 | $1.09 | $4.60 |
| Qwen3.5 397B | $0.55 | $3.50 |
| Qwen3 Coder Next | $0.18 | $1.20 |
| MiniMax M2.5 | $0.20 | $1.38 |
| MiMo V2 Flash | $0.10 | $0.30 |
| DeepSeek V3.2 | $0.32 | $0.48 |
| Kimi K2.5 | $0.60 | $3.00 |
Near AI Models
| Model | Prompt per M | Completion per M |
|---|---|---|
| GLM 5 | $1.20 | $3.50 |
| DeepSeek V3.1 | $1.05 | $3.10 |
| GPT-OSS 120B | $0.10 | $0.49 |
| Qwen3 30B | $0.15 | $0.55 |
| GLM 4.7 | $0.85 | $3.30 |
Phala Models
| Model | Prompt per M | Completion per M |
|---|---|---|
| Qwen3.5 27B | $0.30 | $2.40 |
| Qwen3 VL 30B | $0.20 | $0.70 |
| Qwen3 Embedding 8B | $0.01 | $0.00 |
| Gemma 3 27B | $0.11 | $0.40 |
| GLM 4.7 Flash | $0.10 | $0.43 |
| GPT-OSS 20B | $0.04 | $0.15 |
| Qwen 2.5 7B | $0.04 | $0.10 |
| Uncensored 24B | $0.20 | $0.90 |
| all-MiniLM-L6-v2 | $0.005 | $0.00 |
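As a worked example of how per-million pricing translates into a per-request cost, the sketch below uses a few of the Phala prices above. It assumes prompt and completion tokens are billed separately and linearly at the listed rates; the `request_cost` helper is hypothetical.

```python
from decimal import Decimal

# (prompt, completion) prices in USD per million tokens, from the table above.
PRICES_PER_M = {
    "phala/qwen3.5-27b": (Decimal("0.30"), Decimal("2.40")),
    "phala/gpt-oss-20b": (Decimal("0.04"), Decimal("0.15")),
    "phala/qwen-2.5-7b-instruct": (Decimal("0.04"), Decimal("0.10")),
}


def request_cost(model_id: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Return the USD cost of one request at the listed per-million rates."""
    prompt_price, completion_price = PRICES_PER_M[model_id]
    total = prompt_price * prompt_tokens + completion_price * completion_tokens
    return total / Decimal(1_000_000)


# 200K prompt tokens and 50K completion tokens on phala/gpt-oss-20b:
# 0.04 * 0.2 + 0.15 * 0.05 = 0.008 + 0.0075 = 0.0155 USD
example_cost = request_cost("phala/gpt-oss-20b", 200_000, 50_000)
```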
Tinfoil Models
All Tinfoil models use flat-rate pricing: $2/M tokens for both prompt and completion.
Migration Guide
From Regular Models to GPU TEE
Simply change the model name in your request.
FAQs
How do I know if a model runs on Phala?
Check the providers field in the API response. Models with "phala" in their providers array run on Phala’s GPU TEE infrastructure. For all GPU TEE providers, filter for phala, near-ai, tinfoil, or chutes.
Which model is most similar to GPT-4?
openai/gpt-oss-120b - It uses OpenAI’s open-weight architecture and currently runs through Near AI’s TEE provider.
Which model is fastest?
phala/qwen-2.5-7b-instruct - Smallest and fastest for simple tasks. phala/glm-4.7-flash offers the best speed-to-quality ratio.
Which model supports images?
- phala/qwen3-vl-30b-a3b-instruct - Vision + text (30B)
- phala/gemma-3-27b-it - Vision + text (27B)
- moonshotai/kimi-k2.5 and moonshotai/kimi-k2.6 - Vision + text through Chutes
What about phala/qwen2.5-vl-72b-instruct?
This model ID is a legacy alias that now routes to phala/qwen3-vl-30b-a3b-instruct. Both IDs work, but we recommend using the qwen3 variant.
Can I fine-tune these models?
Enterprise customers can fine-tune models in TEE. Contact sales@redpill.ai.
What's FP8 quantization?
FP8 stores model weights in an 8-bit floating-point format, which reduces model size and increases speed with minimal quality loss (~1%). This enables efficient TEE inference.
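To make the size reduction concrete, the sketch below estimates the memory needed for weights alone at different precisions. It is back-of-the-envelope arithmetic, not a description of RedPill's deployment: it ignores the KV cache, activations, and any layers kept at higher precision.

```python
def weight_memory_gb(parameters: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return parameters * bits_per_weight / 8 / 1e9


# A 27B-parameter model: 16-bit weights versus FP8 weights.
fp16_gb = weight_memory_gb(27e9, 16)  # 54.0
fp8_gb = weight_memory_gb(27e9, 8)    # 27.0
```

Halving the bytes per weight halves the weight footprint, which is the main reason FP8 helps models fit in the memory available to a GPU TEE.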
Next Steps
Start Using Models
Make your first request
Verify Attestation
Cryptographic proof of TEE
API Reference
Complete API documentation
Pricing Details
Compare model costs