All Confidential AI Models
RedPill offers 17 confidential AI models running entirely in GPU TEE (Trusted Execution Environment) across 4 TEE providers: Phala Network, Tinfoil, Near AI, and Chutes.

- Phala Network: 8 models with FP8 quantization
- Tinfoil: 4 models including Kimi K2 Thinking
- Near AI: 3 models including DeepSeek V3.1
- Chutes: 2 models including DeepSeek V3.2
Phala TEE Models
Models powered by Phala Network’s GPU TEE infrastructure with FP8 quantization:

| Model | Parameters | Context | Modality | Price (Completion) |
|---|---|---|---|---|
| phala/glm-5 | Large | 202K | Text | $3.50 per M |
| phala/gpt-oss-120b | 117B (MoE) | 131K | Text | $0.49 per M |
| phala/qwen3-vl-30b-a3b-instruct | 30B (MoE) | 128K | Vision + Text | $0.70 per M |
| phala/gemma-3-27b-it | 27B | 53K | Vision + Text | $0.40 per M |
| phala/uncensored-24b | 24B | 32K | Text | $0.90 per M |
| phala/gpt-oss-20b | 21B (MoE) | 131K | Text | $0.15 per M |
| phala/glm-4.7-flash | ~30B | 202K | Text | $0.43 per M |
| phala/qwen-2.5-7b-instruct | 7B | 32K | Text | $0.10 per M |
New: GLM-5 and GLM 4.7 Flash are the latest ZhipuAI models now available with full GPU TEE protection. Venice Uncensored 24B offers an alignment-free model for advanced use cases.
The model ID phala/qwen2.5-vl-72b-instruct is a legacy alias that now routes to phala/qwen3-vl-30b-a3b-instruct.

Tinfoil TEE Models
Models powered by Tinfoil’s confidential computing infrastructure:

| Model | Parameters | Context | Modality | Price (Prompt/Completion) |
|---|---|---|---|---|
| deepseek/deepseek-r1-0528 | 685B (MoE) | 163K | Text | $2.00 per M |
| qwen/qwen3-coder-480b-a35b-instruct | 480B (MoE) | 262K | Text | $2.00 per M |
| moonshotai/kimi-k2-thinking | 1T (MoE, 32B active) | 262K | Text | $2.00 per M |
| meta-llama/llama-3.3-70b-instruct | 70B | 131K | Text | $2.00 per M |
New: Kimi K2 Thinking is Moonshot AI’s most advanced open reasoning model, optimized for persistent step-by-step thought and dynamic tool invocation across hundreds of turns. Tinfoil models use flat-rate pricing at $2/M tokens.
Near AI TEE Models
Models powered by Near AI’s decentralized TEE infrastructure:

| Model | Parameters | Context | Modality | Price (Completion) |
|---|---|---|---|---|
| deepseek/deepseek-chat-v3.1 | 671B (MoE) | 163K | Text | $2.50 per M |
| qwen/qwen3-30b-a3b-instruct-2507 | 30B (MoE) | 262K | Text | $0.45 per M |
| z-ai/glm-4.7 | 130B | 131K | Text | $3.30 per M |
GLM-4.7 is ZhipuAI’s latest flagship model with enhanced programming capabilities and more stable multi-step reasoning. DeepSeek V3.1 supports both thinking and non-thinking modes.
Chutes TEE Models
Models powered by Chutes’ confidential computing infrastructure:

| Model | Parameters | Context | Modality | Price (Completion) |
|---|---|---|---|---|
| deepseek/deepseek-v3.2 | 685B (MoE) | 163K | Text | $0.40 per M |
| moonshotai/kimi-k2.5 | Large (MoE) | 262K | Text + Image | $3.00 per M |
New: Chutes offers competitive pricing on DeepSeek V3.2 and Kimi K2.5. Kimi K2.5 is a native multimodal model with state-of-the-art visual coding and agentic capabilities.
Identifying TEE Models
TEE models can be identified in three ways:

1. Use the dedicated endpoint
2. Check the providers field
Every model in the API response includes a providers array. Filter for TEE provider names:
phala, tinfoil, near-ai, chutes
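As a sketch, filtering a models listing for TEE providers might look like the following. The response shape is an assumption modeled on an OpenAI-style /v1/models listing; only the provider names come from these docs.

```python
# Sketch: pick out TEE-backed models from a /v1/models-style listing.
# The response shape is an assumption; the provider names
# (phala, tinfoil, near-ai, chutes) are from the docs.
TEE_PROVIDERS = {"phala", "tinfoil", "near-ai", "chutes"}

sample_response = {
    "data": [
        {"id": "phala/glm-5", "providers": ["phala"]},
        {"id": "moonshotai/kimi-k2-thinking", "providers": ["tinfoil"]},
        {"id": "gpt-4o", "providers": ["openai"]},
    ]
}

def tee_models(response: dict) -> list[str]:
    """Return IDs of models served by at least one TEE provider."""
    return [
        m["id"]
        for m in response["data"]
        if TEE_PROVIDERS & set(m.get("providers", []))
    ]

print(tee_models(sample_response))  # → ['phala/glm-5', 'moonshotai/kimi-k2-thinking']
```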
3. Use the phala/ model ID prefix
Any model can be accessed via the phala/ prefix to explicitly route through Phala’s TEE.
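A minimal sketch of this routing rule, building a chat-completions payload (the payload shape follows the usual OpenAI-compatible convention; confirm the endpoint URL and auth header against the API reference):

```python
# Sketch: route a request through Phala's TEE by prefixing the model ID.
# The payload shape assumes the usual OpenAI-compatible chat API.
import json

def tee_request(model: str, prompt: str) -> dict:
    """Build a chat-completions payload, forcing the phala/ TEE route."""
    if not model.startswith("phala/"):
        model = f"phala/{model}"  # explicit TEE routing
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = tee_request("gpt-oss-120b", "Hello")
print(json.dumps(payload))
```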
Model Details
phala/glm-5
Best Overall Quality
Flagship model for complex systems engineering and agent workflows
- Context Length: 202,752 tokens (202K)
- Quantization: FP8
- Modality: Text -> Text
- Production-grade productivity for large-scale programming tasks
- Performance aligned to top closed-source models
- Expert-level system design capabilities
- Complex systems engineering
- Long-horizon agent workflows
- Large-scale programming tasks
- Advanced code generation
phala/gpt-oss-120b
OpenAI Architecture
OpenAI’s open-weight model with familiar behavior
- Parameters: 117 billion (MoE, 5.1B active)
- Context Length: 131,072 tokens
- Quantization: FP8
- Modality: Text -> Text
- Configurable reasoning depth
- Full chain-of-thought access
- Native function calling
- Structured output generation
- AI agents and automation
- Complex task planning
- Tool use and API integration
- Production workloads requiring reasoning
phala/gpt-oss-20b
Efficient & Fast
Smaller model for low-latency applications
- Parameters: 21 billion (MoE, 3.6B active)
- Context Length: 131,072 tokens
- Quantization: FP8
- Modality: Text -> Text
- OpenAI Harmony response format
- Reasoning level configuration
- Function calling and tool use
- Structured outputs
- Apache 2.0 license
- Real-time chatbots
- Edge deployment
- Cost-sensitive applications
- High-throughput workloads
phala/qwen3-vl-30b-a3b-instruct
Vision + Language
Multimodal model for image and video understanding
- Parameters: 30 billion (MoE, 3B active)
- Context Length: 128,000 tokens
- Quantization: FP8
- Modality: Text + Image -> Text
- Real-world and synthetic object perception
- 2D/3D spatial grounding
- Long-form visual comprehension
- GUI automation and visual coding
- Document AI and OCR
- Document OCR and understanding
- Chart and graph analysis
- Visual quality inspection
- UI/UX automation
- Video timeline analysis
phala/gemma-3-27b-it
Google's Latest
Multimodal capabilities with strong multilingual support
- Parameters: 27 billion
- Context Length: 53,920 tokens (53K)
- Quantization: FP8
- Modality: Text + Image -> Text
- Multimodality support
- Context windows up to 128K tokens
- 140+ language understanding
- Improved math and reasoning
- Structured outputs
- Function calling
- Multilingual applications (140+ languages)
- Math and reasoning tasks
- Structured data generation
- Function calling workflows
- Chat applications
phala/uncensored-24b
Uncensored
Alignment-free model for unrestricted use cases
- Parameters: 24 billion
- Context Length: 32,768 tokens (32K)
- Quantization: FP8
- Modality: Text -> Text
- Full user control over alignment and behavior
- Steerability and transparent behavior
- No default safety layers
- Advanced and unrestricted use cases
- Creative writing without content filters
- Research requiring unrestricted outputs
- Custom alignment experimentation
- Red-teaming and safety research
phala/glm-4.7-flash
Fast & Efficient
30B-class model optimized for agentic coding
- Parameters: ~30B
- Context Length: 202,752 tokens (202K)
- Quantization: FP8
- Modality: Text -> Text
- Optimized agentic coding capabilities
- Strengthened long-horizon task planning
- Tool collaboration
- Leading open-source performance at its size class
- Agentic coding workflows
- Long-horizon task planning
- Tool-assisted development
- Cost-effective general purpose
phala/qwen-2.5-7b-instruct
Budget-Friendly
Most cost-effective confidential model
- Parameters: 7 billion
- Context Length: 32,768 tokens (32K)
- Quantization: FP8
- Modality: Text -> Text
- Enhanced coding and mathematics capabilities
- Better instruction following
- Improved long text generation (8K+ tokens)
- Structured data understanding (tables, JSON)
- Multilingual support (29+ languages)
- High-volume applications
- Multilingual support
- Simple chatbots
- Text classification
- Data extraction
Feature Comparison
| Feature | GLM 5 | GPT-OSS 120B | GPT-OSS 20B | Qwen3 VL 30B | Gemma 3 27B | Uncensored 24B | GLM 4.7 Flash | Qwen 2.5 7B |
|---|---|---|---|---|---|---|---|---|
| TEE Protected | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Function Calling | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Vision | No | No | No | Yes | Yes | No | No | No |
| Structured Output | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Streaming | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Multilingual | Yes | Yes | Yes | Yes | Yes (140+) | Yes | Yes | Yes (29+) |
Selection Guide
By Quality Requirements
Highest Quality:
- phala/glm-5 - Best overall for systems engineering
- phala/gpt-oss-120b (117B) - OpenAI architecture

Vision and Multimodal:
- phala/qwen3-vl-30b-a3b-instruct (30B) - Vision + text
- phala/gemma-3-27b-it (27B) - Multimodal with 140+ languages

Balanced Quality and Cost:
- phala/glm-4.7-flash (~30B) - Fast with long context (202K)
- phala/gemma-3-27b-it (27B) - Good quality, reasonable cost
- phala/gpt-oss-20b (21B) - Fast and efficient

Budget:
- phala/qwen-2.5-7b-instruct (7B) - Most economical
- phala/gpt-oss-20b (21B) - Great value
By Use Case
Complex Reasoning:
- phala/glm-5 - Systems engineering and agent workflows
- phala/gpt-oss-120b - OpenAI-style reasoning

Vision Tasks:
- phala/qwen3-vl-30b-a3b-instruct - Document AI, OCR, visual coding
- phala/gemma-3-27b-it - General multimodal tasks

Agents and Tool Use:
- phala/glm-4.7-flash - Optimized for agentic coding
- phala/gpt-oss-120b - Best for tool use agents

Long Context:
- phala/glm-5 (202K) - Longest context
- phala/glm-4.7-flash (202K) - Fast with long context

Multilingual:
- phala/gemma-3-27b-it - 140+ languages
- phala/qwen-2.5-7b-instruct - 29+ languages

Cost-Sensitive:
- phala/qwen-2.5-7b-instruct - Lowest cost
- phala/gpt-oss-20b - Fast inference

Unrestricted:
- phala/uncensored-24b - No alignment filters
Attestation Support
All Phala models provide cryptographic attestation. Models with an appid in their metadata support TEE verification:
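A sketch of checking which models advertise attestation support. Only the appid metadata key comes from the docs; the listing shape below is an assumed OpenAI-style response.

```python
# Sketch: find models that support TEE verification via an appid.
# The response shape is an assumption; the "appid" key is from the docs.
sample_models = [
    {"id": "phala/gpt-oss-120b", "metadata": {"appid": "0x1234"}},
    {"id": "meta-llama/llama-3.3-70b-instruct", "metadata": {}},
]

def supports_attestation(model: dict) -> bool:
    """A model supports TEE verification when its metadata carries an appid."""
    return bool(model.get("metadata", {}).get("appid"))

verifiable = [m["id"] for m in sample_models if supports_attestation(m)]
print(verifiable)  # → ['phala/gpt-oss-120b']
```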
Attestation Guide
Learn how to verify TEE execution ->
Pricing Comparison
Phala Models
| Model | Prompt per M | Completion per M |
|---|---|---|
| Qwen 2.5 7B | $0.04 | $0.10 |
| GPT-OSS 20B | $0.04 | $0.15 |
| GLM 4.7 Flash | $0.10 | $0.43 |
| GPT-OSS 120B | $0.10 | $0.49 |
| Gemma 3 27B | $0.11 | $0.40 |
| Qwen3 VL 30B | $0.20 | $0.70 |
| Uncensored 24B | $0.20 | $0.90 |
| GLM 5 | $1.20 | $3.50 |
Tinfoil Models
All Tinfoil models use flat-rate pricing: $2/M tokens for both prompt and completion.

Near AI Models
| Model | Prompt per M | Completion per M |
|---|---|---|
| Qwen3 30B | $0.15 | $0.45 |
| GLM 4.7 | $0.85 | $3.30 |
| DeepSeek V3.1 | $1.00 | $2.50 |
Chutes Models
| Model | Prompt per M | Completion per M |
|---|---|---|
| DeepSeek V3.2 | $0.27 | $0.40 |
| Kimi K2.5 | $0.60 | $3.00 |
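Applying these per-million-token rates to a single request is straightforward; a small sketch, with prices hard-coded from the tables above:

```python
# Sketch: estimate per-request cost from the pricing tables above.
PRICES = {  # (prompt, completion) in USD per 1M tokens
    "phala/gpt-oss-120b": (0.10, 0.49),
    "deepseek/deepseek-v3.2": (0.27, 0.40),
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD for a single request."""
    p_rate, c_rate = PRICES[model]
    return (prompt_tokens * p_rate + completion_tokens * c_rate) / 1_000_000

# 10K prompt tokens + 2K completion tokens on GPT-OSS 120B:
print(f"${request_cost('phala/gpt-oss-120b', 10_000, 2_000):.6f}")  # → $0.001980
```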
Migration Guide
From Regular Models to Phala
Simply change the model name: for example, replace gpt-oss-120b with phala/gpt-oss-120b to route the same request through Phala’s TEE.

FAQs
How do I know if a model runs on Phala?
Check the providers field in the API response. Models with "phala" in their providers array run on Phala’s GPU TEE infrastructure. You can also use the phala/ prefix in the model ID, or call the dedicated /v1/models/phala endpoint.
Which model is most similar to GPT-4?
phala/gpt-oss-120b - It’s OpenAI’s architecture and has similar capabilities.
Which model is fastest?
phala/qwen-2.5-7b-instruct - Smallest and fastest for simple tasks. phala/glm-4.7-flash offers the best speed-to-quality ratio.
Which model supports images?
- phala/qwen3-vl-30b-a3b-instruct - Vision + text (30B)
- phala/gemma-3-27b-it - Vision + text (27B)
What about phala/qwen2.5-vl-72b-instruct?
This model ID is a legacy alias that now routes to phala/qwen3-vl-30b-a3b-instruct. Both IDs work, but we recommend using the qwen3 variant.
Can I fine-tune these models?
Enterprise customers can fine-tune models in TEE. Contact sales@redpill.ai.
What's FP8 quantization?
FP8 quantization reduces model size and increases inference speed with minimal quality loss (~1%), enabling efficient TEE inference.
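As a toy illustration of why the quality loss is small, the sketch below enumerates every finite FP8 E4M3 value (a format commonly used for inference weights) and rounds a number to the nearest one. Real FP8 inference also applies per-tensor scaling, which this ignores.

```python
# Toy sketch: round a value to the nearest FP8 E4M3 number.
# E4M3: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits;
# the all-ones encoding is reserved for NaN, so the max finite value is 448.
def e4m3_grid() -> list[float]:
    vals = {0.0}
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue                  # NaN encoding, not a number
            if e == 0:
                v = (m / 8) * 2.0 ** -6   # subnormals
            else:
                v = (1 + m / 8) * 2.0 ** (e - 7)
            vals.update((v, -v))
    return sorted(vals)

GRID = e4m3_grid()

def quantize(x: float) -> float:
    """Round-to-nearest onto the E4M3 grid (saturating at +/-448)."""
    return min(GRID, key=lambda v: abs(v - x))

print(quantize(0.1))  # → 0.1015625 (~1.6% relative error)
print(max(GRID))      # → 448.0
```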
Next Steps
Start Using Models
Make your first request
Verify Attestation
Cryptographic proof of TEE
API Reference
Complete API documentation
Pricing Details
Compare model costs