OpenAI-compatible API. Change your base URL, keep everything else.
Thermly is a drop-in replacement for the OpenAI API. Just change two lines:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.thermly.net/v1",  # ← Add this
    api_key="sk-thermo-your-key-here",      # ← Your Thermly key
)

response = client.chat.completions.create(
    model="auto",  # Thermly picks the cheapest adequate model
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Get your API key from the Dashboard → API Keys page.
All API requests require an API key sent in the Authorization header:
Authorization: Bearer sk-thermo-your-key-here

https://api.thermly.net/v1/chat/completions

This is the only endpoint you need. It's fully compatible with the OpenAI Chat Completions API format.
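If you are not using an OpenAI SDK, the header and endpoint above can be wired up by hand. The following is a minimal sketch using only the Python standard library. The helper names `build_chat_request` and `send_chat_request` are made up for illustration, the POST method is inferred from the OpenAI-compatible format, and the 30-second timeout is an arbitrary choice.

```python
import json
import urllib.request

THERMLY_URL = "https://api.thermly.net/v1/chat/completions"


def build_chat_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a request for the chat completions endpoint.

    Hypothetical helper: POST is assumed from the OpenAI-compatible format.
    """
    return urllib.request.Request(
        THERMLY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # key goes in the Bearer token
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(req: urllib.request.Request) -> dict:
    """Send the request and return the parsed JSON body (timeout is an assumption)."""
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

Calling `send_chat_request(build_chat_request("sk-thermo-your-key-here", {"model": "auto", "messages": [{"role": "user", "content": "Hello!"}]}))` should return a dictionary in the response format documented on this page.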
Let Thermly pick the best model for each request. This is where the cost savings come from.
| Model | Description |
|---|---|
| auto | Analyzes your request and routes to the cheapest model that can handle it. Best for most use cases. |
| fast | Always uses the fastest, cheapest model. Best for simple tasks. |
| balanced | Mid-tier model. Balance of cost and quality. |
| best | Always uses the most capable model. Best for complex tasks. |
Specify a model by name to bypass smart routing. Your request goes directly to that provider.
| Provider | Model |
|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4, gpt-3.5-turbo, o1, o3-mini |
| Anthropic | claude-sonnet-4-5, claude-haiku-4-5, claude-opus-4, claude-3-haiku |
| Google | gemini-2.5-flash, gemini-2.5-pro |

Aliases and dated model names also work, such as claude-haiku, gemini-flash, or claude-3-5-sonnet-20241022. They automatically resolve to the correct model.

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model to use. Use "auto" for smart routing, or a specific model name. |
| messages | array | Yes | Array of message objects with role and content. Max 100 messages. |
| temperature | float | No | Sampling temperature, 0.0 to 2.0. Higher = more creative. Default: provider default. |
| max_tokens | integer | No | Maximum tokens in the response. Max: 16,384. |
| top_p | float | No | Nucleus sampling, 0.0 to 1.0. |
| stop | string or array | No | Stop sequences. Up to 4 sequences. |
| tools | array | No | Function calling tools. Max 64 tools. |
| stream | boolean | No | Not yet supported. Coming soon. |
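Checking a request body against these limits before sending it can save a round trip that would end in a 422. The sketch below is one way to do that on the client side. `validate_chat_request` is a hypothetical helper, not part of any SDK, and it only mirrors the limits listed in the table. The server remains the source of truth.

```python
from typing import List


def validate_chat_request(body: dict) -> List[str]:
    """Return a list of problems found in a request body (empty list = looks fine).

    Hypothetical client-side check based only on the documented parameter limits.
    """
    problems = []

    if not isinstance(body.get("model"), str) or not body.get("model"):
        problems.append("model is required and must be a string")

    messages = body.get("messages")
    if not isinstance(messages, list):
        problems.append("messages is required and must be an array")
    elif len(messages) > 100:
        problems.append("messages may contain at most 100 entries")

    temperature = body.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        problems.append("temperature must be between 0.0 and 2.0")

    max_tokens = body.get("max_tokens")
    if max_tokens is not None and max_tokens > 16384:
        problems.append("max_tokens may not exceed 16,384")

    top_p = body.get("top_p")
    if top_p is not None and not 0.0 <= top_p <= 1.0:
        problems.append("top_p must be between 0.0 and 1.0")

    stop = body.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        problems.append("stop accepts at most 4 sequences")

    tools = body.get("tools")
    if isinstance(tools, list) and len(tools) > 64:
        problems.append("tools accepts at most 64 entries")

    if body.get("stream"):
        problems.append("stream is not yet supported")

    return problems
```

An empty list means the body passes these checks. Anything else is a human-readable list you can log or raise before the request leaves your process.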
Responses follow the standard OpenAI chat completion format:
{
  "id": "thermo-a1b2c3d4e5f6",
  "object": "chat.completion",
  "created": 1709000000,
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 8,
    "total_tokens": 20
  },
  "x_thermostat": {
    "actual_model": "gpt-4o-mini",
    "complexity_level": 1,
    "cached": false
  }
}

The x_thermostat field is Thermly-specific metadata. It tells you which model actually handled the request and whether the response was cached.
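To log which model served each request, you can pull the x_thermostat block out of the parsed response. This sketch works on a plain dictionary (for example the result of `json.loads` on the response body). `RoutingInfo` and `routing_info` are illustrative names, and treating a missing block as "not cached" is an assumption rather than documented behavior.

```python
import json
from typing import NamedTuple, Optional


class RoutingInfo(NamedTuple):
    actual_model: Optional[str]
    complexity_level: Optional[int]
    cached: bool


def routing_info(completion: dict) -> RoutingInfo:
    """Extract the Thermly-specific metadata from a parsed completion.

    Falls back to None / False when the x_thermostat block is absent (assumption).
    """
    meta = completion.get("x_thermostat") or {}
    return RoutingInfo(
        actual_model=meta.get("actual_model"),
        complexity_level=meta.get("complexity_level"),
        cached=bool(meta.get("cached", False)),
    )


# Trimmed copy of the sample response from this page.
sample = json.loads("""
{
  "model": "gpt-4o-mini",
  "x_thermostat": {"actual_model": "gpt-4o-mini", "complexity_level": 1, "cached": false}
}
""")

info = routing_info(sample)
print(info.actual_model, info.complexity_level, info.cached)
```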
from openai import OpenAI

client = OpenAI(
    base_url="https://api.thermly.net/v1",
    api_key="sk-thermo-your-key-here",
)

# Smart routing - Thermly picks the best model
response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how DNS works."},
    ],
)
print(response.choices[0].message.content)

# Direct model - specify exactly which model to use
response = client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Write a haiku about coding."}],
)
print(response.choices[0].message.content)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.thermly.net/v1",
  apiKey: "sk-thermo-your-key-here",
});

const response = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);

curl https://api.thermly.net/v1/chat/completions \
  -H "Authorization: Bearer sk-thermo-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

All errors follow the same JSON format:
{
  "error": {
    "message": "Human-readable error description",
    "type": "error_type",
    "code": "error_code"
  }
}

| Code | Meaning | What to do |
|---|---|---|
| 401 | Invalid or missing API key | Check your API key is correct and active |
| 402 | Insufficient credits | Add credits from the dashboard billing page |
| 422 | Invalid request parameters | Check your request body matches the parameter specs |
| 429 | Rate limit exceeded | Wait and retry. Check the Retry-After header |
| 503 | All models unavailable | The upstream provider is down. Retry after a few seconds |
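The table above suggests a simple policy: retry on 429 and 503, and fix the request or account for everything else. A possible sketch of that policy follows. `parse_error` and `retry_delay` are hypothetical helpers. The code assumes Retry-After carries a number of seconds, and the exponential fallback capped at 30 seconds is an arbitrary choice, not something Thermly specifies.

```python
import json
from typing import Mapping, Optional

RETRYABLE_STATUSES = {429, 503}  # per the table: rate limited, or models unavailable


def parse_error(body_text: str) -> dict:
    """Return the inner "error" object, or {} if the body is not in the documented shape."""
    try:
        payload = json.loads(body_text)
    except ValueError:
        return {}
    error = payload.get("error") if isinstance(payload, dict) else None
    return error if isinstance(error, dict) else {}


def retry_delay(status: int, headers: Mapping[str, str], attempt: int) -> Optional[float]:
    """Seconds to wait before retrying, or None if the request should not be retried.

    `attempt` counts from 0. Retry-After is assumed to be in seconds.
    """
    if status not in RETRYABLE_STATUSES:
        return None
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    retry_after = lowered.get("retry-after")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # not a plain number; fall back to backoff below
    return min(2.0 ** attempt, 30.0)  # assumed backoff: 1s, 2s, 4s, ... capped at 30s
```

In a request loop you would sleep for the returned delay and try again, and surface `parse_error(body)["message"]` to the caller when the delay is `None`.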
Each API key has a per-minute request limit. If you exceed it, you'll receive a 429 response with a Retry-After header.
| Limit | Value |
|---|---|
| Requests per minute | 60 RPM per API key |
| Max messages per request | 100 |
| Max message length | 50,000 characters |
| Max output tokens | 16,384 |
Need higher limits? Contact us.
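If you send requests in bulk, pacing them on the client avoids most 429 responses. The sketch below keeps a sliding window of send times for one API key. `MinuteRateLimiter` is an illustrative class, and whether Thermly counts requests in a sliding or a fixed window is not stated here, so treat the 60-second sliding window as an assumption.

```python
import time
from collections import deque
from typing import Callable, Deque


class MinuteRateLimiter:
    """Client-side pacing for the 60 RPM per-key limit (sliding window assumed)."""

    def __init__(
        self,
        max_requests: int = 60,
        window: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self._sent: Deque[float] = deque()  # send times inside the current window

    def wait_time(self) -> float:
        """Seconds until another request may be sent (0.0 if one may be sent now)."""
        now = self.clock()
        # Forget requests that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window - (now - self._sent[0])

    def record(self) -> None:
        """Note that a request was just sent."""
        self._sent.append(self.clock())
```

Before each request, call `time.sleep(limiter.wait_time())`, send the request, then call `limiter.record()`. The server's Retry-After header should still take priority if a 429 comes back.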
When using smart routing models (auto, fast, balanced, best), identical requests may return cached responses at zero cost. Cached responses are indicated by x_thermostat.cached: true in the response.
Send x-no-cache: true with your request to force a fresh response.
Requests that name a specific model (e.g. gpt-4o) are never cached.

Questions? Issues? Reach out to support@thermly.net or visit your dashboard.