Swan Inference API

Call 42+ AI Models via an OpenAI-Compatible API

Swan Inference provides an OpenAI-compatible REST API for accessing decentralized AI models. If you've used the OpenAI API or any OpenAI-compatible client, you already know how to use Swan Inference — just change the base URL and API key.

Base URL: https://inference.swanchain.io

Quick Start

1. Get an API Key

Sign up at inference.swanchain.io to get your API key. Keys use the sk-swan- prefix.

2. Make Your First Request

curl https://inference.swanchain.io/v1/chat/completions \
  -H "Authorization: Bearer sk-swan-YOUR-API-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-70b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Swan Chain?"}
    ]
  }'

That's it — any library or tool that supports the OpenAI API format works with Swan Inference.
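For example, the official `openai` Python SDK works as a drop-in client; note the `/v1` suffix on the base URL (a minimal sketch, requires the `openai` package and a valid key):

```python
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://inference.swanchain.io/v1",
    api_key="sk-swan-YOUR-API-KEY",
)

resp = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "What is Swan Chain?"}],
)
print(resp.choices[0].message.content)
```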


Authentication

All API requests require an API key passed in the Authorization header:

| Key Prefix | Purpose |
| --- | --- |
| `sk-swan-*` | Consumer API key, used to make inference requests |
| `sk-prov-*` | Provider API key, used by GPU providers connecting to the network |


API Endpoints

List Models

Retrieve all available models and their current status with GET /v1/models.

Response:
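The response follows the OpenAI list format; a representative body (values illustrative):

```json
{
  "object": "list",
  "data": [
    {
      "id": "deepseek-r1-distill-llama-70b",
      "object": "model",
      "owned_by": "swan"
    }
  ]
}
```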

You can also browse the full model catalog with pricing at inference.swanchain.io/models.


Chat Completions

Generate chat-based text responses with POST /v1/chat/completions. This is the primary endpoint for interacting with LLMs.

Request Body:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `model` | string | Yes | Model ID (e.g., `deepseek-r1-distill-llama-70b`) |
| `messages` | array | Yes | Array of message objects with `role` and `content` |
| `temperature` | float | No | Sampling temperature (0-2). Default: 1.0 |
| `max_tokens` | integer | No | Maximum tokens to generate. Default: model-dependent |
| `stream` | boolean | No | Enable streaming responses. Default: false |
| `top_p` | float | No | Nucleus sampling threshold. Default: 1.0 |
| `stop` | string/array | No | Stop sequence(s) |
| `frequency_penalty` | float | No | Frequency penalty (-2 to 2). Default: 0 |
| `presence_penalty` | float | No | Presence penalty (-2 to 2). Default: 0 |

Example — Standard Request:
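A representative request body using the parameters above (values illustrative):

```json
{
  "model": "deepseek-r1-distill-llama-70b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Swan Chain?"}
  ],
  "temperature": 0.7,
  "max_tokens": 512
}
```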

Response:
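A representative OpenAI-style completion body (IDs, timestamps, and token counts illustrative):

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "deepseek-r1-distill-llama-70b",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Swan Chain is ..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 21, "completion_tokens": 84, "total_tokens": 105}
}
```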


Streaming

Enable real-time token-by-token responses by setting stream: true. The response uses Server-Sent Events (SSE).

Stream Response Format:

Each SSE event contains a JSON chunk:
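As an illustration, the sketch below parses captured SSE lines into content tokens, assuming the standard OpenAI chunk shape (`choices[0].delta.content`); the sample `events` list shows what the `data:` lines look like on the wire:

```python
import json

def iter_stream_content(sse_lines):
    """Yield content tokens from OpenAI-style SSE chat chunks."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Captured SSE lines (illustrative):
events = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":" world"}}]}',
    'data: [DONE]',
]
print("".join(iter_stream_content(events)))  # Hello world
```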


Embeddings

Generate vector embeddings for text. Useful for search, similarity, and RAG applications.

Request Body:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `model` | string | Yes | Embedding model ID |
| `input` | string/array | Yes | Text to embed (string or array of strings) |

Example:
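A representative request body (the model ID `bge-large-en-v1.5` is illustrative; check the model catalog for exact IDs):

```json
{
  "model": "bge-large-en-v1.5",
  "input": "Decentralized GPU compute"
}
```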

Response:
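A representative OpenAI-style embeddings response (vector truncated, token counts illustrative):

```json
{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [0.0123, -0.0456, "..."]}
  ],
  "model": "bge-large-en-v1.5",
  "usage": {"prompt_tokens": 5, "total_tokens": 5}
}
```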


Image Generation

Generate images from text prompts.

Request Body:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `model` | string | Yes | Image model ID (e.g., `flux-1-schnell`) |
| `prompt` | string | Yes | Text description of the image to generate |
| `n` | integer | No | Number of images to generate. Default: 1 |
| `size` | string | No | Image size (e.g., `1024x1024`) |

Example:
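A representative request body (values illustrative):

```json
{
  "model": "flux-1-schnell",
  "prompt": "A watercolor painting of a swan on a mountain lake",
  "n": 1,
  "size": "1024x1024"
}
```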

Response:
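A representative response following the OpenAI images format (shape illustrative; depending on the model, images may be returned as URLs or base64 data):

```json
{
  "created": 1700000000,
  "data": [
    {"url": "https://..."}
  ]
}
```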


Audio Transcription

Transcribe audio files to text.

Request Body (multipart/form-data):

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `file` | file | Yes | Audio file (mp3, mp4, wav, webm, etc.) |
| `model` | string | Yes | Audio model ID (e.g., `whisper-large-v3`) |
| `language` | string | No | Language code (e.g., `en`) |

Example:
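A representative curl request; the `/v1/audio/transcriptions` path follows the OpenAI convention and the file name is illustrative:

```shell
curl https://inference.swanchain.io/v1/audio/transcriptions \
  -H "Authorization: Bearer sk-swan-YOUR-API-KEY" \
  -F file=@meeting.mp3 \
  -F model=whisper-large-v3 \
  -F language=en
```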

Response:
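A representative Whisper-style response (text illustrative):

```json
{
  "text": "Hello and welcome to the meeting."
}
```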


Supported Models

Swan Inference hosts 42+ models across five categories:

| Category | Models | Pricing |
| --- | --- | --- |
| LLM | DeepSeek R1 (70B), Llama 3 (3B, 8B, 70B), Qwen 2.5, Mistral, Phi-3 | Per input/output token |
| Image | FLUX.1 Schnell, Stable Diffusion XL | Per request |
| Audio | Whisper Large V3 | Per request |
| Embedding | BGE Large, E5 Large | Per token |
| Multimodal | Llama 3.2 Vision, Qwen-VL | Per token |

Note: Model availability depends on online providers. Check real-time status at inference.swanchain.io/models or call GET /v1/models.


Rate Limits

Requests are rate-limited per API key:

| Model Category | Requests per Minute |
| --- | --- |
| LLM | 200 |
| Image | 60 |
| Embedding | 500 |
| Other | 200 |

Maximum concurrent requests: 100 per API key.

When rate-limited, the API returns HTTP 429 Too Many Requests with a Retry-After header.
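On the client side, a 429 can be handled by honoring the Retry-After hint and otherwise backing off exponentially. A minimal sketch, where `send` is a hypothetical callable returning `(status, headers, body)`:

```python
import time

def with_retries(send, max_retries=2, base_delay=1.0):
    """Call send() until it succeeds or retries are exhausted.

    `send` is a hypothetical stand-in for issuing one HTTP request;
    it returns (status_code, headers, body).
    """
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_retries:
            return status, body
        # Prefer the server's Retry-After hint; otherwise back off exponentially.
        delay = float(headers.get("Retry-After", base_delay * 2 ** attempt))
        time.sleep(delay)
```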


Request Limits

| Parameter | Limit |
| --- | --- |
| Max input tokens (LLM) | 128,000 |
| Max output tokens (LLM) | 16,384 |
| Max input tokens (Embedding) | 8,192 |
| Max request body size | 10 MB |
| Max messages per request | 100 |
| Max message length | 100,000 characters |
| Request timeout | 120 seconds |


Error Handling

The API returns standard HTTP error codes with JSON error bodies:

| Status Code | Meaning |
| --- | --- |
| 400 | Bad request: check your request body |
| 401 | Unauthorized: invalid or missing API key |
| 404 | Model not found or no providers available |
| 429 | Rate limit exceeded: slow down |
| 500 | Internal server error |
| 503 | Service unavailable: all providers busy |
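A representative OpenAI-style error body for a 429 (field names and message illustrative):

```json
{
  "error": {
    "message": "Rate limit exceeded. Please retry later.",
    "type": "rate_limit_error",
    "code": 429
  }
}
```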

The platform automatically retries failed requests (up to 2 retries with exponential backoff) when a provider is temporarily unavailable, so most transient errors are handled transparently.


Response Headers

Swan Inference includes helpful headers in every response:

| Header | Description |
| --- | --- |
| `X-Request-ID` | Unique request correlation ID for tracing |
| `X-Swan-Connection-Mode` | How the request was routed: `websocket` or `external` |

Use X-Request-ID when contacting support or debugging request issues.


Using with LLM Frameworks

Swan Inference works with any framework that supports OpenAI-compatible APIs.

LangChain (Python)
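A minimal sketch using the `langchain-openai` package (parameters per LangChain's OpenAI chat integration; model ID taken from the quick start):

```python
from langchain_openai import ChatOpenAI  # pip install langchain-openai

llm = ChatOpenAI(
    base_url="https://inference.swanchain.io/v1",
    api_key="sk-swan-YOUR-API-KEY",
    model="deepseek-r1-distill-llama-70b",
)
print(llm.invoke("What is Swan Chain?").content)
```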

LlamaIndex
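A minimal sketch using LlamaIndex's OpenAI-compatible LLM wrapper (the `llama-index-llms-openai-like` package; parameter names per that integration):

```python
from llama_index.llms.openai_like import OpenAILike  # pip install llama-index-llms-openai-like

llm = OpenAILike(
    api_base="https://inference.swanchain.io/v1",
    api_key="sk-swan-YOUR-API-KEY",
    model="deepseek-r1-distill-llama-70b",
    is_chat_model=True,
)
print(llm.complete("What is Swan Chain?"))
```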

LiteLLM
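A minimal sketch using LiteLLM; the `openai/` prefix routes the request through LiteLLM's OpenAI-compatible driver:

```python
import litellm  # pip install litellm

response = litellm.completion(
    model="openai/deepseek-r1-distill-llama-70b",
    api_base="https://inference.swanchain.io/v1",
    api_key="sk-swan-YOUR-API-KEY",
    messages=[{"role": "user", "content": "What is Swan Chain?"}],
)
print(response.choices[0].message.content)
```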

Vercel AI SDK (TypeScript)
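A minimal sketch using the Vercel AI SDK's OpenAI provider pointed at the Swan base URL (packages `ai` and `@ai-sdk/openai`; the SDK also offers an `@ai-sdk/openai-compatible` provider for third-party endpoints):

```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const swan = createOpenAI({
  baseURL: "https://inference.swanchain.io/v1",
  apiKey: "sk-swan-YOUR-API-KEY",
});

const { text } = await generateText({
  model: swan("deepseek-r1-distill-llama-70b"),
  prompt: "What is Swan Chain?",
});
console.log(text);
```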


Pricing

| Category | Pricing Unit | Billed In |
| --- | --- | --- |
| LLM | Per input token + per output token | USDC |
| Embedding | Per token | USDC |
| Image | Per request | USDC |
| Audio | Per request | USDC |

View current pricing for each model at inference.swanchain.io/models.

Token usage is included in every response under the usage field.


Network Stats

Public endpoints are available for monitoring network health:

| Endpoint | Description |
| --- | --- |
| `GET /api/v1/stats/network` | Aggregate network stats (providers, requests, capacity) |
| `GET /api/v1/stats/leaderboard` | Provider leaderboard ranked by performance |
| `GET /api/v1/stats/gpu` | GPU distribution and VRAM capacity across the network |
| `GET /api/v1/stats/utilization` | Network utilization metrics |
| `GET /api/v1/stats/model-demand` | Model demand data (useful for providers choosing which models to serve) |
| `GET /api/v1/dashboard/summary` | Dashboard summary with request and capacity metrics |

These endpoints do not require authentication.
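For example, aggregate network stats can be fetched with a plain curl call, with no Authorization header:

```shell
curl https://inference.swanchain.io/api/v1/stats/network
```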

