Text to speech

POST /v1/audio/speech
Generate speech from text

Synthesizes speech with Kokoro-82M; a sentence takes about a second. The request and response match OpenAI's /v1/audio/speech, so the official OpenAI SDKs work once their base URL points at https://runanything.ai/v1.

curl https://runanything.ai/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro-82m",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "bm_george",
    "response_format": "wav",
    "speed": 1.1
  }' \
  --output speech.wav

Request body

Parameter	Type	Required	Description
`input`	string	Yes	The text to speak. 1–4,096 characters.
`voice`	string	Yes	A Kokoro voice id (`af_heart`, `bm_george`; see the full list) or an OpenAI voice name (`alloy`, `nova`, `onyx`, …).
`model`	string	No	`kokoro-82m` (default). `tts-1`, `tts-1-hd`, and `gpt-4o-mini-tts` are accepted aliases for compatibility.
`response_format`	string	No	`mp3` (default), `wav`, `aac`, or `pcm`. `opus` and `flac` aren't supported yet and return a 400.
`speed`	number	No	Playback speed, 0.25–4.0. Default 1.0.
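
The limits in the table can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: `build_speech_request` and `SUPPORTED_FORMATS` are hypothetical names, and it assumes the server counts characters the way Python's `len()` does, which the reference does not state.

```python
# Formats listed as supported in the request-body table.
SUPPORTED_FORMATS = {"mp3", "wav", "aac", "pcm"}


def build_speech_request(text, voice, response_format="mp3", speed=1.0,
                         model="kokoro-82m"):
    """Return a JSON-ready request body, or raise ValueError.

    Hypothetical helper: it mirrors the documented limits so that obvious
    mistakes fail locally instead of coming back as a 400.
    """
    # Assumption: the 1-4,096 limit is measured in code points, like len().
    if not 1 <= len(text) <= 4096:
        raise ValueError("input must be 1 to 4,096 characters")
    if not voice:
        raise ValueError("voice is required")
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
```

The returned dictionary can be serialized with `json.dumps` and sent as the POST body shown in the curl example.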

Output formats

mp3 — 96 kbps mono. The default, and what OpenAI SDK code expects when it doesn't pass a format.
wav — 16-bit PCM, 24 kHz mono, standard RIFF file. Largest output, zero decode cost.
aac — ADTS AAC. Smallest output at comparable quality.
pcm — raw signed 16-bit little-endian samples, 24 kHz mono, no container. The only streaming format — see below.
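
Because pcm and wav carry the same sample format (16-bit, 24 kHz, mono), a saved pcm stream can be wrapped into a playable WAV file locally. This is a sketch using Python's standard `wave` module; `pcm_to_wav` and `pcm_duration_seconds` are hypothetical helper names, not part of the API.

```python
import io
import wave

SAMPLE_RATE = 24000   # Hz, as documented for pcm and wav
SAMPLE_WIDTH = 2      # bytes per sample (signed 16-bit)
CHANNELS = 1          # mono


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw s16le mono PCM in a RIFF/WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buf.getvalue()


def pcm_duration_seconds(byte_count: int) -> float:
    """Length of a clip in seconds, given the size of its raw PCM data."""
    return byte_count / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)
```

At this sample format one second of audio is 48,000 bytes, so `pcm_duration_seconds` also gives a quick way to estimate clip length from a download's size.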

Streaming with pcm

With response_format: "pcm" the response body streams raw audio as it's synthesized — the first bytes arrive before the full clip exists, which is what you want for assistants and anything conversational. The response includes X-Sample-Rate: 24000; samples are s16le mono.

import pyaudio
from openai import OpenAI

client = OpenAI(base_url="https://runanything.ai/v1", api_key="YOUR_API_KEY")

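# Open an output stream matching the API's PCM format: s16le, mono, 24 kHz.
audio = pyaudio.PyAudio()
player = audio.open(
    format=pyaudio.paInt16, channels=1, rate=24000, output=True
)

try:
    with client.audio.speech.with_streaming_response.create(
        model="kokoro-82m",
        voice="af_heart",
        input="Audio starts playing before this sentence finishes generating.",
        response_format="pcm",
    ) as response:
        # Play each chunk as it arrives instead of waiting for the whole clip.
        for chunk in response.iter_bytes(chunk_size=4096):
            player.write(chunk)
finally:
    # Release the audio device even if the request fails midway.
    player.stop_stream()
    player.close()
    audio.terminate()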

const res = await fetch("https://runanything.ai/v1/audio/speech", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "kokoro-82m",
    voice: "af_heart",
    input: "Audio starts playing before this sentence finishes generating.",
    response_format: "pcm",
  }),
});

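// Error responses carry an error body, not audio, so stop before reading.
if (!res.ok) throw new Error(`Speech request failed: ${res.status}`);

// res.body is a ReadableStream of raw s16le mono PCM at 24 kHz
const reader = res.body.getReader();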
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  feedToAudioPipeline(value); // e.g. an AudioWorklet or speaker stream
}
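
When chunks are read straight off the network, nothing guarantees that a chunk ends on a sample boundary, so a 16-bit sample can be split across two reads. The sketch below shows one way to handle that before passing audio to a pipeline that expects whole samples or floats. `PcmChunkAligner` and `s16le_to_floats` are hypothetical helpers, not part of the API or any SDK.

```python
import struct


class PcmChunkAligner:
    """Hold back a trailing odd byte so each returned block has whole samples."""

    def __init__(self):
        self._carry = b""

    def feed(self, chunk: bytes) -> bytes:
        data = self._carry + chunk
        usable = len(data) - (len(data) % 2)  # largest even prefix
        self._carry = data[usable:]           # zero or one leftover byte
        return data[:usable]


def s16le_to_floats(block: bytes) -> list[float]:
    """Convert whole s16le samples to floats in the range [-1.0, 1.0)."""
    count = len(block) // 2
    samples = struct.unpack(f"<{count}h", block[: count * 2])
    return [s / 32768.0 for s in samples]
```

Call `feed` on every chunk in arrival order and pass the result on; a chunk that ends mid-sample simply yields its last byte on the next call.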

The other formats (mp3, wav, aac) are delivered as complete files, which is fine for clips you save or play whole. If time-to-first-audio matters, use pcm.