Speech-to-Text

Transcribe audio to text. Drop-in replacement for OpenAI's POST /v1/audio/transcriptions endpoint, with automatic fallback between OpenAI Whisper and SiliconFlow SenseVoice.

POST /v1/audio/transcriptions

Content type

multipart/form-data. The audio file is sent as a form field named file.

Form fields

Name | Type | Required | Description
model | string | Required | Standard model alias. Must declare transcription capability. See supported models below.
file | file | Required | Audio file to transcribe. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. Max 25 MB.
language | string | Optional | ISO-639-1 language code (e.g. en, zh, fr). Auto-detected when omitted. Providing the language improves accuracy and latency.
prompt | string | Optional | Context text to guide the model. Useful for proper nouns or domain-specific terminology.
response_format | string | Optional | Output format: json (default), text, srt, verbose_json, vtt.
temperature | number | Optional | Sampling temperature between 0 and 1. Higher values increase variation. Default 0.
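The constraints in the table can be checked on the client before uploading. The following is a minimal sketch, not part of the gateway or any SDK: build_form_fields is a hypothetical helper, and it assumes "25 MB" means 25 * 1024 * 1024 bytes and that the file extension reflects the audio format.

```python
import os
from typing import Dict, Optional

# Values copied from the form fields table above.
SUPPORTED_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}
MAX_FILE_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" is read as 25 MiB


def build_form_fields(
    model: str,
    filename: str,
    size_bytes: int,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    response_format: str = "json",
    temperature: float = 0.0,
) -> Dict[str, str]:
    """Validate the inputs and return the non-file form fields (hypothetical helper)."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    if extension not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {extension!r}")
    if size_bytes > MAX_FILE_BYTES:
        raise ValueError("audio file exceeds the 25 MB limit")
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")

    fields = {
        "model": model,
        "response_format": response_format,
        "temperature": str(temperature),
    }
    # Optional fields are left out entirely when not provided.
    if language is not None:
        fields["language"] = language
    if prompt is not None:
        fields["prompt"] = prompt
    return fields


fields = build_form_fields("openai/whisper-1", "audio.mp3", 1_000_000, language="en")
print(fields)
```

The returned dictionary holds only the text fields; the audio itself would still be attached separately as the multipart field named file.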

Response

For response_format=json (default), returns a JSON object with a text field containing the transcript and a duration field (seconds, used for billing). Status 200 on success.
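As an illustration of the default response shape, the sketch below parses a JSON body into its text and duration fields. The sample body is made up for the example, and parse_transcription and Transcription are hypothetical names; the real response may carry additional fields that this sketch ignores.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcription:
    text: str
    duration_seconds: Optional[float]  # None when the body has no duration field


def parse_transcription(body: str) -> Transcription:
    """Parse a response_format=json body into its text and duration fields."""
    payload = json.loads(body)
    duration = payload.get("duration")
    return Transcription(
        text=payload["text"],
        duration_seconds=float(duration) if duration is not None else None,
    )


# Hypothetical response body, for illustration only.
sample_body = '{"text": "Hello, world.", "duration": 3.2}'
result = parse_transcription(sample_body)
print(result.text, result.duration_seconds)
```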

Quick start

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.therouter.ai/v1",  # the SDK appends /audio/transcriptions
    api_key="YOUR_API_KEY",
)

with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-1",
        file=f,
    )

print(transcript.text)

Using SenseVoice for Chinese/English

For Chinese or multilingual audio, alibaba/sensevoice-small (SiliconFlow) delivers emotion-aware transcription with strong Chinese support. It is also the automatic fallback when openai/whisper-1 is unavailable.

Python
# SiliconFlow SenseVoice: emotion-aware, optimised for Chinese/English.
# Reuses the client created in the quick start.
with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="alibaba/sensevoice-small",
        file=f,
        language="zh",  # optional; auto-detected if omitted
        response_format="json",
    )
print(transcript.text)

Supported models

Model | Price | Description
openai/whisper-1 | $0.006/min | OpenAI Whisper-1. General-purpose multilingual transcription. Automatic fallback to SenseVoice (SiliconFlow) at priority 1.
alibaba/sensevoice-small | $0.006/min | SiliconFlow SenseVoice Small. Emotion-aware, strong Chinese/English accuracy. Also serves as fallback for openai/whisper-1.

Billing is per minute of audio duration. The gateway reads the duration field from the upstream JSON response; if the field is absent (e.g. a raw text response format), the request is not billed.
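To make the billing rule concrete, here is a rough cost estimate based on the duration field. It assumes charges are pro-rated by the second with no rounding or minimum, which this page does not confirm; estimate_cost_usd is a hypothetical helper, not a gateway API.

```python
from typing import Optional

PRICE_PER_MINUTE_USD = 0.006  # listed rate for both supported models


def estimate_cost_usd(duration_seconds: Optional[float]) -> float:
    """Estimate the charge for one request from the response's duration field.

    Assumption: billing is pro-rated by the second, with no rounding or minimum.
    """
    if duration_seconds is None:
        # No duration field in the response, so the request is not billed.
        return 0.0
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return duration_seconds / 60.0 * PRICE_PER_MINUTE_USD


# A 10-minute recording at $0.006/min comes to roughly $0.06.
print(f"{estimate_cost_usd(600):.4f}")
```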