Qwen3-TTS-Flash - Speech Synthesis

Supported Model Code

Qwen3-TTS-Flash (qwen3-tts-flash)

Qwen3-TTS-Flash is Tongyi's latest offline text-to-speech foundation model. It offers 17 expressive voices and low-latency, high-stability audio synthesis. It supports multilingual and dialect output with consistent voice characteristics across languages. Trained on massive datasets, it automatically adjusts vocal tone based on the semantics of the text and handles complex content robustly. This model is provided as a snapshot version, functionally equivalent to snapshot qwen3-tts-flash-2025-11-27.

https://bailian.console.alibabacloud.com/cn-beijing?tab=model#/model-market/detail/qwen3-tts-flash?serviceSite=asia-pacific-china

Speech synthesis

https://api.modelstream.ai

POST /v1/audio/speech

Authentication

BearerAuth

Authentication: Bearer <token>

All API requests must be authenticated using a Bearer token in the Authorization header. Please ensure your API key is active.

Authorization: Bearer sk-xxxxxx

Parameter Location: Header
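
As a minimal sketch of the header described above, the following Python helper builds the Authorization and Content-Type headers for a request. The function name `auth_headers` and the environment variable `MODELSTREAM_API_KEY` are hypothetical and chosen only for illustration; this page does not define them.

```python
import os


def auth_headers(api_key: str) -> dict[str, str]:
    """Build headers for an authenticated JSON request (illustrative helper)."""
    key = api_key.strip()
    if not key:
        raise ValueError("API key is empty")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }


# MODELSTREAM_API_KEY is an assumed variable name, not one defined by the API.
headers = auth_headers(os.environ.get("MODELSTREAM_API_KEY", "sk-xxxxxx"))
```

Reading the key from the environment keeps it out of source files; any other secret store works equally well.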

Request Body

application/json

These parameters come from the selected model's form_schema. Switching models updates this list and the request example.

text (string, required)

The text to be converted to speech.

Example Value: In this picture, a woman and a dog are playing on the beach. The background features waves lapping at the shore, and the sky is a bit pale—it should be sunrise or sunset. If you'd like to know anything else, just ask!

voice (string, optional)

The voice used for speech synthesis.

Example Value: Cherry

Enum/Options:

Cherry (Intellectual Female): Cherry
Chelsie (American English Female): Chelsie
Zhibao (Warm Male): Zhibao
Bingjiao (Lively Female): Bingjiao
Stella (British English Female): Stella
Aiden (American English Male): Aiden

language_type (string, optional)

Language of the input text. Auto detection works for mixed Chinese-English text.

Example Value: Auto

Enum/Options:

Auto
Chinese
English

format (string, optional)

Encoding format of the output audio.

Example Value: pcm

Enum/Options:

MP3: mp3
PCM (Raw): pcm

sample_rate (string, optional)

Audio sample rate in Hz. Higher values give better quality.

Example Value: 24000

Enum/Options:

24 kHz (High Quality): 24000
16 kHz (Standard): 16000
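
Raw PCM has no header, so a player cannot know the sample rate unless it is told. The sketch below wraps PCM bytes in a WAV container using Python's standard `wave` module and computes the clip duration. It assumes 16-bit mono samples; this page states neither the bit depth nor the channel count, nor how the audio bytes are delivered in the response, so treat those defaults as placeholders to verify.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM in a WAV header (16-bit mono is an assumption)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)  # bytes per sample
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()


def pcm_duration_seconds(pcm: bytes, sample_rate: int = 24000,
                         channels: int = 1, sample_width: int = 2) -> float:
    """Duration implied by the byte count under the same assumptions."""
    return len(pcm) / (sample_rate * channels * sample_width)
```

Under these assumptions, one second of audio is 48,000 bytes at 24000 Hz and 32,000 bytes at 16000 Hz.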

volume (number, optional)

Volume level of the output audio; 1.0 is standard.

Example Value: 1
Value Range: 0.1 ≤ value ≤ 2
Step: 0.1

speed (number, optional)

Speech playback speed; 1.0 is standard.

Example Value: 1
Value Range: 0.5 ≤ value ≤ 2
Step: 0.1

pitch (number, optional)

Voice pitch level; 1.0 is standard.

Example Value: 1
Value Range: 0.5 ≤ value ≤ 2
Step: 0.1

enable_timestamp (boolean, optional)

Whether to return word-level timestamp information in the response

Example Value: false
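
The options and ranges listed above can be checked on the client before a request is sent. The following is a sketch, not part of the API: `build_payload` is a hypothetical helper, and the JSON keys `input` and `response_format` follow the request example on this page, while the parameter list names the same fields `text` and `format`. Which spelling the server accepts is not confirmed here. The 0.1 step is not enforced.

```python
# Allowed values copied from the parameter list on this page.
VOICES = {"Cherry", "Chelsie", "Zhibao", "Bingjiao", "Stella", "Aiden"}
LANGUAGE_TYPES = {"Auto", "Chinese", "English"}
FORMATS = {"mp3", "pcm"}
SAMPLE_RATES = {16000, 24000}
RANGES = {"volume": (0.1, 2.0), "speed": (0.5, 2.0), "pitch": (0.5, 2.0)}


def _check_choice(name, value, allowed):
    if value not in allowed:
        raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")


def build_payload(text, voice="Cherry", language_type="Auto",
                  audio_format="mp3", sample_rate=24000,
                  volume=1.0, speed=1.0, pitch=1.0, enable_timestamp=False):
    """Validate arguments and return a request body dict (illustrative)."""
    if not text:
        raise ValueError("text is required")
    _check_choice("voice", voice, VOICES)
    _check_choice("language_type", language_type, LANGUAGE_TYPES)
    _check_choice("format", audio_format, FORMATS)
    _check_choice("sample_rate", sample_rate, SAMPLE_RATES)
    for name, value in (("volume", volume), ("speed", speed), ("pitch", pitch)):
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}, got {value!r}")
    return {
        "model": "qwen3-tts-flash",
        "input": text,                     # key name taken from the request example
        "voice": voice,
        "language_type": language_type,
        "response_format": audio_format,   # key name taken from the request example
        "sample_rate": sample_rate,
        "volume": volume,
        "speed": speed,
        "pitch": pitch,
        "enable_timestamp": enable_timestamp,
    }
```

For example, `build_payload("Hello", voice="Stella", speed=1.2)` returns a body ready to serialize as JSON, and `build_payload("Hello", volume=3)` raises `ValueError`.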

Response Parameters

application/json

200 Success: audio generated

# The JSON body is passed through a quoted heredoc because the text contains an
# apostrophe, which would terminate a single-quoted -d argument.
curl -X POST "https://api.modelstream.ai/v1/audio/speech" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d @- <<'EOF'
{
  "model": "qwen3-tts-flash",
  "input": "In this picture, a woman and a dog are playing on the beach. The background features waves lapping at the shore, and the sky is a bit pale—it should be sunrise or sunset. If you'd like to know anything else, just ask!",
  "voice": "Cherry",
  "language_type": "Auto",
  "response_format": "mp3",
  "sample_rate": 24000,
  "volume": 1,
  "speed": 1,
  "pitch": 1,
  "enable_timestamp": false
}
EOF
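
The same request can be built in Python with only the standard library. This is a sketch under assumptions: the helper names `make_request` and `synthesize` are illustrative, the body uses the key names from the curl example, and the response is returned as raw bytes because this page documents the response only as a string.

```python
import json
import urllib.request

API_URL = "https://api.modelstream.ai/v1/audio/speech"

# A short body using the same keys as the curl example.
EXAMPLE_BODY = {
    "model": "qwen3-tts-flash",
    "input": "If you'd like to know anything else, just ask!",
    "voice": "Cherry",
    "language_type": "Auto",
    "response_format": "mp3",
    "sample_rate": 24000,
}


def make_request(token: str, body: dict) -> urllib.request.Request:
    """Build the POST request without sending it."""
    data = json.dumps(body, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=data,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(token: str, body: dict, timeout: float = 60.0) -> bytes:
    """Send the request and return the raw response bytes."""
    with urllib.request.urlopen(make_request(token, body), timeout=timeout) as resp:
        return resp.read()

# Usage (performs a network call):
#   raw = synthesize("sk-xxxxxx", EXAMPLE_BODY)
```

Serializing with `json.dumps` avoids the shell quoting problems that apostrophes in the text can cause on the command line.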

"string"
