Qwen3-TTS-Flash is Tongyi's latest offline text-to-speech foundation model. It offers 17 expressive voices and low-latency, high-stability audio synthesis. It supports multilingual and dialect output while keeping voice characteristics consistent across languages. Trained on massive datasets, the model automatically adjusts vocal tone based on the semantics of the text and handles complex content robustly. This model is provided as a snapshot version and is functionally equivalent to the snapshot qwen3-tts-flash-2025-11-27.
Model page: https://bailian.console.alibabacloud.com/cn-beijing?tab=model#/model-market/detail/qwen3-tts-flash?serviceSite=asia-pacific-china
Speech synthesis
All API requests must be authenticated using a Bearer token in the Authorization header. Please ensure your API key is active.
Authorization: Bearer sk-xxxxxx
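The header format above can be sketched in Python as follows. This is a minimal illustration, not an official client: the helper name `build_auth_headers` is hypothetical, and the `Content-Type: application/json` entry is an assumption about the request body that this page does not confirm.

```python
def build_auth_headers(api_key: str) -> dict:
    """Return HTTP headers that carry the API key as a Bearer token."""
    key = api_key.strip()
    if not key:
        raise ValueError("API key must not be empty")
    return {
        # Format taken from the documentation: "Authorization: Bearer sk-xxxxxx"
        "Authorization": f"Bearer {key}",
        # Assumption: the request body is JSON; adjust if the API expects otherwise.
        "Content-Type": "application/json",
    }


# "sk-xxxxxx" is the placeholder from the documentation, not a real key.
headers = build_auth_headers("sk-xxxxxx")
```

The resulting dictionary can be passed to any HTTP client that accepts a header mapping.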
These parameters come from the selected model's form_schema. Switching models updates this list and the request example.
text (string, required)
voice (string, optional): Select the voice for speech synthesis
language_type (string, optional): Language type of the text; auto detection works for mixed Chinese-English scenarios
format (string, optional): Encoding format of the output audio
sample_rate (string, optional): Audio sample rate; higher values mean better quality
volume (number, optional): Volume level of the output audio; 1.0 is standard
speed (number, optional): Speech playback speed; 1.0 is standard
pitch (number, optional): Voice pitch level; 1.0 is standard
enable_timestamp (boolean, optional): Whether to return word-level timestamp information in the response
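As a rough sketch of how the parameter list might be used, the Python snippet below assembles a request body and checks each field against the types listed above. The helper `build_payload` and the table `PARAM_TYPES` are hypothetical names, and sending the parameters as a flat JSON object is an assumption; this page does not show the endpoint or the exact request layout, so no network call is made.

```python
import json

# Parameter names and types copied from the list above.
# "text" is the only required field; everything else is optional.
PARAM_TYPES = {
    "text": (str,),
    "voice": (str,),
    "language_type": (str,),
    "format": (str,),
    "sample_rate": (str,),
    "volume": (int, float),
    "speed": (int, float),
    "pitch": (int, float),
    "enable_timestamp": (bool,),
}


def build_payload(text, **options):
    """Validate parameter names and types, then return the request body as a dict."""
    if not isinstance(text, str) or not text:
        raise ValueError("text is required and must be a non-empty string")
    payload = {"text": text, **options}
    for name, value in payload.items():
        expected = PARAM_TYPES.get(name)
        if expected is None:
            raise KeyError(f"unknown parameter: {name}")
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and bool not in expected:
            raise TypeError(f"{name} must not be a boolean")
        if not isinstance(value, expected):
            raise TypeError(f"{name} has the wrong type: {type(value).__name__}")
    return payload


# Standard speed (1.0) with word-level timestamps requested.
body = json.dumps(build_payload("Hello, world.", speed=1.0, enable_timestamp=True))
```

Validating on the client side catches misspelled parameter names and type mix-ups before a request is sent; the allowed values for fields such as voice and format still need to be taken from the model's form_schema.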