© 2026 ModelStream Inc. All rights reserved.

Speech Synthesis

Qwen3-TTS-Flash (qwen3-tts-flash)

Qwen3-TTS-Flash is Tongyi's latest offline text-to-speech foundation model. It offers 17 expressive voices with low-latency, high-stability audio synthesis, and it supports multilingual and dialect output with consistent voice characteristics across languages. Trained on massive datasets, the model automatically adjusts vocal tone to the semantics of the text and handles complex content robustly. The model is provided as a snapshot and is functionally equivalent to snapshot qwen3-tts-flash-2025-11-27.

Model details: https://bailian.console.alibabacloud.com/cn-beijing?tab=model#/model-market/detail/qwen3-tts-flash?serviceSite=asia-pacific-china

Speech synthesis

Base URL: https://api.modelstream.ai
POST /v1/audio/speech

Authentication

BearerAuth
Authentication: Bearer <token>

All API requests must be authenticated with a Bearer token in the Authorization header. Make sure your API key is active.

Example: Authorization: Bearer sk-xxxxxx

Parameter Location: Header
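
The header described above can be built in Python as follows. This is a minimal sketch: the helper name build_headers and the environment variable MODELSTREAM_API_KEY are hypothetical choices for illustration, not part of the API.

```python
import os


def build_headers(api_key: str) -> dict[str, str]:
    """Return the HTTP headers for an authenticated JSON request."""
    if not api_key:
        raise ValueError("API key is empty")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


# MODELSTREAM_API_KEY is a hypothetical variable name chosen for this sketch;
# the fallback is the placeholder key shown in the documentation.
headers = build_headers(os.environ.get("MODELSTREAM_API_KEY", "sk-xxxxxx"))
```

Reading the key from the environment keeps it out of source control.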

Request Body

application/json

These parameters come from the selected model's form_schema. Switching models updates this list and the request example.

text (string, required)

Example Value: In this picture, a woman and a dog are playing on the beach. The background features waves lapping at the shore, and the sky is a bit pale—it should be sunrise or sunset. If you'd like to know anything else, just ask!
voice (string, optional)

Voice used for speech synthesis.

Example Value: Cherry
Enum/Options:
  • Cherry (Intellectual Female): Cherry
  • Chelsie (American English Female): Chelsie
  • Zhibao (Warm Male): Zhibao
  • Bingjiao (Lively Female): Bingjiao
  • Stella (British English Female): Stella
  • Aiden (American English Male): Aiden
language_type (string, optional)

Language of the input text. Auto detection handles mixed Chinese-English text.

Example Value: Auto
Enum/Options:
  • Auto
  • Chinese
  • English
format (string, optional)

Encoding format of the output audio.

Example Value: pcm
Enum/Options:
  • MP3: mp3
  • PCM (Raw): pcm
sample_rate (string, optional)

Audio sample rate in Hz. Higher values give better quality.

Example Value: 24000
Enum/Options:
  • 24 kHz (High Quality): 24000
  • 16 kHz (Standard): 16000
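
Raw PCM carries no header, so a player has to be told the sample rate separately. The sketch below wraps raw samples in a WAV container using Python's standard wave module. It assumes 16-bit mono samples, which this page does not state, and the helper name pcm_to_wav is hypothetical. How the raw bytes are obtained from the response is not covered here.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes.

    channels=1 and sample_width=2 (16-bit) are assumptions for this sketch;
    adjust them if the service documents a different layout.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)  # must match the requested sample_rate
        wav.writeframes(pcm)
    return buf.getvalue()


# One second of 16-bit mono silence at 24 kHz, used only as stand-in data.
silence = b"\x00\x00" * 24000
wav_bytes = pcm_to_wav(silence, sample_rate=24000)
```

Pass the same sample_rate you sent in the request; a mismatch changes playback speed and pitch.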
volume (number, optional)

Volume level of the output audio; 1.0 is standard.

Example Value: 1
Value Range: 0.1 ≤ value ≤ 2
Step: 0.1
speed (number, optional)

Speech playback speed; 1.0 is standard.

Example Value: 1
Value Range: 0.5 ≤ value ≤ 2
Step: 0.1
pitch (number, optional)

Voice pitch level; 1.0 is standard.

Example Value: 1
Value Range: 0.5 ≤ value ≤ 2
Step: 0.1
enable_timestamp (boolean, optional)

Whether to return word-level timestamp information in the response

Example Value: false
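
The options and ranges listed above can be checked on the client before sending a request. The following is a minimal sketch with a hypothetical helper, build_payload. Note that the request example on this page sends the text as "input" and the format as "response_format", and adds a "model" field, while the parameter list names them text and format. The sketch follows the request example, which is an assumption about what the endpoint accepts.

```python
VOICES = {"Cherry", "Chelsie", "Zhibao", "Bingjiao", "Stella", "Aiden"}
LANGUAGE_TYPES = {"Auto", "Chinese", "English"}
FORMATS = {"mp3", "pcm"}
SAMPLE_RATES = {16000, 24000}
# (minimum, maximum) taken from the documented value ranges.
RANGES = {"volume": (0.1, 2.0), "speed": (0.5, 2.0), "pitch": (0.5, 2.0)}


def build_payload(text, voice="Cherry", language_type="Auto",
                  audio_format="mp3", sample_rate=24000,
                  volume=1.0, speed=1.0, pitch=1.0, enable_timestamp=False):
    """Validate the documented options and return a JSON-ready dict."""
    if not text:
        raise ValueError("text is required")
    choices = (
        ("voice", voice, VOICES),
        ("language_type", language_type, LANGUAGE_TYPES),
        ("format", audio_format, FORMATS),
        ("sample_rate", sample_rate, SAMPLE_RATES),
    )
    for name, value, allowed in choices:
        if value not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    levels = {}
    for name, value in (("volume", volume), ("speed", speed), ("pitch", pitch)):
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
        levels[name] = round(value, 1)  # snap to the documented 0.1 step
    return {
        "model": "qwen3-tts-flash",
        "input": text,                     # key name follows the request example
        "voice": voice,
        "language_type": language_type,
        "response_format": audio_format,   # key name follows the request example
        "sample_rate": sample_rate,
        **levels,
        "enable_timestamp": enable_timestamp,
    }
```

For example, build_payload("Hello", voice="Stella", speed=1.2) returns a dict ready to serialize as the JSON request body, and an out-of-range value such as speed=3 raises ValueError before any network call is made.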

Response Parameters

application/json
200 Success

Request example (Generate Audio):
curl -X POST "https://api.modelstream.ai/v1/audio/speech" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "qwen3-tts-flash",
  "input": "In this picture, a woman and a dog are playing on the beach. The background features waves lapping at the shore, and the sky is a bit pale—it should be sunrise or sunset. If you'\''d like to know anything else, just ask!",
  "voice": "Cherry",
  "language_type": "Auto",
  "response_format": "mp3",
  "sample_rate": 24000,
  "volume": 1,
  "speed": 1,
  "pitch": 1,
  "enable_timestamp": false
}'
"string"
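
The curl request above can be sketched in Python with only the standard library. The helper names build_request and synthesize are hypothetical. The page documents the response as application/json with the example value "string" and does not say what that string contains, so the sketch simply returns the decoded JSON value.

```python
import json
import urllib.request

API_URL = "https://api.modelstream.ai/v1/audio/speech"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build the POST request without sending it."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(api_key: str, payload: dict, timeout: float = 60.0):
    """Send the request and return the decoded JSON response body.

    Raises urllib.error.HTTPError for non-2xx status codes.
    """
    request = build_request(api_key, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as synthesize("sk-xxxxxx", {"model": "qwen3-tts-flash", "input": "Hello", "voice": "Cherry"}) sends the request. Because the body is serialized with json.dumps, apostrophes and other special characters in the text need no manual escaping.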