
Qwen3 TTS: How to Use Preset Voices, Voice Cloning, and Voice Design

admin
Apr 29, 2026 · 8 min read

Most text-to-speech APIs hand you a dropdown of preset voices and call it a day. Qwen3 TTS goes further. Built on the Qwen3 LLM backbone, it offers three distinct modes: pick a preset voice for instant results, clone any voice from a 10-second audio sample, or describe a completely new voice in plain English and let the model generate it.

All three modes run through a single deAPI endpoint, support 10 languages, and ship under Apache 2.0 for commercial use. This guide walks through each mode with working code examples so you can start generating speech in minutes.

Three Ways to Generate Speech

Each mode serves a different workflow. Here’s how they compare:

| | CustomVoice | VoiceClone | VoiceDesign |
|---|---|---|---|
| What you provide | Pick from 9 presets | Upload 5-15s audio sample | Write a text description |
| Best for | Quick integration, production apps | Brand narrator, your own voice | Fictional characters, prototyping |
| Consistency | Identical every run | High (same reference = same voice) | Variable (same prompt ≈ similar voice) |
| API slug | `Qwen3_TTS_12Hz_1_7B_CustomVoice` | `Qwen3_TTS_12Hz_1_7B_Base` | `Qwen3_TTS_12Hz_1_7B_VoiceDesign` |

All three share the same pricing ($12.86 per 1M characters), the same 10-language support, and the same `POST /api/v2/audio/speech` endpoint. The 12 Hz audio token rate keeps inference fast and costs low compared to higher-frequency TTS architectures.

Preset Voices with CustomVoice

The fastest path to speech. Pick a voice, pass your text, get audio back.

Qwen3 TTS offers 9 voice identities, each available across 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian. The standout feature is cross-language consistency – Eric sounds like Eric whether he’s speaking English, German, or Japanese. Same timbre, appropriate accent for each language.

| Voice | Gender | Character | Good for |
|---|---|---|---|
| Vivian | F | Warm, conversational | Assistants, IVR |
| Serena | F | Elegant, calm | Audiobooks, narration |
| Ono_Anna | F | Youthful, clear | Product videos, tutorials |
| Sohee | F | Soft, gentle | Children's content, meditation |
| Eric | M | Standard, businesslike | Corporate, podcasts |
| Dylan | M | Young, energetic | Commercials, social media |
| Ryan | M | Casual, conversational | Podcasts, explainers |
| Aiden | M | Young, light | Tutorials, apps |
| Uncle_Fu | M | Mature, warm | Storytelling, mentorship |

One exception: Ryan is unavailable in Chinese. Every other voice works across all 10 languages.
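If you let users pick voice and language, it's worth guarding against that one gap before sending a request. A minimal sketch (the helper and constant names are my own, not part of the API; the availability data mirrors the table above):

```python
# Preset voices and supported languages, as listed in the table above.
LANGUAGES = ["English", "Chinese", "Japanese", "Korean", "German",
             "French", "Russian", "Portuguese", "Spanish", "Italian"]
VOICES = ["Vivian", "Serena", "Ono_Anna", "Sohee", "Eric",
          "Dylan", "Ryan", "Aiden", "Uncle_Fu"]
UNAVAILABLE = {("Ryan", "Chinese")}  # the one known voice/language gap

def voice_supports(voice: str, lang: str) -> bool:
    """Return True if the preset voice is available in the given language."""
    return voice in VOICES and lang in LANGUAGES and (voice, lang) not in UNAVAILABLE
```

Run this check client-side and you can show a friendly error instead of burning an API call on a combination that will fail.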

Here’s a working Python example:

import requests
import time

API_KEY = "your_api_key_here"
BASE = "https://api.deapi.ai/api/v2"
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Accept": "application/json"
}

# Generate speech with a preset voice
response = requests.post(f"{BASE}/audio/speech", headers=HEADERS, data={
    "text": "Welcome back to the show. Today we're diving into something fascinating.",
    "model": "Qwen3_TTS_12Hz_1_7B_CustomVoice",
    "lang": "English",
    "speed": 1.0,
    "format": "mp3",
    "sample_rate": 24000,
    "voice": "Eric"
})

request_id = response.json()["data"]["request_id"]

# Poll for the result
while True:
    result = requests.get(
        f"{BASE}/jobs/{request_id}",
        headers=HEADERS
    ).json()

    if result["data"]["status"] == "done":
        print(f"Audio URL: {result['data']['result_url']}")  
        break
    elif result["data"]["status"] == "error":
        print(f"Error: {result['data']}")  
        break

    time.sleep(1)

The LLM backbone gives CustomVoice an edge over traditional TTS engines when it comes to prosody. A rhetorical question gets different intonation than a statement, and the model handles pauses around em-dashes and ellipses naturally. Use punctuation as your primary pacing tool – periods for full stops, commas for breaths, `…` for suspense.

Clone Any Voice with VoiceClone

VoiceClone takes a short audio sample and reproduces that voice on any text you provide. The model picks up timbre, pacing, accent, and emotional style from the reference – all without any fine-tuning or training.

The standout capability is cross-lingual cloning. Record a 10-second English sample, and the cloned voice can speak all 10 supported languages while keeping its identity. A single recording powers your entire multi-language product.

Preparing Your Reference Audio

Clone quality depends roughly 70% on the reference and 30% on your text. Investing a few minutes in a good recording pays off across every generation.

Length: 8-12 seconds is the sweet spot. Below 5 seconds the model fills gaps with guesswork. Above 15 seconds yields diminishing returns.

Quality: Record in a quiet room with a decent microphone. Background noise, reverb, and echo get cloned along with the voice. A USB condenser mic in a treated room beats a phone recording in a coffee shop every time.

Content: One speaker, continuous speech, no long pauses. A reference with varied prosody – questions, statements, commas, periods – gives the model more information about how the speaker naturally sounds.

Style matching: The model clones emotional style alongside timbre. A calm, measured reference produces calm output even if your text has exclamation marks. Record your reference in the mood you want the output to carry.
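A quick pre-flight check on the reference file catches the most common problems before you upload. This sketch uses Python's standard `wave` module, so it only handles WAV input; the thresholds come from the guidance above, and the function name is my own:

```python
import wave

def check_reference(path: str) -> list[str]:
    """Flag common problems with a WAV reference before uploading it."""
    warnings = []
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
        channels = wav.getnchannels()
    if duration < 5:
        warnings.append(f"Only {duration:.1f}s - below 5s the model fills gaps with guesswork.")
    elif duration > 15:
        warnings.append(f"{duration:.1f}s - trim toward 8-12s; extra length adds little.")
    if channels > 1:
        warnings.append("Stereo file - consider downmixing to mono (one speaker).")
    return warnings
```

It won't detect background noise or reverb, but length and channel count are the two mistakes that are cheapest to catch automatically.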

Here’s the code for voice cloning:

import requests
import time

API_KEY = "your_api_key_here"
BASE = "https://api.deapi.ai/api/v2"
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Accept": "application/json"
}

# Clone a voice from a reference audio file
with open("reference_voice.wav", "rb") as audio_file:
    response = requests.post(f"{BASE}/audio/speech", headers=HEADERS,
        data={
            "text": "Today we're talking about something I've been obsessing over for weeks.",
            "model": "Qwen3_TTS_12Hz_1_7B_Base",
            "lang": "English",
            "speed": 1.0,
            "format": "mp3",
            "sample_rate": 24000
        },
        files={
            "ref_audio": ("reference.wav", audio_file, "audio/wav")
        }
    )

request_id = response.json()["data"]["request_id"]

# Poll for result (same pattern as above)
while True:
    result = requests.get(
        f"{BASE}/jobs/{request_id}",
        headers=HEADERS
    ).json()

    if result["data"]["status"] == "done":
        print(f"Audio URL: {result['data']['result_url']}")  
        break
    elif result["data"]["status"] == "error":
        print(f"Error: {result['data']}")  
        break

    time.sleep(1)

VoiceClone works well for brand narration, automated podcast intros in your own voice, and multi-language marketing where you want one consistent speaker across every market. A word on ethics: cloning someone’s voice without their consent is illegal in many jurisdictions, including under the EU AI Act. Only clone voices you own or have permission to use.

Design a Voice from Words with VoiceDesign

VoiceDesign takes the opposite approach from VoiceClone. Instead of providing an audio sample, you describe the voice you want in plain text, and the model generates a matching speaker from scratch.

The description (called `instruct` in the API) works across eight dimensions: gender, age, pitch, pace, emotion, accent, timbre, and personality. The more specific your description, the more consistent the output between runs. “Female voice” leaves enormous room for interpretation. “British female, 30s, mezzo-soprano, warm, slight RP accent, smooth timbre, audiobook narrator energy” narrows the space dramatically.

Here’s a template that produces reliable results:

[gender], [age], [pitch], [pace], [tone/emotion], [accent], [timbre], [personality/role]

Three examples that demonstrate the range:

Audiobook narrator: British male, 70s, warm and authoritative, low-medium pitch, measured pace, rich timbre, documentary-style gravitas, slight RP accent.

Commercial voiceover: American male, early 30s, high energy, bright timbre, medium-high pitch, fast pace, confident announcer voice.

Character voice: Elderly woman, 80s, raspy and mischievous, medium-low pitch, slow deliberate pace, slight Eastern European accent, gravelly timbre, fairy tale witch energy.

The model understands archetype references like “BBC documentary narrator” or “late-night radio DJ” and translates them into vocal characteristics. Avoid naming specific real people – stick to roles and archetypes.
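If you generate many characters, assembling descriptions programmatically keeps them consistent with the template. A tiny helper (the function name and signature are my own, not part of the API):

```python
def build_instruct(gender, age, pitch, pace, tone,
                   accent=None, timbre=None, role=None):
    """Assemble a VoiceDesign description following the template:
    [gender], [age], [pitch], [pace], [tone/emotion], [accent], [timbre], [personality/role].
    Optional fields are simply omitted when not provided."""
    parts = [gender, age, pitch, pace, tone, accent, timbre, role]
    return ", ".join(p for p in parts if p) + "."

# The audiobook narrator example from above, built from fields:
print(build_instruct("British male", "70s", "low-medium pitch", "measured pace",
                     "warm and authoritative",
                     timbre="rich timbre", role="documentary narrator"))
```

Storing the fields rather than free-form strings also makes it easy to vary one dimension (say, pace) while holding the rest of a character constant.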

Here’s the code:

import requests
import time

API_KEY = "your_api_key_here"
BASE = "https://api.deapi.ai/api/v2"
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Accept": "application/json"
}

# Design a voice from a text description
response = requests.post(f"{BASE}/audio/speech", headers=HEADERS, data={
    "text": "And here, in the depths of the rainforest, we find a creature unlike any other.",
    "model": "Qwen3_TTS_12Hz_1_7B_VoiceDesign",
    "lang": "English",
    "speed": 1.0,
    "format": "mp3",
    "sample_rate": 24000,
    "instruct": "British male, 70s, warm and authoritative, low-medium pitch, measured pace, rich timbre, documentary narrator."
})

request_id = response.json()["data"]["request_id"]

# Poll for result (same pattern as above)
while True:
    result = requests.get(
        f"{BASE}/jobs/{request_id}",
        headers=HEADERS
    ).json()

    if result["data"]["status"] == "done":
        print(f"Audio URL: {result['data']['result_url']}")  
        break
    elif result["data"]["status"] == "error":
        print(f"Error: {result['data']}")  
        break

    time.sleep(1)

VoiceDesign shines for character dialogue in games and audio dramas, prototyping brand voices before hiring talent, and any situation where you need a specific vocal quality but don’t have a recording to clone. Think of it as casting – generate 5-10 variants, pick the best, save the exact prompt for consistency across your project.

One important difference from the other modes: VoiceDesign is not deterministic. The same description produces similar but not identical voices across runs. For production, find a voice you like, then reuse that exact prompt text throughout your project.

Writing Text That Sounds Right

All three modes share the same text-handling rules, rooted in the Qwen3 LLM backbone.

Punctuation controls pacing. Periods create full pauses. Commas add brief breaths. Em-dashes create dramatic suspension. Ellipses build suspense. These aren’t suggestions – they’re the primary tool for shaping how your audio sounds.

CAPS add emphasis sparingly. “I said NO.” stresses “NO” naturally, but “I AM SO EXCITED ABOUT THIS” reads as shouting. Limit CAPS to one or two words per paragraph.

Match your text style to your voice. A corporate narrator voice paired with casual slang (“Yo, what’s up”) sounds dissonant. A playful podcast voice reading legal disclaimers sounds equally wrong. The voice and the words need to belong together.

Keep runs between 500 and 1,500 characters. The model handles up to 5,000, but prosody stays most natural in the mid-range. For longer content, split into segments and concatenate the audio.
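One way to do that split is to break at sentence boundaries and pack sentences into runs under the limit. A minimal sketch (the function name and the sentence-splitting regex are my own simplifications; real text may need smarter segmentation for abbreviations and quotes):

```python
import re

def split_for_tts(text: str, max_chars: int = 1500) -> list[str]:
    """Split long text into runs under max_chars, breaking at sentence ends."""
    # Split after sentence-ending punctuation, keeping the punctuation attached.
    sentences = re.split(r"(?<=[.!?…])\s+", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        segments.append(current)
    return segments
```

Feed each segment through the API separately, then concatenate the resulting audio files in order.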

For Chinese and Japanese, use full-width punctuation: 。，？！ instead of .,?! – the model was trained on full-width characters for CJK text.
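A simple character mapping handles the conversion for text coming from sources that use half-width punctuation. Note this naive version (my own sketch) also converts dots inside numbers and URLs, so apply it only to plain CJK prose:

```python
# Map common half-width punctuation to its full-width CJK equivalent.
FULLWIDTH = str.maketrans({
    ".": "。", ",": "，", "?": "？", "!": "！", ":": "：", ";": "；",
})

def to_fullwidth(text: str) -> str:
    """Convert half-width punctuation to full-width for CJK text."""
    return text.translate(FULLWIDTH)
```

Run it on Chinese or Japanese input just before the API call; English text should be left untouched.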

Get Started with deAPI

All three Qwen3 TTS models use the same endpoint: POST /api/v2/audio/speech. The model parameter determines which mode you’re using – swap the slug to switch between preset voices, voice cloning, and voice design.

Pricing is identical across all three modes: $12.86 per 1 million characters. An average sentence (~100 characters) costs roughly $0.0013. The $5 free credit you get on signup covers approximately 390,000 characters – enough to generate hours of audio without spending a cent.
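The arithmetic behind those numbers is straightforward to wire into a budgeting check. A small sketch (helper names are my own; the rate comes from this article's pricing):

```python
PRICE_PER_MILLION = 12.86  # USD per 1M characters, same for all three modes

def estimate_cost(text: str) -> float:
    """Estimated charge in USD for synthesizing the given text."""
    return len(text) * PRICE_PER_MILLION / 1_000_000

def characters_for(budget_usd: float) -> int:
    """How many characters a given budget covers."""
    return int(budget_usd / PRICE_PER_MILLION * 1_000_000)
```

For example, `estimate_cost` on a 100-character sentence returns about $0.0013, and `characters_for(5)` lands near the 390,000-character figure above.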

Full API documentation, including all available languages and voice presets, is at docs.deapi.ai.


Ready to add voice to your app? Sign up at deapi.ai and get $5 in free credits. Pick a preset voice and hear results in seconds, or clone your own voice from a 10-second recording. No credit card required.
