1. Overview

1.1 Service Capabilities

Text-to-Speech (TTS) converts written text into natural, human-like speech. It supports multiple languages and can convey emotional nuance and natural pauses, so the generated speech is clear and expressive. Typical use cases include audiobook narration, video dubbing, and commercial ad voiceovers.

1.2 Sample Prompts and Outputs

| Text | Output Speech |
| --- | --- |
| Although the peak summer travel season has yet to arrive, airfares of flights from interior provinces to Xinjiang have already quietly surged. | [audio sample] |
| From June 15 to 17, leaders of the G7 countries, Canada, France, Germany, Italy, Japan, the United Kingdom, and the United States, will gather in Kananaskis, Canada, for the 51st G7 Summit. | [audio sample] |

2. Prompt Engine

N/A

3. API Requests

3.1 Request URL

Header

| Parameter Name | Value | Required | Example | Description |
| --- | --- | --- | --- | --- |
| Content-Type | application/json | Yes | | |
| Authorization | | Yes | Basic xxx | Security credentials in the format Basic {access_token}, where access_token is generated from the assigned app_key and app_crit as base64(app_key:app_crit). |
| X-App-Key | | Yes | | The assigned app key. |
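The Authorization description can be illustrated with a short Python sketch. It assumes that app_key and app_crit are joined with a single colon and no spaces, that the string is UTF-8 encoded before Base64 encoding, and that X-App-Key carries the same app_key. The function name build_auth_headers and the credentials are placeholders, not part of the API.

```python
import base64


def build_auth_headers(app_key: str, app_crit: str) -> dict:
    """Build the three request headers described in the header table.

    Assumption: access_token = base64("app_key:app_crit"), colon-joined, UTF-8.
    """
    raw = f"{app_key}:{app_crit}".encode("utf-8")
    access_token = base64.b64encode(raw).decode("ascii")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Basic {access_token}",
        "X-App-Key": app_key,  # assumption: same key used to build the token
    }


# Placeholder credentials, for illustration only.
headers = build_auth_headers("demo_key", "demo_crit")
print(headers["Authorization"])
```

If the service expects a different separator or encoding, only the line that builds `raw` needs to change.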

Body

| Parameter Name | Type | Required | Default Value | Description | Other Info |
| --- | --- | --- | --- | --- | --- |
| text | string | Yes | | Text to synthesize. Chinese is supported. Must be fewer than 1024 tokens, where tokens include Chinese characters, English words, punctuation marks, etc. | |
| wsid | integer | Yes | | User WSID. | |
| drive | string | No | | Required if you use cloud storage for video/image output. A JSON-formatted string with the fields space_id (cloud storage space ID), file_dest_path (cloud storage destination directory), and file_tag (list of key/value file tags). Example: { "space_id": 11111, "file_dest_path": "/path/sss", "file_tag": [ { "key": "key1", "value": "value1" }, { "key": "key2", "value": "value2" } ] } | |
| emotion_choice | string | No | Neutral | Emotional tone. Valid values: Neutral, Happy, Sad, Surprise, and Angry. | |
| speaker_choice | string | No | | Voice template. The default is a female voice. The following 15 voice types are supported: GEN_ZH_F_001, GEN_ZH_F_002, GEN_ZH_F_003, GEN_ZH_F_004, GEN_ZH_F_005, GEN_ZH_F_006, GEN_ZH_F_007, GEN_ZH_M_001, GEN_ZH_M_002, GEN_ZH_M_003, GEN_ZH_M_004, GEN_ZH_M_005, GEN_ZH_M_006, CHAR_ZH_M_001, CHAR_ZH_M_002. | |
| ref_audio | string | No | | Reference audio required for voice modeling. Recommended duration: 5 s to 10 s (minimum 3 s, maximum 15 s). Format: WAV. | |
| loudness_adjustment | integer | Yes | -23 | Adjusts the output volume, in dB. Range: -60 to 0. Recommended: -35 to -10. Step: 1. | |
| key_adjustment | integer | Yes | 0 | Adjusts the pitch, in semitones. Range: -12 to 12. Step: 1. | |
| speed_adjustment | number | Yes | 1.0 | Adjusts the playback speed. Range: 0.5x to 2.0x. | |
| file_type | integer | No | | Output storage type. 0: OSS; 5: cloud storage. | |
| is_clone | boolean | No | false | Whether to model the voice. false: standard TTS; true: voice modeling. | |
| callback | string | No | | Callback URL. | |
| params | string | No | | Callback passthrough parameters. | |
| priority | number | No | | Task priority. | |
| lang_code | string | No | zh-CN | Language code. Currently only Chinese is supported. | |
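The documented ranges and valid values can be checked on the client before a request is sent. The following Python sketch is one way to do that; the helper name build_tts_body and the choice to raise ValueError are assumptions for illustration, and the 1024-token limit on text is not checked because the exact tokenization is not specified here.

```python
EMOTIONS = {"Neutral", "Happy", "Sad", "Surprise", "Angry"}
SPEAKERS = (
    [f"GEN_ZH_F_{i:03d}" for i in range(1, 8)]      # 7 female voices
    + [f"GEN_ZH_M_{i:03d}" for i in range(1, 7)]    # 6 male voices
    + ["CHAR_ZH_M_001", "CHAR_ZH_M_002"]            # 2 character voices
)


def build_tts_body(text, wsid, *, emotion_choice="Neutral", speaker_choice=None,
                   loudness_adjustment=-23, key_adjustment=0,
                   speed_adjustment=1.0, lang_code="zh-CN"):
    """Validate parameters against the documented ranges and return a body dict."""
    if not isinstance(text, str) or not text:
        raise ValueError("text must be a non-empty string")
    if emotion_choice not in EMOTIONS:
        raise ValueError(f"emotion_choice must be one of {sorted(EMOTIONS)}")
    if speaker_choice is not None and speaker_choice not in SPEAKERS:
        raise ValueError("speaker_choice is not a supported voice type")
    if not (isinstance(loudness_adjustment, int) and -60 <= loudness_adjustment <= 0):
        raise ValueError("loudness_adjustment must be an integer from -60 to 0")
    if not (isinstance(key_adjustment, int) and -12 <= key_adjustment <= 12):
        raise ValueError("key_adjustment must be an integer from -12 to 12")
    if not (isinstance(speed_adjustment, (int, float)) and 0.5 <= speed_adjustment <= 2.0):
        raise ValueError("speed_adjustment must be between 0.5 and 2.0")

    body = {
        "text": text,
        "wsid": wsid,
        "emotion_choice": emotion_choice,
        "loudness_adjustment": loudness_adjustment,
        "key_adjustment": key_adjustment,
        "speed_adjustment": speed_adjustment,
        "lang_code": lang_code,
    }
    if speaker_choice is not None:
        body["speaker_choice"] = speaker_choice  # omit to use the default voice
    return body


body = build_tts_body("你好", 1001, speaker_choice="GEN_ZH_F_001")  # placeholder wsid
print(body)
```

Validating locally gives immediate, specific errors instead of relying on the service's error codes, which are not listed in this section.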

3.3 Response

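| Parameter Name | Type | Required | Default Value | Description | Other Info |
| --- | --- | --- | --- | --- | --- |
| code | number | Yes | | Error code. | |
| msg | string | Yes | | Error message. | |
| data | object | No | | | |
| ├─ task_id | string | No | | Task ID. | |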
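A response with these fields could be handled as in the Python sketch below. The success value of code is not stated in this section, so success_code=0 is an assumption exposed as a parameter. The names TTSAPIError and extract_task_id, and the sample payload, are made up for illustration.

```python
import json


class TTSAPIError(Exception):
    """Raised when the response signals an error or lacks a task ID."""


def extract_task_id(raw_response: str, success_code: int = 0) -> str:
    """Parse the JSON response and return data.task_id.

    Assumption: `code == success_code` means success; the real value
    should be confirmed against the service's error code list.
    """
    payload = json.loads(raw_response)
    code = payload.get("code")
    if code != success_code:
        raise TTSAPIError(f"code={code}, msg={payload.get('msg')}")
    data = payload.get("data") or {}  # data is optional in the response
    task_id = data.get("task_id")
    if task_id is None:
        raise TTSAPIError("response contains no data.task_id")
    return task_id


# Made-up payload that follows the field layout of the response table.
sample = '{"code": 0, "msg": "success", "data": {"task_id": "abc123"}}'
print(extract_task_id(sample))
```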

3.4 Sample Requests
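As a hedged illustration, the Python sketch below assembles a POST request from the headers and body parameters described in section 3, without sending it. The URL is a placeholder because the request URL is not given in this section, and the credentials, wsid, and text are made-up values. Encoding drive with json.dumps is an assumption based on its type being string "in JSON format".

```python
import base64
import json
import urllib.request

# Hypothetical endpoint: replace with the real request URL.
API_URL = "https://api.example.invalid/tts"


def build_request(app_key: str, app_crit: str, body: dict) -> urllib.request.Request:
    """Assemble (but do not send) a JSON POST request with the documented headers."""
    token = base64.b64encode(f"{app_key}:{app_crit}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
            "X-App-Key": app_key,
        },
        method="POST",
    )


body = {
    "text": "你好，欢迎使用语音合成服务。",
    "wsid": 12345,  # placeholder user WSID
    "emotion_choice": "Neutral",
    "speaker_choice": "GEN_ZH_F_001",
    "loudness_adjustment": -23,
    "key_adjustment": 0,
    "speed_adjustment": 1.0,
    "file_type": 5,  # 5: cloud storage, so drive is supplied
    # Assumption: drive is a JSON-encoded string, not a nested object.
    "drive": json.dumps({
        "space_id": 11111,
        "file_dest_path": "/path/sss",
        "file_tag": [{"key": "key1", "value": "value1"}],
    }),
    "lang_code": "zh-CN",
}

request = build_request("demo_key", "demo_crit", body)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it once API_URL points at the real endpoint.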