Build Multilingual TTS with MAI-Voice-2-Preview

MAI-Voice-2-Preview: Multilingual Prompted Text-to-Speech

Model card reference: MAI-Voice-2 (Foundry) latest update

MAI-Voice-2-Preview is a high-fidelity, expressive, prompted TTS model in public preview across 15 languages and 18 locales. This notebook demonstrates REST call patterns, multilingual synthesis with currently published prebuilt voices, voice prompting guidance, and practical implementation notes.

1. Setup

Environment variables

Variable	Required	Secret	Purpose
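`MAI_VOICE_2_ENDPOINT`	Optional	No	Voice endpoint (falls back to `VOICE_SPEECH_ENDPOINT`, then to the East US TTS endpoint).
`MAI_VOICE_2_KEY`	Optional*	Yes	API key when key-based auth is used (falls back to `VOICE_SPEECH_KEY`, then `AZURE_SPEECH_KEY`).
`USE_ENTRA_AUTH`	Optional	No	Set `true` (the default) to use Entra ID auth, or `false` to use key auth. Entra ID auth is used regardless when no key is set.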
`MAI_VOICE_2_OUTPUT_DIR`	Optional	No	Output directory for generated audio; defaults to `media/mai-voice-2`.

* Required when USE_ENTRA_AUTH=false.

Do not commit .env or deployment.env files with secrets.

# %pip install -q requests python-dotenv azure-identity

import os
from pathlib import Path
import requests
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

ENV_PATH = 'deployment.env' if os.path.exists('deployment.env') else os.path.join('..', 'deployment.env')
load_dotenv(ENV_PATH, override=True)

VOICE2_ENDPOINT = (
    os.getenv('MAI_VOICE_2_ENDPOINT')
    or os.getenv('VOICE_SPEECH_ENDPOINT')
    or 'https://eastus.tts.speech.microsoft.com/'
)
VOICE2_KEY = (
    os.getenv('MAI_VOICE_2_KEY')
    or os.getenv('VOICE_SPEECH_KEY')
    or os.getenv('AZURE_SPEECH_KEY')
)
USE_ENTRA_AUTH = os.getenv('USE_ENTRA_AUTH', 'true').lower() == 'true'
# Without a key, key auth cannot work, so fall back to Entra ID.
if not VOICE2_KEY:
    USE_ENTRA_AUTH = True

voice_output_env = os.getenv('MAI_VOICE_2_OUTPUT_DIR')
OUT_DIR = Path(voice_output_env) if voice_output_env else Path('media') / 'mai-voice-2'
OUT_DIR.mkdir(parents=True, exist_ok=True)

token_provider = None
if USE_ENTRA_AUTH:
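    # Remove service-principal variables that are set but empty so that
    # DefaultAzureCredential does not pick them up as configured values.
    for env_var in ('AZURE_TENANT_ID', 'AZURE_CLIENT_ID', 'AZURE_CLIENT_SECRET'):
        if os.getenv(env_var) == '':
            os.environ.pop(env_var, None)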
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(),
        'https://cognitiveservices.azure.com/.default',
    )

print(f'Endpoint: {VOICE2_ENDPOINT}')
print(f"Auth mode: {'Entra ID' if USE_ENTRA_AUTH else 'API key'}")
print('Default sample voice: en-US-Harper:MAI-Voice-2-Preview')
print('Output target: 24kHz MP3')

2. Model Card Highlights

Preview status: MAI-Voice-2 is currently in public preview and is not recommended for production workloads.
High-fidelity natural voice synthesis with expressive, conversational output.
Generate speech from short audio prompts (5-60 seconds). Voice prompting is gated and requires Microsoft approval plus consent safeguards.
Multilingual support across 15 languages and 18 locales.
Supported languages: Arabic, Chinese, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Thai, and Vietnamese.
Supports long-form content generation via chunking with context carryover.
Output format is 24kHz mono audio.
Served globally via East US, Sweden Central, and Southeast Asia.
Pricing reference: $22 per 1M characters.
Out-of-scope note: optimized for naturalness/expressivity over ultra-low-latency scenarios.
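
The long-form item above mentions chunking. The model card excerpt does not describe how the service carries context between chunks, so the following is only a minimal client-side sketch of the splitting step. The helper name `chunk_text` and the `max_chars` default are assumptions for illustration, not documented limits. The splitter only recognizes `.`, `!`, and `?` followed by whitespace, so languages with other sentence punctuation (for example Chinese or Japanese) would need a different rule.

```python
import re


def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    max_chars is an illustrative default, not a documented service limit.
    """
    # Split after ., !, or ? when followed by whitespace; punctuation stays
    # attached to the sentence it ends.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks: list[str] = []
    current = ''
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f'{current} {sentence}'.strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than max_chars is hard-split as a last resort.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be wrapped in its own SSML document and synthesized in order.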

Available prebuilt voices

The currently published MAI-Voice-2-Preview prebuilt voices are:

Voice ID	Locale	Language	Gender	Recommended use case
`en-US-Harper:MAI-Voice-2-Preview`	`en-US`	English (United States)	Female	General conversation, expressive long-form
`es-MX-Valeria:MAI-Voice-2-Preview`	`es-MX`	Spanish (Mexico)	Female	General conversation, multilingual narration
`fr-FR-Soleil:MAI-Voice-2-Preview`	`fr-FR`	French (France)	Female	General conversation, multilingual narration
`de-DE-Klaus:MAI-Voice-2-Preview`	`de-DE`	German (Germany)	Male	General conversation, multilingual narration

Microsoft may add more locales and voices during preview; check the public MAI-Voice documentation before hard-coding a voice list in production code.
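
One way to avoid a hard-coded list is to compare the voices your code expects against a list fetched at runtime. The sketch below assumes the standard Azure Speech voices list route (`/cognitiveservices/voices/list`) is available on your endpoint and that MAI-Voice-2 voices appear there with a `ShortName` in the format shown in the table above. Neither assumption is confirmed by the model card, so verify both against the public documentation. The helper names are hypothetical.

```python
import json
import urllib.request


def fetch_voice_list(endpoint: str, auth_headers: dict) -> list:
    """Fetch the voice catalog. Assumes the standard Speech voices list route."""
    url = f"{endpoint.rstrip('/')}/cognitiveservices/voices/list"
    request = urllib.request.Request(url, headers=auth_headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def filter_model_voices(voices: list, marker: str = 'MAI-Voice-2') -> list:
    """Return sorted short names that contain the model marker.

    Assumes each entry is a dict with a 'ShortName' field.
    """
    return sorted(
        v['ShortName'] for v in voices if marker in v.get('ShortName', '')
    )


def missing_voices(expected: list, available: list) -> list:
    """Return the expected voice IDs that are not in the available list."""
    available_set = set(available)
    return [voice for voice in expected if voice not in available_set]
```

A startup check could call `missing_voices` with the voices the application depends on and log a warning when the result is not empty.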

Choosing MAI-Voice-1 vs. MAI-Voice-2

If you need...	Use	Practical guidance
English-only TTS with mature SSML style-control examples	MAI-Voice-1	Keep MAI-Voice-1 for existing English flows that already depend on a specific voice/style combination.
Multilingual narration or localized conversational UX	MAI-Voice-2-Preview	Run the same script through the closest MAI-Voice-2 locale, then compare pronunciation, naturalness, and persona consistency side by side.
A voice that resembles an approved short reference clip	MAI-Voice-2-Preview voice prompting	Use only approved prompt audio, keep clips in the 5-60 second range, document consent, and review generated output before downstream use.

For recipe validation, save one MAI-Voice-1 baseline and one MAI-Voice-2 sample for the same sentence. Listen for pronunciation, pacing, emotional fit, and whether the localized voice preserves the intent without over-tuning the prompt.

3. Reference HTTP Pattern

reference_ssml = '''<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-Harper:MAI-Voice-2-Preview">
    Hello, this is a test of the text-to-speech model.
  </voice>
</speak>'''

reference_url = f"{VOICE2_ENDPOINT.rstrip('/')}/cognitiveservices/v1"
reference_headers = {
    'Content-Type': 'application/ssml+xml',
    'X-Microsoft-OutputFormat': 'audio-24khz-160kbitrate-mono-mp3',
    'User-Agent': 'mai-voice-2-notebook-reference',
}
if USE_ENTRA_AUTH:
    reference_headers['Authorization'] = f"Bearer {token_provider()}"
else:
    reference_headers['Ocp-Apim-Subscription-Key'] = VOICE2_KEY

RUN_REFERENCE_CALL = False
if RUN_REFERENCE_CALL:
    response = requests.post(
        reference_url,
        headers=reference_headers,
        data=reference_ssml.encode('utf-8'),
        timeout=180,
    )
    response.raise_for_status()
    out_file = OUT_DIR / 'speech-voice-en.mp3'
    out_file.write_bytes(response.content)
    print(f'Wrote {out_file} ({out_file.stat().st_size:,} bytes)')
else:
    safe_headers = {
        k: ('<bearer token>' if k == 'Authorization' else '<subscription key>' if k == 'Ocp-Apim-Subscription-Key' else v)
        for k, v in reference_headers.items()
    }
    print('Set RUN_REFERENCE_CALL=True to execute this Python HTTP sample.')
    print('URL:', reference_url)
    print('Headers:', safe_headers)

4. Helper: Synthesize SSML to File

def headers() -> dict:
    h = {
        'Content-Type': 'application/ssml+xml',
        'X-Microsoft-OutputFormat': 'audio-24khz-160kbitrate-mono-mp3',
        'User-Agent': 'mai-voice-2-notebook',
    }
    if USE_ENTRA_AUTH:
        h['Authorization'] = f"Bearer {token_provider()}"
    else:
        h['Ocp-Apim-Subscription-Key'] = VOICE2_KEY
    return h

def synthesize_to_file(ssml: str, out_file: str) -> Path:
    url = f"{VOICE2_ENDPOINT.rstrip('/')}/cognitiveservices/v1"
    resp = requests.post(url, headers=headers(), data=ssml.encode('utf-8'), timeout=180)
    if not resp.ok:
        raise requests.HTTPError(f'TTS request failed with {resp.status_code}: {resp.text}', response=resp)
    p = OUT_DIR / out_file
    p.write_bytes(resp.content)
    print(f'Wrote {p} ({p.stat().st_size:,} bytes)')
    return p
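
The helper prints the size of each file it writes. Because the requested output format is `audio-24khz-160kbitrate-mono-mp3`, that size can be turned into a rough duration as a sanity check, for example to spot an empty or truncated response. This is a minimal sketch that assumes a constant 160 kbit/s stream and ignores any header or metadata bytes. `estimate_mp3_seconds` is a hypothetical helper, not part of the service API.

```python
# Bitrate taken from the requested output format name (160 kbit/s).
BITRATE_BPS = 160_000


def estimate_mp3_seconds(size_bytes: int, bitrate_bps: int = BITRATE_BPS) -> float:
    """Approximate audio duration in seconds from file size.

    Assumes constant bitrate and no metadata overhead, so treat the
    result as a rough estimate only.
    """
    return size_bytes * 8 / bitrate_bps
```

For instance, `estimate_mp3_seconds(p.stat().st_size)` on a file returned by `synthesize_to_file` gives an approximate length to compare against the input text.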

5. Multilingual Synthesis Samples

Illustrative sample audio from one run:

Download sample (01-mai-voice2-en.mp3)

Audio style and prosody can vary between runs and model updates.

samples = [
    {'lang': 'en-US', 'voice': 'en-US-Harper:MAI-Voice-2-Preview', 'text': 'Hello from MAI Voice 2 in English.', 'out': 'mai_voice2_en.mp3'},
    {'lang': 'es-MX', 'voice': 'es-MX-Valeria:MAI-Voice-2-Preview', 'text': 'Hola, esta es una muestra de MAI Voice 2.', 'out': 'mai_voice2_es.mp3'},
    {'lang': 'fr-FR', 'voice': 'fr-FR-Soleil:MAI-Voice-2-Preview', 'text': 'Bonjour, ceci est un exemple MAI Voice 2.', 'out': 'mai_voice2_fr.mp3'},
    {'lang': 'de-DE', 'voice': 'de-DE-Klaus:MAI-Voice-2-Preview', 'text': 'Hallo, dies ist eine MAI Voice 2 Probe.', 'out': 'mai_voice2_de.mp3'},
]

for s in samples:
    ssml = f'''<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="{s['lang']}">
  <voice name="{s['voice']}">{s['text']}</voice>
</speak>'''
    try:
        synthesize_to_file(ssml, s['out'])
    except Exception as ex:
        print(f"{s['voice']} failed: {ex}")

6. Voice Prompting and Access Requests

Voice prompting (personal voice cloning) is gated and requires Microsoft approval plus consent safeguards.

Voice prompting generates speech that resembles the voice in a short audio prompt (5-60 seconds). Use only prompt audio you are authorized to use, retain consent records, and review generated audio before downstream use.

If MAI-Voice-2 or voice prompting is not visible in your subscription, treat it as gated preview access: request access through your Microsoft account team or the Azure AI Custom Neural Voice and Custom Avatar Limited Access Review, then wait for approval before building with customer data.

Implementation reminders from the model card:

Apply for limited access approval.
Upload the consent audio and a 5-60 second prompt.
Use the Personal Voice APIs to create a voice profile.
Synthesize with the approved voice profile.
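
Before uploading, the 5-60 second prompt length can be checked locally. The sketch below assumes the prompt is an uncompressed PCM WAV file, which the standard library `wave` module can read, and treats both bounds as inclusive. The accepted upload formats and the exact boundary behavior are not stated in the model card, and the helper names are hypothetical.

```python
import wave

# Prompt length range from the model card (treated as inclusive here).
MIN_PROMPT_SECONDS = 5.0
MAX_PROMPT_SECONDS = 60.0


def wav_duration_seconds(path) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), 'rb') as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_prompt_length(path) -> bool:
    """Check that a WAV prompt falls within the 5-60 second range."""
    duration = wav_duration_seconds(path)
    return MIN_PROMPT_SECONDS <= duration <= MAX_PROMPT_SECONDS
```

This only validates clip length. Consent and approval requirements still apply regardless of the result.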

7. Next Steps

Set MAI_VOICE_2_PRICE_PER_1M_CHAR once final MAI-Voice-2 pricing is published; the model card currently lists $22 per 1M characters as a reference.
Re-check the public MAI-Voice docs for newly published voices and locales before shipping.
Add latency benchmarking if your scenario is latency-sensitive.
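
The pricing and latency items above can be sketched with two small helpers. Both names are hypothetical. The cost estimate assumes billing is proportional to the character count of the text you pass in. Whether SSML markup counts toward billed characters is not stated in the model card, so check the pricing documentation. The timer measures the full duration of a call, not time to first byte.

```python
import os
import time
from typing import Callable, Optional


def estimate_cost_usd(text: str, price_per_1m_chars: Optional[float] = None) -> float:
    """Estimate synthesis cost in USD from character count."""
    if price_per_1m_chars is None:
        # Assumption: use the $22 reference price when the variable is unset.
        price_per_1m_chars = float(os.getenv('MAI_VOICE_2_PRICE_PER_1M_CHAR', '22'))
    return len(text) * price_per_1m_chars / 1_000_000


def time_call(fn: Callable, *args, repeats: int = 3, **kwargs) -> dict:
    """Run fn several times and report wall-clock duration statistics."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return {
        'min_s': min(durations),
        'mean_s': sum(durations) / len(durations),
        'max_s': max(durations),
    }
```

For example, `time_call(synthesize_to_file, ssml, 'latency_test.mp3')` would time the notebook's helper end to end, including the file write. Each repeat is a billable request.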
