MAI-Voice-2-Preview: Multilingual Prompted Text-to-Speech
Model card reference: MAI-Voice-2 (Foundry) latest update
MAI-Voice-2-Preview is a high-fidelity, expressive, prompted TTS model in public preview across 15 languages and 18 locales. This notebook demonstrates REST call patterns, multilingual synthesis with currently published prebuilt voices, voice prompting guidance, and practical implementation notes.
1. Setup
Environment variables
| Variable | Required | Secret | Purpose |
|---|---|---|---|
| MAI_VOICE_2_ENDPOINT | Optional | No | Voice endpoint (falls back to East US TTS endpoint). |
| MAI_VOICE_2_KEY | Optional* | Yes | API key when key-based auth is used. |
| USE_ENTRA_AUTH | Optional | No | Set true to use Entra auth, false to force key auth. |
| MAI_VOICE_2_OUTPUT_DIR | Optional | No | Output directory for generated audio; defaults to media/mai-voice-2. |
* Required when USE_ENTRA_AUTH=false.
Do not commit .env or deployment.env files with secrets.
# %pip install -q requests python-dotenv azure-identity
import os
from pathlib import Path
import requests
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
ENV_PATH = 'deployment.env' if os.path.exists('deployment.env') else os.path.join('..', 'deployment.env')
load_dotenv(ENV_PATH, override=True)
# Resolve the endpoint: notebook-specific variable first, then the shared one, then East US.
VOICE2_ENDPOINT = (
    os.getenv('MAI_VOICE_2_ENDPOINT')
    or os.getenv('VOICE_SPEECH_ENDPOINT')
    or 'https://eastus.tts.speech.microsoft.com/'
)
VOICE2_KEY = (
    os.getenv('MAI_VOICE_2_KEY')
    or os.getenv('VOICE_SPEECH_KEY')
    or os.getenv('AZURE_SPEECH_KEY')
)
USE_ENTRA_AUTH = os.getenv('USE_ENTRA_AUTH', 'true').lower() == 'true'
# Key auth is impossible without a key, so fall back to Entra ID.
if not VOICE2_KEY:
    USE_ENTRA_AUTH = True
voice_output_env = os.getenv('MAI_VOICE_2_OUTPUT_DIR')
OUT_DIR = Path(voice_output_env) if voice_output_env else Path('media') / 'mai-voice-2'
OUT_DIR.mkdir(parents=True, exist_ok=True)
token_provider = None
if USE_ENTRA_AUTH:
    # Remove empty service-principal variables so they do not interfere with DefaultAzureCredential.
    for env_var in ('AZURE_TENANT_ID', 'AZURE_CLIENT_ID', 'AZURE_CLIENT_SECRET'):
        if os.getenv(env_var) == '':
            os.environ.pop(env_var, None)
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(),
        'https://cognitiveservices.azure.com/.default',
    )
print(f'Endpoint: {VOICE2_ENDPOINT}')
# Double quotes outside, single inside: nested same-type quotes only parse on Python 3.12+.
print(f"Auth mode: {'Entra ID' if USE_ENTRA_AUTH else 'API key'}")
print('Default sample voice: en-US-Harper:MAI-Voice-2-Preview')
print('Output target: 24kHz MP3')
2. Model Card Highlights
- Preview status: MAI-Voice-2 is currently in public preview and is not recommended for production workloads.
- High-fidelity natural voice synthesis with expressive, conversational output.
- Voice prompting: generates speech from short audio prompts (5-60 seconds). This capability is gated and requires Microsoft approval plus consent safeguards.
- Multilingual support across 15 languages and 18 locales.
- Supported languages: Arabic, Chinese, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Thai, and Vietnamese.
- Supports long-form content generation via chunking with context carryover.
- Output format is 24kHz mono audio.
- Served globally via East US, Sweden Central, and Southeast Asia.
- Pricing reference: $22 per 1M characters.
- Out-of-scope note: optimized for naturalness/expressivity over ultra-low-latency scenarios.
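The long-form bullet above mentions chunking with context carryover. The highlights do not describe how context is carried between chunks, so the following is only a client-side sketch of the chunking half: it splits text at sentence boundaries under a character budget. The function name `chunk_text` and the 1,000-character default are illustrative assumptions, not documented service limits.

```python
import re


def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    max_chars is an illustrative budget, not a documented service limit.
    A single sentence longer than max_chars is kept whole rather than cut mid-word.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks: list[str] = []
    current = ''
    for sentence in sentences:
        candidate = f'{current} {sentence}'.strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be wrapped in SSML and sent through the `synthesize_to_file` helper defined later in this notebook. Languages that do not separate sentences with whitespace (for example Chinese, Japanese, or Thai) would need a different splitting rule.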
Available prebuilt voices
The currently published MAI-Voice-2-Preview prebuilt voices are:
| Voice ID | Locale | Language | Gender | Recommended use case |
|---|---|---|---|---|
| en-US-Harper:MAI-Voice-2-Preview | en-US | English (United States) | Female | General conversation, expressive long-form |
| es-MX-Valeria:MAI-Voice-2-Preview | es-MX | Spanish (Mexico) | Female | General conversation, multilingual narration |
| fr-FR-Soleil:MAI-Voice-2-Preview | fr-FR | French (France) | Female | General conversation, multilingual narration |
| de-DE-Klaus:MAI-Voice-2-Preview | de-DE | German (Germany) | Male | General conversation, multilingual narration |
Microsoft may add more locales and voices during preview; check the public MAI-Voice documentation before hard-coding a voice list in production code.
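One way to avoid scattering voice IDs through application code is to keep them in a single lookup with an explicit fallback. The sketch below copies the four voice IDs from the table above; the names `PREBUILT_VOICES` and `pick_voice` and the fallback policy (same language first, then a default locale) are assumptions made for illustration, not part of any MAI-Voice API.

```python
# Voice IDs copied from the prebuilt-voice table; update this mapping as voices are published.
PREBUILT_VOICES = {
    'en-US': 'en-US-Harper:MAI-Voice-2-Preview',
    'es-MX': 'es-MX-Valeria:MAI-Voice-2-Preview',
    'fr-FR': 'fr-FR-Soleil:MAI-Voice-2-Preview',
    'de-DE': 'de-DE-Klaus:MAI-Voice-2-Preview',
}


def pick_voice(locale: str, default_locale: str = 'en-US') -> str:
    """Return a voice ID for locale, falling back by language, then to a default."""
    if locale in PREBUILT_VOICES:
        return PREBUILT_VOICES[locale]
    # Fall back to any listed voice that shares the language code (e.g. es-ES -> es-MX).
    language = locale.split('-')[0].lower()
    for known_locale, voice in PREBUILT_VOICES.items():
        if known_locale.split('-')[0].lower() == language:
            return voice
    # No voice for this language is listed: use the default locale's voice.
    return PREBUILT_VOICES[default_locale]
```

Whether a same-language fallback is acceptable depends on the product: a Mexican Spanish voice reading text written for Spain may be fine for a prototype but not for a localized release.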
Choosing MAI-Voice-1 vs. MAI-Voice-2
| If you need... | Use | Practical guidance |
|---|---|---|
| English-only TTS with mature SSML style-control examples | MAI-Voice-1 | Keep MAI-Voice-1 for existing English flows that already depend on a specific voice/style combination. |
| Multilingual narration or localized conversational UX | MAI-Voice-2-Preview | Run the same script through the closest MAI-Voice-2 locale, then compare pronunciation, naturalness, and persona consistency side by side. |
| A voice that resembles an approved short reference clip | MAI-Voice-2-Preview voice prompting | Use only approved prompt audio, keep clips in the 5-60 second range, document consent, and review generated output before downstream use. |
For recipe validation, save one MAI-Voice-1 baseline and one MAI-Voice-2 sample for the same sentence. Listen for pronunciation, pacing, emotional fit, and whether the localized voice preserves the intent without over-tuning the prompt.
3. Reference HTTP Pattern
reference_ssml = '''<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-Harper:MAI-Voice-2-Preview">
Hello, this is a test of the text-to-speech model.
</voice>
</speak>'''
reference_url = f"{VOICE2_ENDPOINT.rstrip('/')}/cognitiveservices/v1"
reference_headers = {
    'Content-Type': 'application/ssml+xml',
    'X-Microsoft-OutputFormat': 'audio-24khz-160kbitrate-mono-mp3',
    'User-Agent': 'mai-voice-2-notebook-reference',
}
if USE_ENTRA_AUTH:
    reference_headers['Authorization'] = f"Bearer {token_provider()}"
else:
    reference_headers['Ocp-Apim-Subscription-Key'] = VOICE2_KEY
RUN_REFERENCE_CALL = False
if RUN_REFERENCE_CALL:
    response = requests.post(
        reference_url,
        headers=reference_headers,
        data=reference_ssml.encode('utf-8'),
        timeout=180,
    )
    response.raise_for_status()
    out_file = OUT_DIR / 'speech-voice-en.mp3'
    out_file.write_bytes(response.content)
    print(f'Wrote {out_file} ({out_file.stat().st_size:,} bytes)')
else:
    safe_headers = {
        k: ('<bearer token>' if k == 'Authorization' else '<subscription key>' if k == 'Ocp-Apim-Subscription-Key' else v)
        for k, v in reference_headers.items()
    }
    print('Set RUN_REFERENCE_CALL=True to execute this Python HTTP sample.')
    print('URL:', reference_url)
    print('Headers:', safe_headers)
4. Helper: Synthesize SSML to File
def headers() -> dict:
    h = {
        'Content-Type': 'application/ssml+xml',
        'X-Microsoft-OutputFormat': 'audio-24khz-160kbitrate-mono-mp3',
        'User-Agent': 'mai-voice-2-notebook',
    }
    if USE_ENTRA_AUTH:
        h['Authorization'] = f"Bearer {token_provider()}"
    else:
        h['Ocp-Apim-Subscription-Key'] = VOICE2_KEY
    return h

def synthesize_to_file(ssml: str, out_file: str) -> Path:
    url = f"{VOICE2_ENDPOINT.rstrip('/')}/cognitiveservices/v1"
    resp = requests.post(url, headers=headers(), data=ssml.encode('utf-8'), timeout=180)
    if not resp.ok:
        raise requests.HTTPError(f'TTS request failed with {resp.status_code}: {resp.text}', response=resp)
    p = OUT_DIR / out_file
    p.write_bytes(resp.content)
    print(f'Wrote {p} ({p.stat().st_size:,} bytes)')
    return p
5. Multilingual Synthesis Samples
The audio generated by this cell is illustrative of a single run; style and prosody can vary between runs and model updates.
samples = [
    {'lang': 'en-US', 'voice': 'en-US-Harper:MAI-Voice-2-Preview', 'text': 'Hello from MAI Voice 2 in English.', 'out': 'mai_voice2_en.mp3'},
    {'lang': 'es-MX', 'voice': 'es-MX-Valeria:MAI-Voice-2-Preview', 'text': 'Hola, esta es una muestra de MAI Voice 2.', 'out': 'mai_voice2_es.mp3'},
    {'lang': 'fr-FR', 'voice': 'fr-FR-Soleil:MAI-Voice-2-Preview', 'text': 'Bonjour, ceci est un exemple MAI Voice 2.', 'out': 'mai_voice2_fr.mp3'},
    {'lang': 'de-DE', 'voice': 'de-DE-Klaus:MAI-Voice-2-Preview', 'text': 'Hallo, dies ist eine MAI Voice 2 Probe.', 'out': 'mai_voice2_de.mp3'},
]
for s in samples:
    ssml = f'''<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="{s['lang']}">
<voice name="{s['voice']}">{s['text']}</voice>
</speak>'''
    try:
        synthesize_to_file(ssml, s['out'])
    except Exception as ex:
        print(f"{s['voice']} failed: {ex}")
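The sample loop above interpolates text directly into the SSML string, which works for the fixed sentences shown but would produce malformed XML if the text contained characters such as `&` or `<`. A minimal sketch of an escaping builder follows, using only the standard library; the function name `build_ssml` is a hypothetical helper, not part of the notebook or of any SDK.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice: str, lang: str) -> str:
    """Build a single-voice SSML document with XML-escaped text and attributes."""
    # quoteattr() returns the value already wrapped in quotes, so none are added here.
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>{escape(text)}</voice>'
        '</speak>'
    )
```

For user-supplied text, a call such as `build_ssml(user_text, 'en-US-Harper:MAI-Voice-2-Preview', 'en-US')` could replace the inline f-string before passing the result to `synthesize_to_file`.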
6. Voice Prompting and Access Requests
Voice prompting (personal voice cloning) generates speech from a short audio prompt of 5-60 seconds. It is gated and requires Microsoft approval plus consent safeguards.
Use only prompt audio you are authorized to use, retain consent records, and review generated audio before downstream use.
If MAI-Voice-2 or voice prompting is not visible in your subscription, treat it as gated preview access: request access through your Microsoft account team or the Azure AI Custom Neural Voice and Custom Avatar Limited Access Review, then wait for approval before building with customer data.
Implementation reminders from the model card:
- Apply for limited access approval.
- Upload consent audio and a 5-60 second prompt.
- Use the Personal Voice APIs to create a voice profile.
- Synthesize with the approved voice profile.
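Before uploading a prompt, it can help to check locally that the clip falls inside the 5-60 second window stated above. The sketch below assumes the prompt is an uncompressed PCM WAV file, which is an assumption of this example (the notebook does not state which audio formats are accepted); the helper names are hypothetical.

```python
import wave

MIN_PROMPT_SECONDS = 5.0   # lower bound stated in this notebook
MAX_PROMPT_SECONDS = 60.0  # upper bound stated in this notebook


def prompt_duration_seconds(wav_path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(wav_path, 'rb') as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_prompt_length(wav_path: str) -> bool:
    """Check that the clip falls inside the 5-60 second prompt window."""
    return MIN_PROMPT_SECONDS <= prompt_duration_seconds(wav_path) <= MAX_PROMPT_SECONDS
```

This only validates length. Consent, authorization, and audio quality still need to be reviewed separately.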
7. Next Steps
- Set MAI_VOICE_2_PRICE_PER_1M_CHAR once MAI-Voice-2 pricing is confirmed for your subscription; the model card highlights list $22 per 1M characters as a reference figure.
- Re-check the public MAI-Voice docs for newly published voices and locales before shipping.
- Add latency benchmarking if your scenario is latency-sensitive.
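For the pricing step, a rough cost estimate can be derived from character counts. The sketch below reads `MAI_VOICE_2_PRICE_PER_1M_CHAR` and falls back to the $22 per 1M characters reference from the model card highlights. How the service actually counts billable characters (for example, whether SSML markup is included) is not stated in this notebook, so the plain `len(text)` count is an assumption, and `estimate_cost_usd` is a hypothetical helper.

```python
import os

# Reference price from the model card highlights; override via the environment variable.
PRICE_PER_1M = float(os.getenv('MAI_VOICE_2_PRICE_PER_1M_CHAR') or '22')


def estimate_cost_usd(text: str, price_per_1m_chars: float = PRICE_PER_1M) -> float:
    """Estimate synthesis cost from a plain character count (billing rules are assumed)."""
    return len(text) * price_per_1m_chars / 1_000_000
```

Under these assumptions, a 500,000-character script at the reference price works out to about $11.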