MAI-Transcribe-1.5 (Speech API multipart patterns)
This notebook follows the Speech REST pattern:
- POST /speechtotext/transcriptions:transcribe?api-version=2025-10-15
- multipart form with audio + definition
- key auth via Ocp-Apim-Subscription-Key
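To make the multipart pattern concrete, here is a minimal standard-library sketch of roughly what goes on the wire: one form part named audio, one named definition, and the key in a header. The helper name build_multipart, the file name clip.wav, and the placeholder key are assumptions for illustration. The rest of this notebook lets requests build the body instead.

```python
import json
import uuid


def build_multipart(audio_name: str, audio_bytes: bytes, audio_mime: str, definition: dict):
    """Assemble a multipart/form-data body by hand (hypothetical helper, for illustration)."""
    boundary = uuid.uuid4().hex
    marker = b'--' + boundary.encode()
    parts = [
        marker,
        f'Content-Disposition: form-data; name="audio"; filename="{audio_name}"'.encode(),
        f'Content-Type: {audio_mime}'.encode(),
        b'',
        audio_bytes,
        marker,
        b'Content-Disposition: form-data; name="definition"',
        b'Content-Type: application/json',
        b'',
        json.dumps(definition).encode(),
        marker + b'--',  # closing boundary
        b'',
    ]
    return b'\r\n'.join(parts), f'multipart/form-data; boundary={boundary}'


# Placeholder audio bytes and key; nothing is sent over the network here.
body, content_type = build_multipart('clip.wav', b'RIFF', 'audio/wav', {'locales': ['en-US']})
headers = {
    'Ocp-Apim-Subscription-Key': '<your-key>',
    'Content-Type': content_type,
}
print(content_type.split(';')[0])
print(body.count(b'Content-Disposition'), 'form parts')
```

Hand-building the body is only useful for seeing the structure; a library such as requests handles boundaries and encoding for you.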
1. Setup
Environment variables
| Variable | Required | Secret | Purpose |
|---|---|---|---|
| MAI_TRANSCRIBE_15_ENDPOINT | Optional | No | Speech endpoint (defaults to East US Speech endpoint). |
| MAI_TRANSCRIBE_15_KEY | Yes | Yes | Subscription key used in Ocp-Apim-Subscription-Key. |
| TRANSCRIBE_LOCAL_AUDIO_DIR | Optional | No | Folder containing local WAV/MP3/FLAC files; defaults to media/mai-transcribe-1-5. |
Do not commit .env or deployment.env files with secrets.
# %pip install -q requests python-dotenv
import os
import json
from copy import deepcopy
from pathlib import Path
import requests
from dotenv import load_dotenv
ENV_PATH = 'deployment.env' if os.path.exists('deployment.env') else os.path.join('..', 'deployment.env')
load_dotenv(ENV_PATH, override=True)
SPEECH_ENDPOINT = os.getenv('MAI_TRANSCRIBE_15_ENDPOINT', 'https://eastus.api.cognitive.microsoft.com/').rstrip('/')
SPEECH_KEY = os.getenv('MAI_TRANSCRIBE_15_KEY')
TRANSCRIBE_URL = f"{SPEECH_ENDPOINT}/speechtotext/transcriptions:transcribe?api-version=2025-10-15"
assert SPEECH_KEY, 'Set MAI_TRANSCRIBE_15_KEY in deployment.env'
audio_dir_env = os.getenv('TRANSCRIBE_LOCAL_AUDIO_DIR')
LOCAL_AUDIO_DIR = Path(audio_dir_env) if audio_dir_env else Path('media') / 'mai-transcribe-1-5'
assert LOCAL_AUDIO_DIR.exists(), f'Audio folder not found: {LOCAL_AUDIO_DIR}'
candidates = sorted(LOCAL_AUDIO_DIR.glob('*.wav')) + sorted(LOCAL_AUDIO_DIR.glob('*.mp3')) + sorted(LOCAL_AUDIO_DIR.glob('*.flac'))
non_empty = [p for p in candidates if p.stat().st_size > 0]
assert non_empty, f'No non-empty WAV/MP3/FLAC files found in: {LOCAL_AUDIO_DIR}'
AUDIO_FILE = str(max(non_empty, key=lambda p: p.stat().st_size))
print('Endpoint:', SPEECH_ENDPOINT)
print('Audio file:', AUDIO_FILE)
def _mime_for(path: Path) -> str:
    s = path.suffix.lower()
    if s == '.wav':
        return 'audio/wav'
    if s == '.mp3':
        return 'audio/mpeg'
    if s == '.flac':
        return 'audio/flac'
    return 'application/octet-stream'
def transcribe_with_definition(audio_path: str, definition: dict) -> dict:
    """POST the audio file and JSON definition as one multipart form; return the parsed JSON."""
    p = Path(audio_path)
    with p.open('rb') as f:
        files = {
            'audio': (p.name, f, _mime_for(p)),
            # A None filename makes requests send this as a plain form field.
            'definition': (None, json.dumps(definition), 'application/json'),
        }
        resp = requests.post(
            TRANSCRIBE_URL,
            headers={'Ocp-Apim-Subscription-Key': SPEECH_KEY},
            files=files,
            timeout=300,
        )
    if resp.ok:
        return resp.json()
    # Compatibility fallback for backends that reject enhancedMode/model.
    # Each retry removes a key that is actually present, so the recursion terminates.
    txt = (resp.text or '').lower()
    enhanced = definition.get('enhancedMode') or {}
    if resp.status_code == 400 and 'enhanced mode with model' in txt and 'model' in enhanced:
        fb = deepcopy(definition)
        fb['enhancedMode'].pop('model', None)
        return transcribe_with_definition(audio_path, fb)
    if resp.status_code == 400 and 'enhanced mode is currently not supported' in txt and 'enhancedMode' in definition:
        fb = deepcopy(definition)
        fb.pop('enhancedMode', None)
        return transcribe_with_definition(audio_path, fb)
    raise requests.HTTPError(f'Transcription failed with {resp.status_code}: {resp.text}', response=resp)
Transcription results are illustrative and may vary by model updates, locale detection, and audio quality.
2. General Speech-to-Text transcription
general_definition = {
    'enhancedMode': {
        'enabled': True,
        'model': 'mai-transcribe-1.5'
    }
}
general_result = transcribe_with_definition(AUDIO_FILE, general_definition)
print(json.dumps(general_result, indent=2)[:2000])
Example response shape:
{
  "id": "<redacted>",
  "status": "succeeded",
  "combinedPhrases": [
    {"text": "..."}
  ],
  "phrases": [
    {"speaker": 0, "offsetMilliseconds": 0, "text": "..."}
  ]
}
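To pull plain text out of a response with the shape above, one option is to join the text fields under combinedPhrases. This is a minimal sketch that assumes the response follows the example shape. The helper name transcript_text and the sample dictionary are illustrative and not part of the API.

```python
def transcript_text(result: dict) -> str:
    """Join the 'text' of every entry in 'combinedPhrases' (hypothetical helper)."""
    phrases = result.get('combinedPhrases') or []
    return ' '.join(p['text'].strip() for p in phrases if p.get('text'))


# Made-up response that mirrors the example shape; not real service output.
sample_result = {
    'id': '<redacted>',
    'status': 'succeeded',
    'combinedPhrases': [{'text': 'Hello and welcome.'}],
    'phrases': [{'speaker': 0, 'offsetMilliseconds': 0, 'text': 'Hello and welcome.'}],
}
print(transcript_text(sample_result))
```

With a live call, the same helper could be applied to general_result.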
3. Speech-to-Text with verbatim mode
verbatim_definition = {
    'enhancedMode': {
        'enabled': True,
        'model': 'mai-transcribe-1.5',
        'transcribeStyle': 'verbatim'
    }
}
verbatim_result = transcribe_with_definition(AUDIO_FILE, verbatim_definition)
print(json.dumps(verbatim_result, indent=2)[:2000])
Example response shape:
{
  "id": "<redacted>",
  "status": "succeeded",
  "combinedPhrases": [
    {"text": "..."}
  ],
  "phrases": [
    {"speaker": 0, "offsetMilliseconds": 0, "text": "..."}
  ]
}
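To see what verbatim mode changes for a given file, it can help to compare the two transcripts word by word. The sketch below uses difflib from the standard library. The helper name word_changes and both sample strings are invented for illustration, and this notebook does not document exactly how verbatim output differs, so treat the example strings as a guess.

```python
from difflib import SequenceMatcher


def word_changes(general_text: str, verbatim_text: str) -> dict:
    """Return the words found in only one of the two transcripts (hypothetical helper)."""
    a, b = general_text.split(), verbatim_text.split()
    added, removed = [], []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag in ('delete', 'replace'):
            removed.extend(a[i1:i2])
        if tag in ('insert', 'replace'):
            added.extend(b[j1:j2])
    return {'only_in_general': removed, 'only_in_verbatim': added}


# Invented strings; real output depends on the audio and the model.
changes = word_changes('I think we should go', 'um I think we we should go')
print(changes)
```

In the notebook, the two inputs would come from the combinedPhrases text of general_result and verbatim_result.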
4. Entity biasing example (phrase list)
This is the entity biasing pattern: pass important names/terms via phraseList.phrases to improve recognition.
phrase_list_definition = {
    'phraseList': {
        'phrases': ['MAI', 'Microsoft Build', 'Azure Speech', 'Foundry', 'Copilot']
    },
    'enhancedMode': {
        'enabled': True,
        'model': 'mai-transcribe-1.5'
    }
}
phrase_list_result = transcribe_with_definition(AUDIO_FILE, phrase_list_definition)
print(json.dumps(phrase_list_result, indent=2)[:2000])
Example response shape:
{
  "id": "<redacted>",
  "status": "succeeded",
  "combinedPhrases": [
    {"text": "..."}
  ],
  "phrases": [
    {"speaker": 0, "offsetMilliseconds": 0, "text": "..."}
  ]
}
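When the phrase list comes from user input or a glossary file, it may be worth cleaning it before sending. The sketch below trims whitespace, drops blanks and duplicates, and enforces the 200-keyword cap from the model card notes in this notebook. The helper name build_phrase_list_definition is hypothetical, and the case-insensitive de-duplication is an assumption, not documented service behavior.

```python
MAX_PHRASES = 200  # cap taken from the model card notes in this notebook


def build_phrase_list_definition(terms, model='mai-transcribe-1.5'):
    """Build a definition dict with a cleaned phrase list (hypothetical helper)."""
    seen, phrases = set(), []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.casefold()  # assumption: treat case variants as duplicates
        if cleaned and key not in seen:
            seen.add(key)
            phrases.append(cleaned)
    if len(phrases) > MAX_PHRASES:
        raise ValueError(f'{len(phrases)} phrases exceeds the limit of {MAX_PHRASES}')
    return {
        'phraseList': {'phrases': phrases},
        'enhancedMode': {'enabled': True, 'model': model},
    }


definition = build_phrase_list_definition(['MAI', 'Azure Speech', ' azure speech ', '', 'Foundry'])
print(definition['phraseList']['phrases'])
```

Raising on overflow, rather than silently truncating, keeps it visible when important terms would otherwise be dropped.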
5. Additional example — Automatic language identification
lid_definition = {
    'locales': ['en-US', 'es-ES', 'fr-FR', 'de-DE'],
    'enhancedMode': {
        'enabled': True,
        'model': 'mai-transcribe-1.5'
    },
}
lid_result = transcribe_with_definition(AUDIO_FILE, lid_definition)
print(json.dumps(lid_result, indent=2)[:2000])
Example response shape:
{
  "id": "<redacted>",
  "status": "succeeded",
  "combinedPhrases": [
    {"text": "..."}
  ],
  "phrases": [
    {"speaker": 0, "offsetMilliseconds": 0, "text": "..."}
  ]
}
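If the service reports which language it detected for each phrase, a quick tally shows how the candidate locales were used. The example shape above does not show such a field, so this sketch assumes each phrase may carry a key named locale and falls back to 'unknown' when it is missing. The helper name locale_counts and the sample data are hypothetical.

```python
from collections import Counter


def locale_counts(result: dict) -> Counter:
    """Count phrases per detected locale (assumes an optional 'locale' key per phrase)."""
    phrases = result.get('phrases') or []
    return Counter(p.get('locale', 'unknown') for p in phrases)


# Invented sample; the 'locale' key is an assumption about the response.
sample = {'phrases': [
    {'text': 'Hello.', 'locale': 'en-US'},
    {'text': 'Hola.', 'locale': 'es-ES'},
    {'text': 'Thanks.', 'locale': 'en-US'},
    {'text': '...'},
]}
print(locale_counts(sample).most_common())
```

Check the printed lid_result first to confirm whether and where a locale is actually reported.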
6. Notes from model card (v2)
- Language coverage expanded to 43 languages (25 from MAI-Transcribe-1 + 18 additional languages in v2).
- Faster long-form inference: up to 5.7x faster than MAI-Transcribe-1 on long audio.
- Entity/keyword biasing supported (up to 200 keywords) via phraseList.phrases.
- Automatic language identification is supported.
- Current limitation: diarization is not supported yet (planned for an upcoming release).
- Input formats: WAV, MP3, FLAC.
- Input limits: up to 300 MB and 2 hours of audio.
- Serving regions (global routing): Central US, Sweden Central, and Southeast Asia.
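The format and size limits above can be checked locally before uploading. The sketch below is a rough pre-flight check under stated assumptions: "300 MB" is read as 300 * 1024 * 1024 bytes, the helper name preflight_problems is hypothetical, and the 2-hour duration limit is not checked because that would require decoding the audio.

```python
import tempfile
from pathlib import Path

SUPPORTED_SUFFIXES = {'.wav', '.mp3', '.flac'}
MAX_BYTES = 300 * 1024 * 1024  # assumption: "300 MB" interpreted as 300 MiB


def preflight_problems(path: Path) -> list:
    """Return human-readable problems; an empty list means the file looks acceptable."""
    problems = []
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append(f'unsupported format: {path.suffix or "(none)"}')
    size = path.stat().st_size
    if size == 0:
        problems.append('file is empty')
    elif size > MAX_BYTES:
        problems.append(f'file is {size} bytes, over the {MAX_BYTES}-byte limit')
    return problems


# Demo with throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    ok_file = Path(tmp) / 'clip.wav'
    ok_file.write_bytes(b'\x00' * 16)
    bad_file = Path(tmp) / 'notes.txt'
    bad_file.write_bytes(b'')
    ok_problems = preflight_problems(ok_file)
    bad_problems = preflight_problems(bad_file)

print(ok_problems)
print(bad_problems)
```

The check only looks at the file extension and size, so a mislabeled file would still pass.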
7. Troubleshooting
| Error | Resolution |
|---|---|
| Enhanced mode ... not supported | Backend limitation; helper auto-falls back by removing model/enhancedMode. |
| InvalidLocale | Add/adjust locale in definition if required by your backend. |
| EmptyAudioFile | Use a non-empty file; notebook auto-picks largest non-empty local audio file. |
| 401/403 | Verify MAI_TRANSCRIBE_15_KEY and endpoint in deployment.env. |
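The table above can also be turned into a small lookup that prints a hint when a call fails. This is a sketch only: the helper name troubleshooting_hint is hypothetical, and it assumes the error names in the table appear verbatim somewhere in the response body, which may not hold for every backend.

```python
def troubleshooting_hint(status_code: int, body: str) -> str:
    """Map a failed response to a hint from the troubleshooting table (hypothetical helper)."""
    text = (body or '').lower()
    if status_code in (401, 403):
        return 'Verify MAI_TRANSCRIBE_15_KEY and endpoint in deployment.env.'
    if 'enhanced mode' in text and 'not supported' in text:
        return 'Backend limitation; retry without model/enhancedMode.'
    if 'invalidlocale' in text:
        return 'Add/adjust locale in definition if required by your backend.'
    if 'emptyaudiofile' in text:
        return 'Use a non-empty audio file.'
    return 'No known hint; inspect the full response body.'


# Invented error bodies for illustration.
print(troubleshooting_hint(401, ''))
print(troubleshooting_hint(400, '{"code": "InvalidLocale"}'))
```

A helper like this could be called from an except block around transcribe_with_definition, using the status code and text of the failed response.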
Tags
mai models inference multimodal