AI-103 Certification Practice Question #59

Question

You are creating an agent workflow in a Microsoft Foundry project to support natural voice interactions.
The agent must receive continuous audio input, convert the input into text for reasoning, and then return spoken responses to a user. The workflow must meet the following requirements:
Support turn-taking dynamics, where the agent begins to generate the speech output before the user finishes speaking.
Operate with low latency to maintain a conversational experience.
You need to enable both speech to text and text to speech in a real-time agent interaction.
What should you do?

Accepted Answer

A natural, low-latency voice agent needs two real-time capabilities: real-time speech to text, which converts the continuous incoming audio into text for reasoning, and text to speech, which speaks the agent's generated text back to the user. Both must stream so that output can begin before the user finishes speaking. MS Learn describes real-time speech to text as instant transcription of live audio, and describes low-latency, streaming text to speech as enabling responsive interactive dialogue and starting synthesis from partial text. Together these satisfy the turn-taking and low-latency requirements. Batch transcription (A) is for prerecorded files, embeddings (C) do not perform speech to text or text to speech, and speech translation (D) solves a different problem. The suggested answer and MS Learn agree. Confidence is 90%, reduced only because no community vote data is available.
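The idea of starting synthesis from partial text can be illustrated with a small sketch. This is not the Azure Speech SDK or any Foundry API. It is a standard-library Python illustration in which the names speakable_chunks and fake_synthesize are hypothetical, and the "synthesizer" is a stand-in that only tags text. It shows how text arriving in pieces can be cut into sentence-sized chunks and handed to a synthesizer as soon as each chunk is complete, rather than waiting for the full response.

```python
from typing import Iterable, Iterator

# Characters treated as sentence boundaries (a simplifying assumption;
# a decimal such as "3.5" would be split incorrectly by this rule).
SENTENCE_END = (".", "!", "?")


def speakable_chunks(text_stream: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it is complete in the incoming stream."""
    buffer = ""
    for piece in text_stream:
        buffer += piece
        while True:
            # Find the first sentence boundary currently in the buffer.
            cut = next(
                (i for i, ch in enumerate(buffer) if ch in SENTENCE_END), -1
            )
            if cut == -1:
                break  # no complete sentence yet; wait for more text
            chunk = buffer[: cut + 1].strip()
            buffer = buffer[cut + 1 :]
            if chunk:
                yield chunk
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


def fake_synthesize(chunk: str) -> str:
    """Hypothetical stand-in for a streaming text-to-speech call."""
    return f"<audio:{chunk}>"


# Text arrives in fragments, as it would from a model that streams tokens.
incoming = ["Hello", " there. How", " are you", "? Fine"]
spoken = [fake_synthesize(c) for c in speakable_chunks(incoming)]
```

Because speakable_chunks is a generator, the first chunk is available after the second fragment arrives, which is the property that lets spoken output begin early. A real service would replace fake_synthesize with an actual streaming synthesis call.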
