More about neural text-to-speech features

2023-03-08

The text-to-speech feature of the Speech service on Azure has been fully upgraded to the neural text-to-speech engine. This engine uses deep neural networks to make computer-generated voices nearly indistinguishable from recordings of people. With its clear articulation of words, neural text-to-speech significantly reduces listening fatigue when users interact with AI systems.

The patterns of stress and intonation in spoken language are called prosody. Traditional text-to-speech systems break down prosody into separate linguistic analysis and acoustic prediction steps that are governed by independent models. That can result in muffled, buzzy voice synthesis.

Here's more information about neural text-to-speech features in the Speech service, and how they overcome the limits of traditional text-to-speech systems:

  • Real-time speech synthesis: Use the Speech SDK or REST API to convert text to speech by using prebuilt neural voices or custom neural voices; a minimal sketch follows.
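
    Here is a minimal sketch of real-time synthesis with the Speech SDK for Python; the key, region, voice name, and sample text are placeholders rather than values from this article:

    ```python
    import azure.cognitiveservices.speech as speechsdk

    # Placeholder credentials; use your own Speech resource key and region.
    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
    # Pick a prebuilt neural voice; a custom neural voice is selected the same way.
    speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

    # With no audio config, the synthesized audio plays through the default speaker.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
    result = synthesizer.speak_text_async("Hello from neural text to speech.").get()

    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print(f"Received {len(result.audio_data)} bytes of audio.")
    ```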

  • Asynchronous synthesis of long audio: Use the batch synthesis API (Preview) to asynchronously synthesize text to speech for content longer than 10 minutes (for example, audiobooks or lectures). Unlike synthesis performed via the Speech SDK or the text-to-speech REST API, responses aren't returned in real time. Instead, you send a request, poll for its status, and download the synthesized audio when the service makes it available, as sketched below.
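    The submit/poll/download pattern might look like the following sketch. The host, path, API version, and payload field names here are assumptions for illustration, not the authoritative contract; check the current batch synthesis REST reference before use:

    ```python
    import time
    import requests

    # Placeholder resource details.
    key, region = "YOUR_KEY", "YOUR_REGION"
    headers = {"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/json"}
    # Assumed host, path, and api-version; confirm against the current REST reference.
    job_url = (f"https://{region}.api.cognitive.microsoft.com"
               "/texttospeech/batchsyntheses/my-audiobook?api-version=2024-04-01")

    # 1. Submit the job; the call returns immediately instead of streaming audio.
    body = {
        "inputKind": "PlainText",                          # assumed field name
        "inputs": [{"content": "Chapter one. It was a quiet morning..."}],
        "synthesisConfig": {"voice": "en-US-JennyNeural"}, # assumed field name
    }
    requests.put(job_url, headers=headers, json=body).raise_for_status()

    # 2. Poll until the service reports a terminal status.
    while True:
        job = requests.get(job_url, headers=headers).json()
        if job["status"] in ("Succeeded", "Failed"):
            break
        time.sleep(10)

    # 3. Download the synthesized audio from the URL the service provides.
    if job["status"] == "Succeeded":
        audio_url = job["outputs"]["result"]               # assumed response shape
        with open("audiobook.zip", "wb") as f:
            f.write(requests.get(audio_url).content)
    ```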

  • Prebuilt neural voices: The Microsoft neural text-to-speech capability uses deep neural networks to overcome the limits of traditional speech synthesis with regard to stress and intonation in spoken language. Prosody prediction and voice synthesis happen simultaneously, which results in more fluid and natural-sounding output. Each prebuilt neural voice model is available at 24 kHz and at high-fidelity 48 kHz. You can use neural voices to:

    • Make interactions with chatbots and voice assistants more natural and engaging.
    • Convert digital texts such as e-books into audiobooks.
    • Enhance in-car navigation systems.

    For a full list of platform neural voices, see Language and voice support for the Speech service.
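
    As a sketch of requesting the high-fidelity 48 kHz format with the Speech SDK for Python (the key, region, voice name, and file name are placeholders):

    ```python
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
    speech_config.speech_synthesis_voice_name = "en-US-GuyNeural"
    # Request the 48 kHz high-fidelity output format instead of the default.
    speech_config.set_speech_synthesis_output_format(
        speechsdk.SpeechSynthesisOutputFormat.Riff48Khz16BitMonoPcm)

    # Write the audio to a WAV file instead of the default speaker.
    audio_config = speechsdk.audio.AudioOutputConfig(filename="narration.wav")
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config, audio_config=audio_config)
    synthesizer.speak_text_async("Once upon a time...").get()
    ```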

  • Fine-tuning text-to-speech output with SSML: Speech Synthesis Markup Language (SSML) is an XML-based markup language that's used to customize text-to-speech outputs. With SSML, you can adjust pitch, add pauses, improve pronunciation, change speaking rate, adjust volume, and attribute multiple voices to a single document.

    You can use SSML to define your own lexicons or switch to different speaking styles. With the multilingual voices, you can also adjust the speaking languages via SSML. To fine-tune the voice output for your scenario, see Improve synthesis with Speech Synthesis Markup Language and Speech synthesis with the Audio Content Creation tool.
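
    For example, a short SSML document that slows the speaking rate, raises the pitch, and inserts a pause might look like this sketch, passed to the Speech SDK's speak_ssml_async (the voice name and credentials are placeholders):

    ```python
    import azure.cognitiveservices.speech as speechsdk

    # SSML that slows the speaking rate, raises the pitch, and adds a pause.
    ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
      <voice name="en-US-JennyNeural">
        <prosody rate="-10%" pitch="+5%">Welcome back.</prosody>
        <break time="500ms"/>
        Let's pick up where we left off.
      </voice>
    </speak>"""

    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
    result = synthesizer.speak_ssml_async(ssml).get()
    ```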

  • Visemes: Visemes are the key poses in observed speech, including the position of the lips, jaw, and tongue in producing a particular phoneme. Visemes have a strong correlation with voices and phonemes.

    By using viseme events in the Speech SDK, you can generate facial animation data. This data can be used to animate faces for lip-reading communication, education, entertainment, and customer service. Visemes are currently supported only for en-US (US English) neural voices.
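
    A minimal sketch of subscribing to viseme events with the Speech SDK for Python (credentials and sample text are placeholders):

    ```python
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
    speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

    def on_viseme(evt: speechsdk.SpeechSynthesisVisemeEventArgs):
        # Each event carries a viseme ID and an audio offset in 100-ns ticks,
        # which can drive facial animation in sync with the audio.
        print(f"viseme {evt.viseme_id} at {evt.audio_offset / 10_000:.0f} ms")

    synthesizer.viseme_received.connect(on_viseme)
    synthesizer.speak_text_async("Reading lips is easier with visemes.").get()
    ```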
