Amazon Polly Neural (also known as the Neural Text-to-Speech or NTTS engine) is a deep learning-based speech synthesis system developed by Amazon Web Services. Launched as an advancement over traditional concatenative synthesis, it is designed to produce natural-sounding speech by improving the intonation and rhythm of synthesized voices.
The model uses a sequence-to-sequence architecture with two primary components: a neural network that converts input phonemes into spectrograms, and a neural vocoder that transforms those spectrograms into a continuous audio waveform. This two-stage design lets the system capture subtle acoustic features and context-dependent prosody, giving it higher audio fidelity than the legacy standard engine.
Key capabilities include support for speaking styles such as the Newscaster style, which mimics the formal delivery of professional news anchors and is offered for a limited set of voices. The engine also supports Speech Synthesis Markup Language (SSML), although with a smaller tag set than the standard engine: speaking rate and volume can be adjusted, for example, while the emphasis tag and pitch adjustment are not supported for neural voices. The engine is available across dozens of languages and regional accents, and it supports both real-time and asynchronous synthesis operations.
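
As a rough illustration of the Newscaster style and the two synthesis modes, the following Python sketch builds request parameters for the AWS SDK for Python (boto3). It is a minimal sketch rather than a reference implementation: the helper names (`build_newscaster_ssml`, `build_request`, `synthesize_to_file`, `start_async_task`) are illustrative, the choice of the Joanna voice is an assumption, and the S3 bucket name passed by the caller is hypothetical. Running the two network functions additionally assumes that boto3 is installed and that AWS credentials and a region are configured.

```python
from xml.sax.saxutils import escape


def build_newscaster_ssml(text: str) -> str:
    """Wrap plain text in the SSML domain tag used for the Newscaster style."""
    # escape() protects characters such as & and < that would break the SSML.
    return f'<speak><amazon:domain name="news">{escape(text)}</amazon:domain></speak>'


def build_request(text: str, voice_id: str = "Joanna", output_format: str = "mp3") -> dict:
    """Build keyword arguments shared by the real-time and asynchronous calls."""
    return {
        "Engine": "neural",      # select the neural engine instead of "standard"
        "VoiceId": voice_id,     # assumption: a voice that offers the Newscaster style
        "OutputFormat": output_format,
        "TextType": "ssml",
        "Text": build_newscaster_ssml(text),
    }


def synthesize_to_file(text: str, out_path: str) -> None:
    """Real-time synthesis: the audio is returned in the response stream."""
    import boto3  # imported lazily so the builders above work without the SDK

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_request(text))
    with open(out_path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())


def start_async_task(text: str, bucket: str) -> str:
    """Asynchronous synthesis: the audio is written to S3; returns the task ID."""
    import boto3

    polly = boto3.client("polly")
    response = polly.start_speech_synthesis_task(
        OutputS3BucketName=bucket,  # hypothetical bucket supplied by the caller
        **build_request(text),
    )
    return response["SynthesisTask"]["TaskId"]
```

Keeping the request construction separate from the network calls means the SSML and parameters can be inspected or tested offline, for example with `build_request("Markets rose today.")`, before any audio is requested.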