Convert text to natural-sounding human voices using Amazon Polly

For years, brands have relied on famous actors to provide narration for ads, which creates a sense of familiarity and imparts the desired tone. But actors can be expensive, and their voices usually have a limited shelf life for commercial use. A synthetic voice offers an effective alternative.

Furthermore, AI voices make editing voiceovers quick and easy. Users can simply change their script and regenerate the AI voice automatically. Synthetic voice is more accessible than ever in this growing AI and ML world, and what you can do with it is limited only by your imagination. Converting text or SSML to natural-sounding speech also isn't as difficult as it used to be.

In this article, we'll see how AWS Polly, an AI-based text-to-speech service, can be used to convert text to speech in your desired language. We'll also see how to use AWS Lambda to run the conversion and save the audio to AWS S3 and its metadata to DynamoDB.

Services

What is Amazon Polly?

Amazon Polly is a service that turns text into lifelike speech. It's a text-to-speech (TTS) service that uses advanced deep learning technologies to synthesize speech that sounds like a human voice. It provides dozens of lifelike voices in multiple languages. You can create applications that talk and build entirely new categories of speech-enabled products. Polly is a fully managed service that makes it easy for developers to add speech to their applications.

Polly is a paid service with a free tier: 5 million characters per month for standard voices during the first 12 months. After that, you pay for what you use. You can find the pricing details here.

  1. With AWS Polly, developers can customize and control speech output using lexicons and Speech Synthesis Markup Language (SSML) tags.

  2. Store and redistribute speech in standard formats like MP3 and OGG.

  3. Quickly deliver lifelike voices and conversational user experiences with consistently fast response times.
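To make the SSML point concrete, here is a minimal sketch of how plain text could be wrapped in SSML before it is sent to Polly. The helper names (escapeXml, toSsml, buildSsmlParams) are hypothetical and not part of the article's code; the sketch assumes only the standard `<speak>`, `<p>`, and `<prosody>` tags.

```javascript
'use strict';

// Escape the characters that are reserved in XML so that arbitrary
// input text cannot break the SSML document.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical helper: treat blank lines as paragraph breaks and apply
// one speaking rate to every paragraph.
function toSsml(text, rate = 'medium') {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0)
    .map((p) => `<p><prosody rate="${rate}">${escapeXml(p)}</prosody></p>`);
  return `<speak>${paragraphs.join('')}</speak>`;
}

// Request parameters for synthesizeSpeech; TextType must be 'ssml'
// when the text contains SSML tags.
function buildSsmlParams(text, voiceId = 'Joanna') {
  return {
    OutputFormat: 'mp3',
    Text: toSsml(text, 'slow'),
    TextType: 'ssml',
    VoiceId: voiceId,
  };
}

console.log(buildSsmlParams('Tom & Jerry\n\nThe end.').Text);
```

The resulting object could be passed to synthesizeSpeech in place of plain-text parameters.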

How Polly Works

This image, taken straight from the AWS documentation, shows how Polly works.

Product-Page.png

How to use AWS Polly?

For a quick, simple demo of AWS Polly, you can use the AWS Polly console.

  1. Go to the AWS Polly console and click on the Get Started Now button.
  2. Click on Text to Speech.
  3. Select the language and voice you want to use.
  4. Enter the text you want to convert to speech.
  5. Click on Play to listen to the speech.
  6. Click on Download to download the speech.

Without further ado, let's get started and build a simple application that converts text to speech, saves the MP3 file to AWS S3, and stores the metadata in DynamoDB.

What is AWS Lambda?

AWS Lambda is a compute service that lets you run code without provisioning or managing servers. AWS Lambda executes your code only when needed and scales automatically, from a few requests per day to thousands per second. You pay only for the compute time you consume - there is no charge when your code is not running. With Lambda, you can run code for virtually any type of application or backend service - all with zero administration. Just upload your code and Lambda takes care of everything required to run and scale your code with high availability. You can set up your code to automatically trigger from other AWS services or call it directly from any web or mobile app.

What is AWS S3?

AWS S3 is a simple storage service that offers extremely durable, highly available, and highly scalable object storage at very low cost. It is easy to use, with a simple web services interface to store and retrieve any amount of data from anywhere on the web.

What is DynamoDB?

AWS DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. DynamoDB lets you offload the administrative burdens of operating and scaling a distributed database, so that you don't have to worry about hardware provisioning, setup, and configuration, replication, software patching, or cluster scaling.

Prerequisites

  1. You need to have an AWS account. If you don't have one, you can create one here.
  2. You need to have Node.js installed on your machine. You can download it from here.
  3. You need to have the AWS CLI installed on your machine. You can download it from here.
  4. You need to have the AWS CLI configured. You can configure it by running aws configure.
  5. You need to have the Serverless Framework installed on your machine. You can download it from here.

Steps

Note: You can find the complete code for this article in my GitHub Repository.

Initializing the app.

With the Serverless Framework, we can generate boilerplate code for our application. We'll use the serverless create command to do so.

sls create --template aws-nodejs --path speak-polly-app --name speak-polly-app

Here, --template specifies the template to use, aws-nodejs in our case. --path is the directory in which to create the project, speak-polly-app. --name is the name of the service, also speak-polly-app.

After running the above command, you'll see a folder named speak-polly-app in your current directory. This folder contains the boilerplate code for our application.

cd speak-polly-app

Now, let's install the dependencies for our application. We'll be using aws-sdk and uuid packages. The aws-sdk package will help us to interact with AWS services and APIs. The uuid package will help us to generate a unique ID for our speech.

npm install aws-sdk uuid

Code

Now, let's open the handler.js file in our project and replace its contents with the following code for our application.

'use strict';

const AWS = require('aws-sdk');
const { v4: uuidV4 } = require('uuid');
const Polly = new AWS.Polly({ apiVersion: '2016-06-10' });
const S3 = new AWS.S3({ apiVersion: 'latest' });
const dynamodb = new AWS.DynamoDB({ apiVersion: '2012-08-10' });

module.exports.speakPolly = async (event) => {
  try {
    const records = event['Records'];

    for (const record of records) {
      const bucketName = record['s3']['bucket']['name'];
      const objectKey = record['s3']['object']['key'];
      // const objectSize = record['s3']['object']['size'];

      const objectDetails = await getObjectFromS3(bucketName, objectKey);
      const text = objectDetails.Body.toString('utf-8');
      const voiceId = process.env.VOICE_AGENT || 'Carla';

      const params = {
        Engine: 'standard',
        OutputFormat: 'mp3',
        SampleRate: '22050',
        Text: text,
        TextType: 'text',
        VoiceId: voiceId,
      };

      const data = await Polly.synthesizeSpeech(params).promise();

      const destinationBucketName = process.env.AUDIO_BUCKET;
      const mp3FileKey = uuidV4();
      const destinationObjectKey = `${mp3FileKey}.mp3`;

      const s3Params = {
        Bucket: destinationBucketName,
        Key: destinationObjectKey,
        data: data.AudioStream,
        ContentType: 'audio/mpeg',
      };

      const dynamoTableName = process.env.DYNAMODB_TABLE;

      await Promise.all([
        putObjectToS3(s3Params),
        putItemToDynamoDB(dynamoTableName, {
          id: { S: mp3FileKey },
          text: { S: text },
          voiceId: { S: voiceId },
          bucket: { S: destinationBucketName },
          createdAt: { S: new Date().toISOString() },
          updatedAt: { S: new Date().toISOString() },
        }),
      ]);
    }

    return {
      statusCode: 200,
      body: JSON.stringify(
        { message: 'Successfully executed', data: event },
        null,
        2
      ),
    };
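  } catch (err) {
    // Log the failure so that it shows up in CloudWatch Logs.
    console.error('speakPolly failed:', err);

    return {
      statusCode: 500,
      // An Error object serializes to {} with JSON.stringify, so send its message.
      body: JSON.stringify({ message: 'Error', data: err.message }, null, 2),
    };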
  }
};

async function getObjectFromS3(bucket_name, object_key) {
  const params = {
    Bucket: bucket_name,
    Key: unquotePlus(object_key),
  };

  return await S3.getObject(params).promise();
}

async function putObjectToS3({ Bucket, Key, data, ContentType }) {
  const params = {
    Bucket: Bucket,
    Key: Key,
    Body: data,
    ContentType: ContentType,
  };

  return await S3.putObject(params).promise();
}

async function putItemToDynamoDB(TableName, Item) {
  const params = {
    TableName: TableName,
    Item: Item,
  };

  return await dynamodb.putItem(params).promise();
}

function unquotePlus(s) {
  return decodeURIComponent(s.replace(/\+/g, ' '));
}

Let's understand the Lambda Function.

The default handler function is speakPolly. This function will be called when we invoke our Lambda function. The event object contains the details of the event that triggered the Lambda function. In our case, the event object will contain the details of the S3 object that was created.

From the event object we get the details of the S3 object that was uploaded: the bucket name, the object key, and other details such as the object size. Because S3 URL-encodes object keys in event notifications, the unquotePlus helper decodes the key first. We then use the getObjectFromS3 function to fetch the object from the S3 bucket, and call toString on its body to get the raw text that will be turned into speech.
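As an illustration of the event shape, the sketch below pulls the bucket, key, and size out of a trimmed-down S3 notification. The sample event and the extractS3Objects helper are made up for this example; real notifications carry many more fields.

```javascript
'use strict';

// Trimmed-down S3 "ObjectCreated" notification (illustrative values only).
const sampleEvent = {
  Records: [
    {
      eventName: 'ObjectCreated:Put',
      s3: {
        bucket: { name: 'text-speak-polly-app-dev' },
        object: { key: 'transcripts/hello+world%21.txt', size: 42 },
      },
    },
  ],
};

// Hypothetical helper: return one { bucket, key, size } entry per record.
// Keys arrive URL-encoded with spaces written as '+', so decode them
// the same way unquotePlus does in handler.js.
function extractS3Objects(event) {
  return (event.Records || []).map((record) => ({
    bucket: record.s3.bucket.name,
    key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' ')),
    size: record.s3.object.size,
  }));
}

console.log(extractS3Objects(sampleEvent));
```

Skipping the decoding step is a common source of "NoSuchKey" errors when file names contain spaces or special characters.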

We use the Polly.synthesizeSpeech function to generate the speech; it returns the audio in the AudioStream field of its response. The putObjectToS3 function then writes that audio to the destination S3 bucket under a UUID-based key and resolves with S3's response for the write.
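One practical caveat: SynthesizeSpeech limits how much text a single request may contain (3000 billed characters, to the best of my knowledge), while the handler above sends the whole file at once. The sketch below shows one possible way to split long text at sentence boundaries. splitText is a hypothetical helper, and it approximates billed characters with the JavaScript string length.

```javascript
'use strict';

// Assumed per-request limit for SynthesizeSpeech; check the Polly quotas
// for your account before relying on this number.
const MAX_CHARS = 3000;

// Hypothetical helper: group whole sentences into chunks of at most
// maxChars characters, hard-splitting any sentence that is too long.
function splitText(text, maxChars = MAX_CHARS) {
  // Sentences keep their closing punctuation and trailing whitespace;
  // the second alternative catches a final run without punctuation.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) || [];
  const chunks = [];
  let current = '';

  for (const sentence of sentences) {
    // Flush the current chunk if this sentence would overflow it.
    if (current.length > 0 && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = '';
    }
    // A single over-long sentence is cut into fixed-size pieces.
    let rest = sentence;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current += rest;
  }

  if (current.trim().length > 0) {
    chunks.push(current.trim());
  }
  return chunks;
}

console.log(splitText('One. Two. Three.', 10)); // [ 'One. Two.', 'Three.' ]
```

Each chunk could then be synthesized separately and the resulting MP3 buffers concatenated before the upload.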

The putItemToDynamoDB function saves the metadata (ID, text, voice, bucket, and timestamps) to a DynamoDB table. Both writes run in parallel through Promise.all.
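The low-level DynamoDB client expects every attribute to be wrapped in a type descriptor such as { S: ... } or { N: ... }. The following sketch shows one way to build the metadata item from a plain object with a single shared timestamp. toAttributeValues and buildMetadataItem are hypothetical helpers, and the objectKey and characters attributes are illustrative additions that the article's handler does not store.

```javascript
'use strict';

// Hypothetical helper: convert a flat object of strings and numbers
// into DynamoDB's attribute-value format.
function toAttributeValues(plain) {
  const item = {};
  for (const [name, value] of Object.entries(plain)) {
    if (typeof value === 'number') {
      item[name] = { N: String(value) }; // numbers are transmitted as strings
    } else {
      item[name] = { S: String(value) };
    }
  }
  return item;
}

// Build the item for one synthesized file. The "now" parameter makes the
// timestamps deterministic in tests.
function buildMetadataItem({ id, text, voiceId, bucket }, now = new Date()) {
  const timestamp = now.toISOString();
  return toAttributeValues({
    id,
    text,
    voiceId,
    bucket,
    objectKey: `${id}.mp3`,   // illustrative extra attribute
    characters: text.length,  // illustrative extra attribute
    createdAt: timestamp,
    updatedAt: timestamp,
  });
}

console.log(
  buildMetadataItem({
    id: 'abc',
    text: 'Hello',
    voiceId: 'Joanna',
    bucket: 'audio-speak-polly-app-dev',
  })
);
```

The returned object has the shape that putItemToDynamoDB passes as Item.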

Before we can deploy with the Serverless Framework, we need to write the serverless.yml file that describes the resources and configuration of our application.

service: speak-polly-app
frameworkVersion: '3'

custom:
  text_bucket: text-${self:service}-${self:provider.stage}
  audio_bucket: audio-${self:service}-${self:provider.stage}
  dynamodb_table: mp3-${self:service}-${self:provider.stage}

provider:
  name: aws
  runtime: nodejs14.x
  stage: dev
  region: ap-south-1
  memorySize: 128
  timeout: 100
  profile: serverless-admin
  environment:
    TEXT_BUCKET: ${self:custom.text_bucket}
    AUDIO_BUCKET: ${self:custom.audio_bucket}
    DYNAMODB_TABLE: ${self:custom.dynamodb_table}

  iam:
    role:
      statements:
        - Effect: Allow
          Action: 's3:*'
          Resource:
            - 'arn:aws:s3:::${self:custom.text_bucket}/*'
            - 'arn:aws:s3:::${self:custom.text_bucket}*'
            - 'arn:aws:s3:::${self:custom.audio_bucket}/*'
            - 'arn:aws:s3:::${self:custom.audio_bucket}*'
        - Effect: Allow
          Action: 'polly:*'
          Resource: '*'
        - Effect: Allow
          Action:
            - 'dynamodb:PutItem'
            - 'dynamodb:GetItem'
            - 'dynamodb:DeleteItem'
          Resource:
            - 'arn:aws:dynamodb:${self:provider.region}:*:table/${self:custom.dynamodb_table}'
            - 'arn:aws:dynamodb:${self:provider.region}:*:table/${self:custom.dynamodb_table}/*'

functions:
  speakPolly:
    handler: handler.speakPolly
    environment:
      VOICE_AGENT: Joanna

    events:
      - s3:
          bucket: ${self:custom.text_bucket}
          event: s3:ObjectCreated:*
          rules:
            - prefix: transcripts/
            - suffix: .txt

# you can add CloudFormation resource templates here
resources:
  Resources:
    S3AudioBucket:
      Type: AWS::S3::Bucket
      Properties:
        BucketName: ${self:custom.audio_bucket}
        AccessControl: PublicRead
        VersioningConfiguration:
          Status: Suspended

    DynamoDBTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: ${self:custom.dynamodb_table}
        AttributeDefinitions:
          - AttributeName: id
            AttributeType: S
        KeySchema:
          - AttributeName: id
            KeyType: HASH
        ProvisionedThroughput:
          ReadCapacityUnits: 1
          WriteCapacityUnits: 1

Let's understand the YAML file.

The service property is the name of our application. The frameworkVersion property pins the major version of the Serverless Framework we are using.

The custom property holds our own variables. We use text_bucket, audio_bucket, and dynamodb_table to define the names of the S3 buckets and the DynamoDB table we are going to create.
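To see what those variable references expand to, here is a small sketch that mirrors the interpolation in plain JavaScript. resolveNames is a hypothetical helper; the Serverless Framework performs this substitution itself at deploy time.

```javascript
'use strict';

// Mirror of the three custom variables in serverless.yml:
//   text-${self:service}-${self:provider.stage}, and so on.
function resolveNames(service, stage) {
  return {
    textBucket: `text-${service}-${stage}`,
    audioBucket: `audio-${service}-${stage}`,
    dynamodbTable: `mp3-${service}-${stage}`,
  };
}

console.log(resolveNames('speak-polly-app', 'dev'));
```

Because S3 bucket names are globally unique, a deployment may fail if someone else already owns one of these names; adding a personal suffix to the custom variables is one way around that.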

The provider property configures the cloud provider, which is aws in our case. The runtime property sets the runtime environment of our functions; we're using nodejs14.x. AWS has since deprecated that runtime, so you may need to switch to a newer Node.js version. The stage property is the deployment stage (dev), and the region property is the region where our application and resources will reside, here ap-south-1 (Mumbai). The memorySize property is the memory allotted to the Lambda function, 128 MB. The timeout property is the maximum number of seconds a single invocation can run, 100 seconds. The profile property is the AWS CLI profile used for deployment; I'm using serverless-admin, but if you use the default profile or have a single account configured, you can remove it. The environment property defines environment variables shared by all functions.

The iam property defines the permissions our application needs. The role property describes the IAM role that the Lambda functions assume, and statements lists its policy statements. In each statement, Effect says whether the statement allows or denies access, Action lists the API operations it covers, and Resource lists the ARNs it applies to. The wildcard actions s3:* and polly:* keep the demo simple; in production, narrow them to the operations the function actually calls.

The functions block defines the functions of our application. speakPolly is our only function, but you can declare more. The handler property points to the exported handler, in the form file.function. The environment property defines environment variables specific to the function. The events property lists the events that trigger the function; here, an S3 event fires whenever an object whose key starts with transcripts/ and ends with .txt is created in the text bucket.
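The prefix and suffix rules decide which uploads reach the function. As a rough sketch of that filtering logic (matchesRules is a hypothetical helper; S3 evaluates the rules itself, and the comparison is assumed to be case-sensitive):

```javascript
'use strict';

// Hypothetical helper: true when the key satisfies both filter rules.
function matchesRules(key, { prefix = '', suffix = '' } = {}) {
  return key.startsWith(prefix) && key.endsWith(suffix);
}

// The rules configured for the speakPolly function.
const rules = { prefix: 'transcripts/', suffix: '.txt' };

for (const key of ['transcripts/intro.txt', 'intro.txt', 'transcripts/intro.md']) {
  console.log(key, matchesRules(key, rules));
}
```

Only the first key would trigger the function; a file uploaded to the bucket root or with another extension is ignored.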

The resources property declares additional CloudFormation resources for our application. S3AudioBucket defines the destination S3 bucket: Type is the type of the resource, and Properties sets its configuration. Similarly, DynamoDBTable defines the DynamoDB table, its key schema, and its provisioned capacity. The text bucket is not listed here because the Serverless Framework creates it from the s3 event definition.

With the configuration in place, let's deploy the application using the Serverless CLI.

serverless deploy --verbose

Note: Instead of serverless you can also use sls.

The serverless deploy command will deploy our application. The --verbose flag will display the verbose output of the command.

function-home.png

Using the serverless.yml file, the command creates a CloudFormation stack. The CloudFormation stack in turn creates all the necessary IAM roles, S3 buckets, Lambda functions, database tables, and other required resources.

cloud-formation.png

After the deployment is completed, we'll get the details of the resources that were created.

Now you can upload a text file to the S3 text bucket, under the transcripts/ prefix and with a .txt extension. The upload triggers the speakPolly function, which generates the speech from the text and puts the audio into the audio bucket. The metadata, such as the audio file ID, VoiceId, and creation date, is stored in DynamoDB.
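If you prefer to upload from a script rather than the console, the parameters could be assembled as in the sketch below. buildUploadParams is a hypothetical helper, and the bucket name assumes the default service name and the dev stage from serverless.yml.

```javascript
'use strict';

// Hypothetical helper: build putObject parameters that satisfy the
// function's trigger rules (transcripts/ prefix, .txt suffix).
function buildUploadParams(bucket, fileName, text) {
  return {
    Bucket: bucket,
    Key: `transcripts/${fileName}`,
    Body: text,
    ContentType: 'text/plain; charset=utf-8',
  };
}

const uploadParams = buildUploadParams(
  'text-speak-polly-app-dev', // assumed bucket name
  'hello.txt',
  'Hello from Amazon Polly!'
);
console.log(uploadParams);

// With the aws-sdk v2 client used in handler.js, the upload itself would be:
//   const AWS = require('aws-sdk');
//   await new AWS.S3().putObject(uploadParams).promise();
```

The actual call is left as a comment so that the snippet runs without AWS credentials.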

For every execution, a log stream will be created in the CloudWatch logs.

cloud-watch-logs.png

The output of the speakPolly function, the generated MP3 file, appears in the target S3 bucket.

output-s3-mp3-file.png

Cleanup

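After you test the application, you can remove it with the Serverless CLI so that you don't get charged for the resources that were created. Empty the text and audio buckets first, since CloudFormation can't delete a bucket that still contains objects.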

serverless remove --verbose
