For years, brands have relied on famous actors to provide narration for ads, which creates a sense of familiarity and imparts the desired tone. But actors can be expensive, and their voices usually have a limited shelf life for commercial use. A synthetic voice offers an effective alternative.
Furthermore, AI voices make editing voiceovers quick and easy. Users can simply change their script and regenerate the AI voice automatically. Synthetic voice is more accessible than ever in this growing AI and ML world, and what you can do with it is limited only by your imagination. That being said, converting text or SSML to natural-sounding voices isn't as difficult as it used to be.
In this article, we'll see how AWS Polly, an AI-based text-to-speech service, can be used to convert text to speech in your desired language. We'll also see how to use AWS Lambda to run the conversion, save the audio to AWS S3, and store its metadata in DynamoDB.
Services
What is Amazon Polly?
Amazon Polly is a service that turns text into lifelike speech. It's a text-to-speech (TTS) service that uses advanced deep learning technologies to synthesize speech that sounds like a human voice. It provides dozens of lifelike voices in multiple languages. You can create applications that talk and build entirely new categories of speech-enabled products. Polly is a fully managed service that makes it easy for developers to add speech to their applications.
Polly is a paid service with a free tier, which includes 5 million characters per month for 12 months. After that, you'll have to pay for the service. You can find the pricing details here.
With AWS Polly, developers can:
- Customize and control speech output with lexicons and Speech Synthesis Markup Language (SSML) tags.
- Store and redistribute speech in standard formats like MP3 and OGG.
- Quickly deliver lifelike voices and conversational user experiences with consistently fast response times.
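To make the SSML support a little more concrete, here is a minimal sketch of building an SSML document for Polly. The helper names escapeXml and buildSsml are hypothetical (they are not part of any AWS library), and the break and prosody tags are just two of the SSML tags Polly supports. The resulting string would be passed to synthesizeSpeech with TextType set to ssml instead of text.

```javascript
'use strict';

// Escape the characters that are reserved in XML so plain text can be
// embedded safely inside SSML tags.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical helper: joins sentences with a pause between them and wraps
// everything in a <prosody> tag that controls the speaking rate.
function buildSsml(sentences, { pauseMs = 500, rate = 'medium' } = {}) {
  const body = sentences
    .map((sentence) => escapeXml(sentence))
    .join(`<break time="${pauseMs}ms"/>`);
  return `<speak><prosody rate="${rate}">${body}</prosody></speak>`;
}

const ssml = buildSsml(['Hello & welcome.', 'This is Polly.'], { rate: 'slow' });

// Parameters in the shape expected by Polly's synthesizeSpeech call.
const params = {
  OutputFormat: 'mp3',
  Text: ssml,
  TextType: 'ssml', // tells Polly to interpret Text as SSML
  VoiceId: 'Joanna',
};

console.log(params.Text);
```

Escaping matters here because a stray & or < in the input text would otherwise make the SSML invalid.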
[Figure: How Polly works, taken from the AWS documentation.]
How to use AWS Polly?
For a quick, simple demo of AWS Polly, you can use the AWS Polly service page.
- Go to the AWS Polly console and click on the Get Started Now button.
- Click on Text to Speech.
- Select the language and voice you want to use.
- Enter the text you want to convert to speech.
- Click on Play to listen to the speech.
- Click on Download to download the speech.
Without further ado, let's get started and build a simple application that converts text to speech, saves the MP3 file to AWS S3, and stores the metadata in DynamoDB.
What is AWS Lambda?
AWS Lambda is a compute service that lets you run code without provisioning or managing servers. AWS Lambda executes your code only when needed and scales automatically, from a few requests per day to thousands per second. You pay only for the compute time you consume - there is no charge when your code is not running. With Lambda, you can run code for virtually any type of application or backend service - all with zero administration. Just upload your code and Lambda takes care of everything required to run and scale your code with high availability. You can set up your code to automatically trigger from other AWS services or call it directly from any web or mobile app.
What is AWS S3?
AWS S3 is a simple storage service that offers an extremely durable, highly available, and infinitely scalable object storage infrastructure at very low costs. It is designed to make web-scale computing easier for developers. Amazon S3 provides developers and IT teams with secure, durable, highly-scalable object storage. Amazon S3 is easy to use, with a simple web services interface to store and retrieve any amount of data from anywhere on the web.
What is DynamoDB?
AWS DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. DynamoDB lets you offload the administrative burdens of operating and scaling a distributed database, so that you don't have to worry about hardware provisioning, setup, and configuration, replication, software patching, or cluster scaling.
Prerequisites
- You need to have an AWS account. If you don't have one, you can create one here.
- You need to have Node.js installed on your machine. You can download it from here.
- You need to have the AWS CLI installed on your machine. You can download it from here.
- You need to have the AWS CLI configured. You can configure it by running aws configure.
- You need to have the Serverless Framework installed on your machine. You can download it from here.
Steps
Note: You can find the complete code for this article in my GitHub Repository.
Initializing the app.
With the Serverless Framework, we can generate boilerplate code for our application. We'll use the serverless create command to generate it.
sls create --template aws-nodejs --path speak-polly-app --name speak-polly-app
Here, --template specifies the template we want to use; we're using the aws-nodejs template. The --path option specifies where to create the project; we're creating it in the speak-polly-app folder. The --name option specifies the name of the project; we're naming it speak-polly-app.
After running the above command, you'll see a folder named speak-polly-app in your current directory. This folder contains the boilerplate code for our application.
cd speak-polly-app
Now, let's install the dependencies for our application. We'll be using the aws-sdk and uuid packages. The aws-sdk package lets us interact with AWS services and APIs. The uuid package generates a unique ID for each speech file.
npm install aws-sdk uuid
Code
Now, let's open the handler.js file in our project. This file contains the code for our application.
'use strict';
const AWS = require('aws-sdk');
const { v4: uuidV4 } = require('uuid');

// Service clients are created once and reused across invocations.
const Polly = new AWS.Polly({ apiVersion: '2016-06-10' });
const S3 = new AWS.S3({ apiVersion: 'latest' });
const dynamodb = new AWS.DynamoDB({ apiVersion: '2012-08-10' });

module.exports.speakPolly = async (event) => {
  try {
    const records = event['Records'];
    for (const record of records) {
      // Locate the uploaded text file from the S3 event record.
      const bucketName = record['s3']['bucket']['name'];
      const objectKey = record['s3']['object']['key'];
      // const objectSize = record['s3']['object']['size'];
      const objectDetails = await getObjectFromS3(bucketName, objectKey);
      const text = objectDetails.Body.toString('utf-8');

      // Convert the text to speech with Polly.
      const voiceId = process.env.VOICE_AGENT || 'Carla';
      const params = {
        Engine: 'standard',
        OutputFormat: 'mp3',
        SampleRate: '22050',
        Text: text,
        TextType: 'text',
        VoiceId: voiceId,
      };
      const data = await Polly.synthesizeSpeech(params).promise();

      // Store the audio under a unique key in the destination bucket.
      const destinationBucketName = process.env.AUDIO_BUCKET;
      const mp3FileKey = uuidV4();
      const destinationObjectKey = `${mp3FileKey}.mp3`;
      const s3Params = {
        Bucket: destinationBucketName,
        Key: destinationObjectKey,
        data: data.AudioStream,
        ContentType: 'audio/mpeg',
      };
      const dynamoTableName = process.env.DYNAMODB_TABLE;

      // Save the MP3 file and its metadata in parallel.
      await Promise.all([
        putObjectToS3(s3Params),
        putItemToDynamoDB(dynamoTableName, {
          id: { S: mp3FileKey },
          text: { S: text },
          voiceId: { S: voiceId },
          bucket: { S: destinationBucketName },
          createdAt: { S: new Date().toISOString() },
          updatedAt: { S: new Date().toISOString() },
        }),
      ]);
    }
    return {
      statusCode: 200,
      body: JSON.stringify(
        { message: 'Successfully executed', data: event },
        null,
        2
      ),
    };
  } catch (err) {
    // Log the error so that it shows up in the CloudWatch logs.
    console.error(err);
    return {
      statusCode: 500,
      body: JSON.stringify({ message: 'Error', data: err }, null, 2),
    };
  }
};

// Fetches an object from S3; the key from the event is URL-encoded.
async function getObjectFromS3(bucket_name, object_key) {
  const params = {
    Bucket: bucket_name,
    Key: unquotePlus(object_key),
  };
  return await S3.getObject(params).promise();
}

// Writes the given data to S3 under the given bucket and key.
async function putObjectToS3({ Bucket, Key, data, ContentType }) {
  const params = {
    Bucket: Bucket,
    Key: Key,
    Body: data,
    ContentType: ContentType,
  };
  return await S3.putObject(params).promise();
}

// Saves an item (in DynamoDB attribute-value format) to the given table.
async function putItemToDynamoDB(TableName, Item) {
  const params = {
    TableName: TableName,
    Item: Item,
  };
  return await dynamodb.putItem(params).promise();
}

// Decodes an S3 event key: '+' stands for a space, the rest is percent-encoded.
function unquotePlus(s) {
  return decodeURIComponent(s.replace(/\+/g, ' '));
}
Let's understand the Lambda Function.
The handler function is speakPolly. It is called whenever our Lambda function is invoked. The event object contains the details of the event that triggered the Lambda function. In our case, the event object will contain the details of the S3 object that was created.
From the event object we read the bucket name and object key of the uploaded S3 object; other details, such as the object size, are available as well. We use the getObjectFromS3 function to fetch the object from the S3 bucket. Calling toString on the object's Body gives us the raw text from which the speech is generated.
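To illustrate what the handler reads from the event, here is a small sketch with a trimmed-down S3 event. The sample values and the extractObjects helper are illustrative assumptions; a real event carries many more fields. Note that S3 URL-encodes object keys in event notifications, which is why the key is decoded before use.

```javascript
'use strict';

// Decode an S3 event key: '+' stands for a space, the rest is percent-encoded.
function unquotePlus(s) {
  return decodeURIComponent(s.replace(/\+/g, ' '));
}

// Hypothetical helper: pulls the bucket name, decoded key, and size out of
// every record in an S3 event.
function extractObjects(event) {
  return (event.Records || []).map((record) => ({
    bucket: record.s3.bucket.name,
    key: unquotePlus(record.s3.object.key),
    size: record.s3.object.size,
  }));
}

// Trimmed-down sample of an ObjectCreated event (illustrative values only).
const sampleEvent = {
  Records: [
    {
      eventName: 'ObjectCreated:Put',
      s3: {
        bucket: { name: 'text-speak-polly-app-dev' },
        object: { key: 'transcripts/hello+world%21.txt', size: 42 },
      },
    },
  ],
};

console.log(extractObjects(sampleEvent));
```

Without the decoding step, a file uploaded as "hello world!.txt" would be looked up under its encoded name and the getObject call would fail.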
We use the Polly.synthesizeSpeech function to generate the speech, which it returns in the form of an audio stream. The putObjectToS3 function then writes the audio stream to the destination S3 bucket and returns the S3 response for the object that was created.
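One thing the handler above does not deal with is long input. Assuming the documented limit of 3,000 billed characters per synthesizeSpeech request, a longer transcript would have to be split before synthesis. The following is only a sketch of one way to do that; the splitText helper is hypothetical, it counts JavaScript string length (which may differ slightly from how Polly counts characters), and joining the resulting audio pieces is left out.

```javascript
'use strict';

// Assumed per-request limit for plain text sent to synthesizeSpeech.
const MAX_CHARS = 3000;

// Hypothetical helper: greedily packs whole sentences into chunks of at most
// maxChars characters. A single sentence longer than maxChars is hard-split.
function splitText(text, maxChars = MAX_CHARS) {
  // Split after '.', '!' or '?' when followed by whitespace.
  const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  const chunks = [];
  let current = '';

  for (const sentence of sentences) {
    let rest = sentence;

    // Hard-split a sentence that cannot fit into a chunk on its own.
    while (rest.length > maxChars) {
      if (current) {
        chunks.push(current);
        current = '';
      }
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    if (rest.length === 0) continue;

    // Append to the current chunk if it still fits, else start a new one.
    const candidate = current ? `${current} ${rest}` : rest;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      chunks.push(current);
      current = rest;
    }
  }

  if (current) chunks.push(current);
  return chunks;
}

console.log(splitText('One. Two! Three?', 10));
```

Each chunk would then be passed to Polly.synthesizeSpeech as its own request.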
The putItemToDynamoDB function saves the metadata to a DynamoDB table.
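The low-level DynamoDB client expects every attribute to be wrapped in a type descriptor, such as { S: '...' } for strings. As a minimal sketch, the item from the handler could be built by a small helper like the one below. The name buildMetadataItem is hypothetical; it also computes the timestamp once so that createdAt and updatedAt are guaranteed to match.

```javascript
'use strict';

// Hypothetical helper: builds the metadata item in DynamoDB's
// attribute-value format, where { S: ... } marks a string attribute.
function buildMetadataItem({ id, text, voiceId, bucket, now = new Date() }) {
  const timestamp = now.toISOString(); // computed once, used for both fields
  return {
    id: { S: id },
    text: { S: text },
    voiceId: { S: voiceId },
    bucket: { S: bucket },
    createdAt: { S: timestamp },
    updatedAt: { S: timestamp },
  };
}

const item = buildMetadataItem({
  id: 'example-id', // in the handler this is the generated UUID
  text: 'Hello from Polly!',
  voiceId: 'Joanna',
  bucket: 'audio-speak-polly-app-dev',
  now: new Date('2023-01-01T00:00:00.000Z'),
});

console.log(item);
```

The returned object can be passed directly as the Item argument of putItemToDynamoDB.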
Before we can deploy our application with the Serverless Framework, we need to write the serverless.yml file, which describes all the resources and configuration of our application.
service: speak-polly-app
frameworkVersion: '3'
custom:
  text_bucket: text-${self:service}-${self:provider.stage}
  audio_bucket: audio-${self:service}-${self:provider.stage}
  dynamodb_table: mp3-${self:service}-${self:provider.stage}
provider:
  name: aws
  runtime: nodejs14.x
  stage: dev
  region: ap-south-1
  memorySize: 128
  timeout: 100
  profile: serverless-admin
  environment:
    TEXT_BUCKET: ${self:custom.text_bucket}
    AUDIO_BUCKET: ${self:custom.audio_bucket}
    DYNAMODB_TABLE: ${self:custom.dynamodb_table}
  iam:
    role:
      statements:
        - Effect: Allow
          Action: 's3:*'
          Resource:
            - 'arn:aws:s3:::${self:custom.text_bucket}/*'
            - 'arn:aws:s3:::${self:custom.text_bucket}*'
            - 'arn:aws:s3:::${self:custom.audio_bucket}/*'
            - 'arn:aws:s3:::${self:custom.audio_bucket}*'
        - Effect: Allow
          Action: 'polly:*'
          Resource: '*'
        - Effect: Allow
          Action:
            - 'dynamodb:PutItem'
            - 'dynamodb:GetItem'
            - 'dynamodb:DeleteItem'
          Resource:
            - 'arn:aws:dynamodb:${self:provider.region}:*:table/${self:custom.dynamodb_table}'
            - 'arn:aws:dynamodb:${self:provider.region}:*:table/${self:custom.dynamodb_table}/*'
functions:
  speakPolly:
    handler: handler.speakPolly
    environment:
      VOICE_AGENT: Joanna
    events:
      - s3:
          bucket: ${self:custom.text_bucket}
          event: s3:ObjectCreated:*
          rules:
            - prefix: transcripts/
            - suffix: .txt
# you can add CloudFormation resource templates here
resources:
  Resources:
    S3AudioBucket:
      Type: AWS::S3::Bucket
      Properties:
        BucketName: ${self:custom.audio_bucket}
        AccessControl: PublicRead
        VersioningConfiguration:
          Status: Suspended
    DynamoDBTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: ${self:custom.dynamodb_table}
        AttributeDefinitions:
          - AttributeName: id
            AttributeType: S
        KeySchema:
          - AttributeName: id
            KeyType: HASH
        ProvisionedThroughput:
          ReadCapacityUnits: 1
          WriteCapacityUnits: 1
Let's understand the YAML file.
The service property is the name of our application. The frameworkVersion property is the version of the Serverless Framework we are using.
The custom property is used to define custom variables. We use the text_bucket, audio_bucket, and dynamodb_table variables to define the names of the S3 buckets and the DynamoDB table we are going to create.
The provider property defines the cloud provider of our application; we're using aws. The runtime property is the runtime environment of our application; we're using nodejs14.x. The stage property is the deployment stage; we're using dev. The region property is the region in which our application and resources will reside; we're using ap-south-1 (Mumbai). The memorySize property is the amount of memory the Lambda function gets; we're using 128 MB. The timeout property is the maximum number of seconds our function can run in a single invocation; we're using 100 seconds. The profile property is the AWS profile used for deployment; here I'm using the serverless-admin profile. If you're using the default profile or have a single account configured, you can remove it. The environment property defines the environment variables of our Lambda function.
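Inside the function, these environment variables are available through process.env. As a sketch of how they could be read and validated in one place, consider the following. The loadConfig helper is hypothetical and not part of the handler above; the variable names and the Carla fallback mirror the ones used in this article.

```javascript
'use strict';

// Hypothetical helper: reads the settings defined in serverless.yml from the
// environment and fails fast if a required one is missing.
function loadConfig(env = process.env) {
  const required = ['AUDIO_BUCKET', 'DYNAMODB_TABLE'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    audioBucket: env.AUDIO_BUCKET,
    tableName: env.DYNAMODB_TABLE,
    // Same fallback voice as in the handler when VOICE_AGENT is not set.
    voiceId: env.VOICE_AGENT || 'Carla',
  };
}

// With stage "dev", the names from serverless.yml resolve to these values.
const config = loadConfig({
  AUDIO_BUCKET: 'audio-speak-polly-app-dev',
  DYNAMODB_TABLE: 'mp3-speak-polly-app-dev',
  VOICE_AGENT: 'Joanna',
});

console.log(config);
```

Failing early with a clear message tends to be easier to debug than an S3 or DynamoDB call that fails because a bucket or table name is undefined.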
The iam property defines the IAM permissions that our application is going to need. The role property defines the IAM role of our application, and statements lists the policy statements attached to it. In each statement, the Effect property says whether the statement allows or denies access, the Action property lists the actions it covers, and the Resource property lists the resources it applies to.
The functions block defines the functions of our application. speakPolly is our only function, but you can declare more than one. The handler property points to the handler function of the Lambda function. The environment property defines environment variables specific to the function. The events property defines the events that can trigger the function. Here, the declared S3 event triggers the speakPolly function whenever an object whose key starts with transcripts/ and ends with .txt is created in the text bucket.
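The prefix and suffix rules act as a filter on the object key. S3 applies this filter itself, so nothing like the following runs in our Lambda function; the sketch below, with a hypothetical matchesRules helper, only mirrors the behavior to show which uploads would trigger speakPolly.

```javascript
'use strict';

// Hypothetical helper mirroring the S3 notification filter: a key matches
// when it starts with the prefix and ends with the suffix (case-sensitive).
function matchesRules(key, { prefix = '', suffix = '' } = {}) {
  return key.startsWith(prefix) && key.endsWith(suffix);
}

// The rules declared for the speakPolly function in serverless.yml.
const rules = { prefix: 'transcripts/', suffix: '.txt' };

console.log(matchesRules('transcripts/hello.txt', rules)); // would trigger
console.log(matchesRules('hello.txt', rules)); // wrong prefix, ignored
console.log(matchesRules('transcripts/hello.mp3', rules)); // wrong suffix, ignored
```

In practice this means text files have to be uploaded under the transcripts/ folder of the text bucket.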
The resources property defines the additional AWS resources of our application. S3AudioBucket defines the destination S3 bucket. The Type property is the type of the resource, and the Properties block defines its properties. Similarly, DynamoDBTable defines a DynamoDB table and its attributes.
Now, let's deploy our application with the Serverless Framework CLI.
serverless deploy --verbose
Note: Instead of serverless you can also use sls.
The serverless deploy command deploys our application. The --verbose flag displays the verbose output of the command.
Using the serverless.yml file, the command creates a CloudFormation stack. The CloudFormation stack in turn creates all the necessary IAM roles, S3 buckets, Lambda functions, database tables, and other required resources.
After the deployment is completed, we'll get the details of the resources that were created.
Now you can upload a text file to the S3 text bucket, which triggers the speakPolly function. The function generates the speech from the text and puts the audio stream into the audio S3 bucket. The metadata, such as the audio file ID, VoiceId, and creation date, is stored in DynamoDB.
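The upload can be done from the S3 console, but as a sketch it could also be scripted with the same aws-sdk package. The helper names below are hypothetical, the bucket name assumes the dev stage from serverless.yml, and the script assumes your default AWS credentials have access to that bucket.

```javascript
'use strict';

// Hypothetical helper: builds putObject parameters whose key matches the
// transcripts/ prefix and .txt suffix that trigger speakPolly.
function buildUploadParams(bucket, fileName, text) {
  return {
    Bucket: bucket,
    Key: `transcripts/${fileName}.txt`,
    Body: text,
    ContentType: 'text/plain; charset=utf-8',
  };
}

// Uploads the text to S3. Requires the aws-sdk package and AWS credentials;
// the SDK is loaded lazily so the parameter builder works without it.
async function uploadTranscript(bucket, fileName, text) {
  const AWS = require('aws-sdk');
  const s3 = new AWS.S3();
  return s3.putObject(buildUploadParams(bucket, fileName, text)).promise();
}

const params = buildUploadParams(
  'text-speak-polly-app-dev',
  'hello',
  'Hello from Polly!'
);
console.log(params.Key);
```

Calling uploadTranscript('text-speak-polly-app-dev', 'hello', 'Hello from Polly!') would then create the object and, through the S3 event, start the conversion.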
For every execution, a log stream will be created in the CloudWatch logs.
The output of the speakPolly function, the MP3 file, can be seen in the target S3 bucket.
Cleanup
After you test the application, you can remove it with the Serverless Framework to make sure you don't get charged for the resources that were created.
serverless remove --verbose