This post is about the recent experience I had creating a synthetic version of my own voice.
I’ve been interested in synthetic speech since starting my Homie & Lexy podcast, where I generate the character voices in Polly, Amazon’s text-to-speech engine.
In my VoiceMarketing episode titled “How listenable is synthetic speech,” I looked at how Polly performs on various types of content, including the Gettysburg Address, simple jokes, a newscaster script, and a Shakespearean sonnet.
Throughout it all, I’ve gotten quite familiar with SSML (Speech Synthesis Markup Language).
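For anyone who hasn’t worked with it, SSML is just XML markup wrapped around the text you send to an engine like Polly, controlling things like pauses and emphasis. Here’s a minimal sketch of the kind of markup I mean, assembled in Python. The phrase, pause length, and helper function are illustrative placeholders, not something from my actual scripts:

```python
# Build a small SSML document of the sort Amazon Polly accepts.
# The text, pause length, and emphasis line are illustrative only.
def make_ssml(text: str, pause_ms: int = 300) -> str:
    return (
        "<speak>"
        f"{text}"
        f'<break time="{pause_ms}ms"/>'
        '<emphasis level="moderate">And that is the point.</emphasis>'
        "</speak>"
    )

ssml = make_ssml("Four score and seven years ago.")
print(ssml)
```

You’d then hand a string like this to the engine with the input type set to SSML rather than plain text.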
I’d also had several conversations with Rupal Patel, CEO of VocaliD, a company doing interesting work in creating synthetic voices. (She was recently interviewed on a Voicebot.ai episode.)
So when Rupal asked if I’d like to go through the synthetic speech process and create a synthetic version of my voice, I was intrigued and jumped at the opportunity. (Full transparency, this article isn’t a paid endorsement for VocaliD. I just found the process interesting and the results impressive, and wanted to write about it.)
Here’s an outline of the process I went through for creating my synthetic voice.
Equipment
Two big points here.
- Have a quiet recording environment
- Use a good mic/headset
Both of those may seem obvious, but small issues can lead to big problems in the results. Even a little background noise can derail the compiling of the speech, since the technology keys on minute differences in the recordings, and that includes any non-speech noise that slips in.
For microphones, I used a Sennheiser SC 60, which I use for my podcasts, and it worked well. VocaliD recommends the Turtle Beach Recon 50 gaming headset. Both are reasonably priced at about $40.
Recording
The recording process was fascinating. I read a total of 2,000 statements (they need about 90 minutes of clean audio recordings to work with). The statements ranged from a couple of sentences to short phrases with only a few words. This was all provided by VocaliD, and I recorded it online through their website interface. It took me somewhere around 3-4 hours over the course of a couple days to complete the recordings.
The phrases were an interesting mix of text. Mark Twain. Shakespeare. Famous novels. And even phrases from my own podcasts. More than once I had to do second takes because I’d laughed out loud at what I was reading.
Compositing
This is where the magic happens. And I’m not going to dive into a technical explanation of the machine learning advances and end-to-end neural synthesis stuff.
For technical details, I’ll refer you to the VocaliD website.
On a more pedestrian level, the clean audio samples are used to train an end-to-end synthesizer, which learns to emulate the speaker’s sound patterns and intonation. That voice training process runs for between 24 and 48 hours. So a fair bit of processing.
Demonstration
Let’s run through a few demonstrations comparing my voice to my synthetic voice across different types of phrases: a question, a statement, projecting excitement, and using irony.
Question
Me:
Synthetic Me:
Statement
Me:
Synthetic Me:
Excitement
Me:
Synthetic Me:
Irony
Me:
Synthetic Me:
Results
When I first heard the synthetic version of myself, I was quite impressed. It’s a bit trippy hearing yourself saying something you’ve never actually said. Welcome to the future!
While synthetic voices still struggle with many types of content, I found the diction of VocaliD’s voice impressive for what you might call mid-range content: texts that can be delivered in a fairly straightforward manner. Not crazy humor. Not high drama. But needed information.
What I think this level of synthesis does more than anything is extend how long a synthetic voice can read within that mid-range of content.
In my episode “How listenable is synthetic speech,” the main factor in whether synthetic speech was usable was the type of content. But coming in shortly behind that is length. A short news or weather clip? That’s doable. A six-minute reading of a magazine article? Not so much.
Use Cases
What are some use cases where we could apply synthetic speech?
My favorite, and this is actually what charted the course for VocaliD, is giving voice to people who are voice-disabled. Rupal makes a great point about how you wouldn’t attach just any prosthetic arm to a small child in need of one. It’s similar for someone who needs a synthetic voice. This creates the possibility for them to have a voice that’s distinct and tied to their personality.
Another example is for brand consistency.
Consider a company like Allstate Insurance, and their use of Dennis Haysbert for their commercials. By synthesizing his voice, they could apply those same brand qualities to their company’s interactive voice system. Any voice assistant apps they develop. And for some companies, that would even apply to physical products they develop.
For the money they’ve spent positioning their brand, defining the right voice to represent that brand, and contracting Mr. Haysbert’s talents, the additional cost of synthesizing his voice is nominal, and it extends the value of work they’ve already done.
Another interesting possibility is the creation of previously unattainable brand voices. VocaliD can not only synthesize one person’s voice, but also combine multiple voices to create a brand voice that’s not just distinct, but perhaps has qualities never before captured in a single voice. A voice avatar, perhaps.
Sensitivities and Considerations
I’d be remiss if I didn’t mention that the idea of synthesizing someone’s voice introduces a lot of legal, not to mention ethical, considerations. Given that I’m not an attorney, I’m not going to do a deep dive here. But clearly the idea of creating synthetic speech from a person’s voice identity will bring up a lot of issues and discussions in the coming years. I do know the folks at VocaliD have already been thinking about this and are working on ways to watermark a talent’s voice, similar to technologies already used for videos and images.
As the world of voice technology, voice and audio devices, and voice-delivered content evolves over the next few years, the synthetic voice space is going to offer brands, content developers, and voice talents some unprecedented opportunities. My sense is these changes are going to happen swiftly, and navigating this part of the voice landscape is going to be key to maximizing the impact and value of audio content.