Voice Cloning

Voice cloning is a technology that promises human-sounding synthetic speech, both to support existing applications and to encourage the development of new ones, particularly mobile phone voice mail, announcement messages, and voice-activated features. Although synthesized speech systems go back to 1939, today’s technology offers voice quality so realistic that it justifies being called “cloning.”

Voice cloning is based on technology developed by AT&T Labs and has two components. The first is a text-to-speech engine that turns written words into natural-sounding speech. The second includes a library of voices and the ability to custom-develop a voice, perhaps duplicating a celebrity spokesperson. The English-speaking voice, male or female, can be used to read text on a computer, cell phone, or personal digital assistant (PDA).

The technology can even be added to a car’s computer system to recite driving directions, provide city and restaurant guides, and report on the performance of key subsystems. The speech software is so good at reproducing the sounds, inflections, and intonations of a human voice that it can recreate voices and even bring the voices of long-dead celebrities back to life.

The software, which turns printed text into synthesized speech, makes it possible for a company to use recordings of a person’s voice to utter new things that the person never actually said. The software, called Natural Voices, is not flawless: the synthesized speech may contain a few robotic tones and unnatural inflections. Even so, it is the first text-to-speech software to raise the specter of voice cloning, that is, replicating a person’s voice so perfectly that the human ear cannot tell the difference.

The product itself is provided as a text-to-speech server engine and a client software development kit (SDK), an integrated collection of C++ classes that helps developers add text-to-speech to their applications. The SDK includes a sample application that can be used to explore potential uses of the SDK and the text-to-speech server. Both the server engine and the SDK run on popular computer and development platforms, including Linux, Solaris, and Windows NT and 2000.

An installation package installs the AT&T Labs Natural Voices TTS engine, documentation, tools, class libraries, sample applications, and demo applications onto the target system. AT&T Labs also offers a custom voice product that entails a person going to a studio where staff record 10 to 40 hours of readings.

Texts range from business and news reports to outright babble. The recordings are then chopped into the smallest possible units of sound and sorted into databases. When the software processes text, it retrieves the sounds it needs and reassembles them to form new sentences. In the case of long-dead celebrities, archival recordings can be used in the same way.
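The record, chop, store, and reassemble process described above can be illustrated with a toy sketch. This is not the Natural Voices implementation or its SDK: the function names (build_database, synthesize) are hypothetical, whole words stand in for the much smaller sound units a real system would use, and short lists of integers stand in for audio samples.

```python
from typing import Dict, List

# Hypothetical unit database: each label maps to one recorded snippet.
# Assumption: integer lists stand in for waveform samples.
UnitDatabase = Dict[str, List[int]]


def build_database(recordings: Dict[str, List[int]]) -> UnitDatabase:
    """Index recorded snippets by lower-cased unit label."""
    return {label.lower(): list(samples) for label, samples in recordings.items()}


def synthesize(text: str, database: UnitDatabase) -> List[int]:
    """Retrieve the stored unit for each word and join the units in order."""
    output: List[int] = []
    for word in text.lower().split():
        if word not in database:
            raise KeyError(f"no recorded unit for {word!r}")
        output.extend(database[word])
    return output


# Units taken from studio (or archival) recordings.
recorded = {"good": [1, 2], "morning": [3, 4, 5], "evening": [6, 7]}
db = build_database(recorded)

# "good evening" was never recorded as a phrase, but its units were,
# so a new utterance can be assembled from them.
new_sentence = synthesize("good evening", db)
```

A production system would also have to choose among many candidate recordings of each unit and smooth the joins between them, which this sketch leaves out.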


Potential customers for the software, which is priced in the thousands of dollars, include telephone call centers, companies that make software that reads digital files aloud, and makers of automated voice devices. Businesses could use the software to:

  • Create new revenue-generating applications and services for cell phone users.
  • Improve customer relationships by putting a pleasant-sounding voice interface on applications, products, or services.
  • Realize mobility and “access anywhere/anytime/any device” strategies by making computer-based information accessible by voice.
  • Facilitate international expansion plans through a wide variety of text-to-speech languages.

Third-party developers can use voice cloning technology to add significant enhancements to existing applications and services, drive new revenue opportunities, and add “stickiness” to applications or services. Voice on an e-commerce Web site, for example, can make content easily accessible to the visually impaired, which would keep them coming back for future purchases.

The software also can be used by publishers of video games and books on tape. In the near future, people will want high-end speech technology that enables them to interact at length with their cell phones and Palm organizers instead of typing entries and squinting at a tiny screen.


Issues

Voice cloning technology raises ownership issues. For example, who owns the rights to a celebrity’s voice? This and related issues can be addressed in contracts that include voice-licensing clauses. Current technical limitations may alleviate any worries that a person’s voice could be cloned without permission.

Although the technology is not yet good enough to carry out fraud, synthesized voices eventually may be capable of tricking people into thinking that they are getting phone calls from people they know—such as a politician during an election campaign.

Politicians already make use of machines that perfectly mimic their signatures and handwritten postscript messages, making it appear that they are sending personal letters to constituents. In the not too distant future, we can expect voice cloning to add another personal touch to campaigning.

What is unique about voice cloning is the ability to recreate custom voices. AT&T has previously licensed speech technology to other companies, such as SpeechWorks, but contends that the latest version represents a huge technological leap forward. Despite the technical breakthroughs by AT&T Labs, many engineers are skeptical that a completely simulated voice can be indistinguishable from that of a human.

With the pressure on to perfect the technology, however, it is too soon to rule out this possibility. Already industry analysts are predicting that the market for text-to-speech software will reach more than $1 billion in the next 5 years, providing ample incentive to fine-tune the technology.