Remember the first time you realized that you’d been had? The photo you were looking at wasn’t authentic, but a manipulation, created in Photoshop? Well, here you go again.
At its annual MAX conference, Adobe demoed a new technology, developed with Princeton University, called Project VoCo. Adobe also describes the technology as “Photoshopping Voiceovers,” because it’s essentially Photoshop for the human voice, if Photoshop were as easy to use as Word. You import a voice sample and see its waveform, as in any audio software you know. The system identifies the words in the clip and spells them out in a text editor. If you want to change anything that was said, all you have to do is type a new word. When you play the clip again, that voice you heard saying something that was actually recorded? Now it’s saying something totally different. It’s eerie.
“You guys have been making weird things online with photo editing,” said Adobe’s Zeyu Jin during VoCo’s first public demo. “We’ll do the next thing today. Let’s do something to human speech. Like changing what you said in your wedding.”
The demonstration on stage was uncanny. A man saying, “I kissed my dogs and my wife,” was rearranged to “I kissed my wife and my dogs.” That worked okay (there’s a brief blip in the audio, but it’s fixable), but it was nothing you couldn’t do with conventional audio software. Then, however, Jin changed the clip to say, “I kissed Jordan and my dogs.” The word “Jordan” had never been recorded. And it sounded totally real.
There’s a real business opportunity here. Hollywood productions re-record and dub significant amounts of dialog because it isn’t captured well on location, a time-intensive process that requires getting stars back into audio booths to read their lines again. Theoretically, VoCo could eliminate a lot of that need. It’s also easy to imagine all of these chatbots we’re talking to being able to adopt the voices of people we know and love. So when your spouse texts, you can hear them pester you about the dry cleaning in their own voice.
Of course, this is all assuming that VoCo’s voice generation algorithms are very good. Frankly, we’ve seen far too little to assume that they’ve reached that point of quality yet, and the longer a generated passage runs, the more the errors probably show.
But my interest in VoCo is its uncanny side. Hearing someone say something they didn’t say feels viscerally wrong. Maybe it’s because I’m a journalist who constantly refers to audio recordings to get quotes straight that I immediately imagine a “leaked” recording of Hillary or Trump that isn’t real at all, with an endless wave of political show hosts debating it with audio experts.
Didn’t Photoshop offer the same potential for abuse? To some extent, sure. However, the UX that Adobe has shown for Project VoCo has way less of a learning curve than Photoshop. A Photoshop master can do anything with the software, but most of us don’t know how to use layer masks and color burns and all the subtle tools digital artists need to take full advantage of it. So while we could use Photoshop to forge a nuclear arsenal inside North Korea, we’re more likely to stamp a friend’s head on a bodybuilder for some office gag.
But Project VoCo’s trick doesn’t need a complex toolbar or years of study to wield. Its user interface is literally just words, which you can delete and re-type. All skill has been removed from the equation, while algorithms work behind the scenes to make your wildest dreams come true.
Consider what happens when media manipulation is both all-capable and all-simple, when we can convincingly render anyone we know doing or saying anything we like. Maybe that time is still a century off, but if VoCo demonstrates anything, it’s that we might need a century to get used to the idea and prepare for the consequences.