The Voice Design Revolution and How to Join It: A Discussion with Cathy Pearl

By Ed Wood, CareerFoundry contributor

Voice user interfaces will revolutionize the way we interact with technology. But we would say that, wouldn’t we?

As the first provider of a comprehensive specialization course in VUI, CareerFoundry is confident that the wave of demand for voice designers is on its way. Why? Well, it’s not just because it’s one of the coolest new job titles around.

The major players in tech, from Amazon to Google, are investing heavily in voice. By September 2017, a mind-blowing 5,000 people were working on Amazon Alexa (the name given to both the company’s digital assistant and the wing of the tech behemoth responsible for it), and the company was in the process of recruiting a further 1,000. As Rob Pulciani, director of Amazon Alexa, said at the launch of the CareerFoundry course: “Voice is a brand new way of interacting with technology, and one that we believe is the future of computing.”

So why is voice design (we’ll have to get used to this collocation) gaining such traction now? After all, Google first introduced voice search as far back as 2009, with Apple releasing its first incarnation of Siri in 2011. The difference is that the technology has now crossed the threshold at which it’s truly useful to the user. This threshold, estimated to equate to roughly 95% accuracy in comprehension, could well mark the inflection point at which we begin to tire of poking and prodding our devices and prefer, instead, to talk to them.

Today’s VUIs are quickly growing smarter, learning the user’s speech patterns over time and even building their own vocabulary. With VUIs fast-evolving, voice is poised to be the next major disruption in computing. (Amazon)

Adoption rates for products like Amazon Alexa and Voice Search indicate that the migration towards this technology is well and truly underway: the share of U.S. adults using voice search on a daily basis is hovering around 50%, the share of teens using it has already surpassed that mark, and ComScore estimates that 50% of all searches will be voice-based by 2020. Furthermore, the number of voice assistants installed on smartphones is expected to reach 5 billion by 2022.

Amazon Alexa has already sold over 20 million devices, which equates to 75% of the market, and during the Black Friday holiday shopping weekend of 2017, the Amazon Echo Dot was the best-selling product of any manufacturer in any category on Amazon. Google Home, Echo’s closest rival, has sold approximately 5 million devices, and both Apple and Microsoft are now entering the market with the HomePod and Invoke.

Given voice’s rapid development, we thought it worthwhile to get in touch with one of the industry’s stars, Cathy Pearl, to discuss how she first got started, how others can get started now, the current challenges, and what the future holds for the field.

From an early age, Cathy held a keen interest in communicating with machines: even as a child, she was already programming computers, and her one true desire was for “the computer to talk back to me.” She went on to complete an undergraduate degree in Cognitive Science, followed by a master’s in Computer Science.

Her professional experience stretches across almost two decades: from voice work for Nuance Communications, one of the trailblazers in voice recognition technology, in 1999 to designing voice user interfaces for everything from banks to fashion brands.

She now leads the user experience team at Sensely, a 30-strong San Francisco startup in the healthcare sector whose virtual avatar, Molly, assists chronically ill patients in managing their conditions.

The app is primarily voice-controlled, connecting patients to medical devices for measurement and lessening the need for repeated visits to the doctor. She’s also the author of the recently released book Designing Voice User Interfaces.

Describing Voice Design for the Uninitiated

There’s a great Mitchell and Webb dinner party sketch in which a brain surgeon encounters first an accountant, then a charity worker, and compliments both on their choice of occupation before following up with the gleeful gibe: “Well, it’s not exactly brain surgery, is it?” Then, in strolls a rocket scientist, who gives him a taste of his own medicine: “Brain surgery? Well, it’s not exactly rocket science, is it?”

I recall this sketch when I ask Cathy how she describes her job to those lucky enough to be by her side at a dinner party. She turns the concept on its head: “Well, imagine you’re speaking to a person at a dinner party who isn’t following basic conversational rules, which would be very frustrating. For example, you ask, ‘Do you have the time?’ to which they reply, ‘Yes.’ That would be pretty annoying.”

I cast my mind back to innumerable stilted conversations at the dining table, but none quite that stilted. Cathy continues: “Technically, they have answered your question; they’ve understood you. But, boy, is it frustrating and totally not what you wanted, so you follow up and ask, ‘Well, what is it?’ and they reply, ‘I’m sorry, I don’t understand,’ because they don’t know what ‘it’ refers to.

“So this is where we’re at with a lot of voice systems currently. Technically, we understood the words coming out of your mouth, but we’re not making use of them in the right way; we’re not doing it in an expected, elegant way. This is the point of VUI design: it’s not just the technology; we also have to think a lot about conversational rules. How do people expect the conversation to go? How will people actually speak to the device? And then we have to make sure all of this is taken into account and make a design that works for them.”
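To make the dinner party example concrete, here is a minimal sketch in Python of a system that remembers what the conversation is about, so that a follow-up like “what is it?” can be resolved. Everything in it is an illustrative assumption rather than how any real assistant works: the `Conversation` class, the keyword matching, and the fixed `FIXED_TIME` value standing in for a real clock are all hypothetical.

```python
FIXED_TIME = "3:30 pm"  # hypothetical stand-in for a real clock


class Conversation:
    """Keeps the last topic so a follow-up such as 'what is it?' can be resolved."""

    def __init__(self):
        self.last_topic = None  # nothing has been discussed yet

    def reply(self, utterance: str) -> str:
        text = utterance.lower()
        if "the time" in text:
            # Conversational rule: "Do you have the time?" is a request,
            # not a yes/no question, so answer with the time itself.
            self.last_topic = "time"
            return f"It's {FIXED_TIME}."
        if "what is it" in text and self.last_topic == "time":
            # "it" is resolved using the remembered topic.
            return f"It's {FIXED_TIME}."
        # Without context, the follow-up cannot be understood.
        return "I'm sorry, I don't understand."


chat = Conversation()
print(chat.reply("Do you have the time?"))
print(chat.reply("Well, what is it?"))
```

A brand new `Conversation` that is asked “Well, what is it?” straight away falls through to the apology, which is exactly the frustrating behaviour Cathy describes.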

Accounting for Cultural Differences in Conversational Rules

Anyone who speaks or has learned a second language (or merely travelled by public transport in other countries) will know that there are marked differences in conversational rules from one country to another.

I ask how voice designers approach this, and as it turns out, one needn’t look to foreign languages to be confronted with such problems:

“You have to think about localization. Even going from American English to British English, we’ve had the feedback that the responses were too enthusiastic, so we had to tone it down. And within the U.S., the pause people take before speaking is shorter for east-coasters than west-coasters, so west-coasters think, ‘why aren’t they letting me speak?’, and east-coasters think, ‘why aren’t they speaking?’ That kind of thing is currently difficult for our voice systems to cope with.”

Constraints require designers to be creative: “Because we can’t handle that more subtle turn-taking, we as designers have to make it really clear, in an elegant way, when an input is expected and what type of input is expected. That’s something which is really key right now.” Perhaps somewhat naively, I’d assumed that the grand aim of voice design would be to mimic human-to-human communication as accurately as possible, and I suggest that the instructional approach contravenes this:

“I’m not sure where I fall on that. On the one hand, I’m a big fan of human-like conversation because then I don’t have to learn anything. I can just talk like I normally talk and don’t need a user manual or any special commands. On the other hand, I don’t want to say, ‘Alexa, please tell me the weather in San Francisco today.’ I want to say, ‘Alexa, weather,’ because it’s just a computer—a shortcut. In the end, it should work in both ways.”

These shortcuts constitute one of the first evolutions in how we communicate with our digital assistants, and, in a similar way to the unexpected explosion of emoji in chat messages, perhaps they’re precursors to a lengthy glossary of shorthand commands that will become commonplace. Or, alternatively, as devices become more intelligent, users might develop their own, individual, stylized shorthand.

“Alexa, tell me it’s going to be sunny today.”

“Sorry, Michael, you’re going to have to take your coat again.”

I wonder if this kind of communication could have repercussions on the way humans communicate with one another. Cathy confirms there’s been some commotion in the media about this:

“I know there’s been a bit of a scare about that: ‘Oh, my kid orders around Amazon and I’m afraid it’s going to teach my kid to be rude’—I don’t really worry about that. Kids can easily distinguish between talking to a device and a human. But I did see a more recent article that said it wasn’t so much about the person learning to speak rudely, but rather the way they come across to the rest of the world. Maybe, if I hear you being rude to your devices, I’ll think that’s not a good way to be.”

Or, maybe, we simply won’t hear people talking to their devices at all—unless we’re in their home with them, that is.

A number of studies have demonstrated users’ aversion to speaking to their phone in public. While over 50% of people surveyed would use their voice assistant when at home or in the car, fewer than 25% would use it in a public place, whether that be on public transportation, at a party, or at the gym. 40% also reported being annoyed by other people using voice assistants in public, so our reticence isn’t entirely without foundation.

There are some notable demographic differences, though: the younger, the wealthier, and the more masculine you are, the more likely you are to use voice in public, and Cathy recognizes signs of greater acceptance into the mainstream, from Super Bowl ads to Saturday Night Live spoofs.

She admits, however, that the use of voice in public will continue to be an issue until a subvocalization technology is developed (apparently, NASA is already working on this) so that one doesn’t have to speak aloud. She also points out that our working environments, where open-plan offices are still the norm, are not equipped for twenty people talking to their devices simultaneously.

So, aside from ever-increasing adoption rates and subvocal communication, what does 2018 have in store for voice designers? “My hope is that we’ll see more actual conversations and smarter systems of context recognition, which is something very built into us as humans, so it’s very frustrating when our systems don’t understand it. 2018 will be about smarter interactions with our voice system that go beyond one turn.” Voice designers will be instrumental in driving this development:

“Amazon Echo has over 25,000 skills, and a lot of people, especially developers, are making skills, but the question is: how are you going to make people use your skill? And, in my mind, that’s where the VUI designers come in, because there are a lot of skills out there that may work if you say exactly the right thing in exactly the right sequence, but they’re not very satisfying experiences. A designer can help with that and really make a great experience for the user, making the user want to come back.”

The world needs VUI designers, then, or else we’ll be consigned to awkward conversations with circuit boards for the foreseeable future. Taking a step away from the abstractions of machine-human interactions and 2018 predictions, I ask what it’s like to be a VUI designer. How does a typical day pan out?

“I don’t know if there’s ever a typical day, but common things I do include looking at anonymized logs of what people are saying to know where we’re getting it wrong and how we can improve. I also spend time speaking with our customers—the companies we interact with, to understand their needs for their patients—as well as to the patients themselves, asking what they like and don’t like, and I work with visual designers to help design and improve our mobile app.”

I remark that it seems quite similar to the tasks of a UX designer: “It’s just another design system—with its own quirks and specialities, of course—but so many design principles apply here.”

And (aside from taking the CareerFoundry course, obviously), how should one go about entering the field of VUI design?

“My biggest recommendation is to try it. What do I mean by that? Do something called sample dialogues, which is like writing a movie script. Things like, ‘Here’s what the system would say, here’s what the user would say, here’s what the system would say…,’ and you play around, and then you use a prototyping tool like Sayspring, Storyline, or PullString, all free tools that allow you to create real voice conversations where you can actually speak and try them out. You don’t have to be a coder, and you’ll even surprise yourself. More formally, you can study conversation theory, from H. P. Grice, for example, who came up with theorems of human communication back in the 70s. He introduces a lot of great concepts.”
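For readers who do code, a sample dialogue can also be kept as plain data before it ever reaches a prototyping tool. The following is a minimal Python sketch, not a feature of any of the tools Cathy mentions: the `Turn` class and the `format_script` and `repeated_speakers` helpers are hypothetical names, and the dialogue lines are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # "SYSTEM" or "USER"
    line: str     # what that speaker says


def format_script(turns):
    """Render the dialogue like a movie script, one line per turn."""
    return "\n".join(f"{t.speaker}: {t.line}" for t in turns)


def repeated_speakers(turns):
    """Return the positions where the same speaker talks twice in a row."""
    return [i for i in range(1, len(turns))
            if turns[i].speaker == turns[i - 1].speaker]


# An invented sample dialogue for a weather request.
dialogue = [
    Turn("USER", "Weather."),
    Turn("SYSTEM", "For which city?"),
    Turn("USER", "San Francisco."),
    Turn("SYSTEM", "It's sunny in San Francisco today."),
]

print(format_script(dialogue))
print(repeated_speakers(dialogue))  # an empty list means the turns alternate
```

Reading the printed script aloud with a colleague, one person playing the system and the other the user, is a quick way to hear whether the exchange sounds natural.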

Cathy divides the history of voice into two eras: the cost-cutting era of IVR (interactive voice response), where automated responders would stand between humans and what they wanted, and the new era, where automated responders carry out value-adding tasks that humans don’t want to do themselves, whether that be ordering from a shop or selecting a track to play.

Having been burnt out by IVR due to its ostensible lack of positive impact on the user, she’s now buoyed by the new technology and the opportunities and benefits it offers users. Talking about Sensely, she says, “We’re using the technology to actually help people—people who might have difficult, chronic conditions, who need assistance every day, and I’m excited about it. Our users are older and not particularly tech-savvy, but they don’t need a two-hour course on how to use the app. They just speak, and I think that’s wonderful.”

There’s no doubt that VUI design is a burgeoning field that represents the next frontier in human-device interaction. If you’d like to learn more about the field, check out Cathy’s book, Designing Voice User Interfaces, and CareerFoundry’s VUI design course.
