To explore alternatives to the way we currently communicate with voice assistants, I took a four-step approach.

  • I identified groups of pain points users had when interacting with voice assistants.
  • I built conversational and screen prototypes to test ideas drawn from my research and interventions, and to address those pain points.
  • I co-created with users and iterated the interactions based on their feedback.
  • After arriving at an acceptable direction, I ran the proposals by a Machine Learning expert to check whether these solutions actually fit with how ML and Natural Language Processing are used in voice assistants today.

Pain Points

#1 Misunderstanding Names/Accents

User Quotes

"It turned Koin to Colin or Karen, no matter how many times I corrected it…"

"I changed the name of my contacts from my parents’ Indian names to My Mom and My Dad so that I could call them using Siri while driving”

"Siri never works for me...maybe it's my shitty German accent"


Negative experiences like these decrease usage, even when the perceived utility of voice assistants makes people want to use them

Users feel like they're wrong for being themselves, or that this technology is not made for them

Smart home assistants like Alexa or Google Assistant are starting to appear all over the world. Their functionalities and integrations are often localised, but even so, those in the Global South have a disproportionately worse experience using them.

These devices are not perfect and reflect the nature and biases of their human makers. One solution is to provide endless amounts of unbiased training data that is localised and relevant - a daunting task. Could there be another way?

#2 Bad First-Time Experience

User Quotes

“Turns off a first impression when it says sorry, I don't understand”

“Initially I was really curious, I wanted to test it just to see… I asked it a bunch of questions and realised it was just a dumb robot”

"There’s no 'hover state’ for voice assistants”


When you don’t set expectations for voice assistants, people start out having very high ones. It's only downhill from there.

It's hard to form a thorough mental model through voice

A quick comparison of onboarding for existing voice assistants versus a conventional screen-based app reveals a heavy reliance on the assumption that conversation is natural to people, while screen interfaces are more alien, or potentially more complex, and therefore need explanation.

However, as Google's Conversation Design guidelines explain, this way of speaking is already unnatural. Given only limited guidance, users will almost certainly talk to the device the way they talk to the only other things in the world they converse with: other humans.

When voice assistants are held to these high standards and fail repeatedly, users lose trust that these devices are reliable. How can they be expected to purchase shoes from a device that has trouble turning the lights on?

#3 Doesn’t Admit or Acknowledge Shortcomings

User Quotes

"It won't respond to be quiet, only to shut up”

"Sometimes it'll misunderstand and read out a long wikipedia article and I have to yell at it to stop, it’s so annoying”

“I guess it’s learning all the time in the background, but I can't tell”


Lack of explanation as to what happened leads to anger and frustration

Modern voice assistants are designed to give an answer, even if it's incorrect, and won't explain their reasoning or logic. A lot is hidden within the AI 'Black Box'.

Voice assistants can be 'confidently wrong' about the answers they provide, and don't follow up with an explanation. This can leave users baffled at how something they consider 'so obvious' is beyond the understanding of this intelligent device. Or, perhaps more commonly, the answer is 'I didn't get that'.
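One alternative to being 'confidently wrong' is to let the assistant's response strategy depend on how sure it is. The sketch below is hypothetical (the thresholds, intent names, and function are my own illustration, not any vendor's actual pipeline), but it shows the basic idea: answer when confident, ask a clarifying question when unsure, and admit defeat plainly when lost.

```python
# Hypothetical sketch: choosing a response strategy from classifier
# confidence, instead of always committing to the top guess.
# The thresholds and intent strings are illustrative assumptions.

HIGH_CONFIDENCE = 0.85  # above this: act on the intent
LOW_CONFIDENCE = 0.40   # below this: admit we didn't understand

def respond(intent: str, confidence: float) -> str:
    """Pick a response based on how sure the intent classifier is."""
    if confidence >= HIGH_CONFIDENCE:
        return f"Okay, {intent}."
    if confidence >= LOW_CONFIDENCE:
        # Middle ground: surface the uncertainty and ask,
        # rather than guessing and reading out the wrong answer.
        return f"I think you mean '{intent}', but I'm not sure. Is that right?"
    # Below the floor: say so plainly instead of answering anyway.
    return "I didn't understand that. Could you rephrase?"
```

For example, `respond("turn on the lights", 0.55)` produces a clarifying question rather than a confident wrong action; the middle band is exactly the 'I'm not sure' acknowledgement that the interviewees found missing.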

The perceived stubbornness and curtness can fuel a user's (often verbal) aggression towards the voice assistant.

#4 Discoverability of Full Breadth of Function

User Quotes

“Doesn't that already happen?”

“Oh I didn't know you could do that”

“Oh yeah now Alexa says it can tell me what happened in the last update”


Discovery is limited by users' imagination, and confined to a very specific context and means of inquiry

This isn't acceptable for screen interfaces, so why is it okay for voice?

Voice assistants encourage a 'learn-by-doing' approach to their use, via examples of brief commands you can give when you speak to them, or on their companion apps.

However, this puts the onus on the user to keep posing the correct questions and waiting for answers - it can be fun, but it isn't efficient. Voice is a low-bandwidth channel for delivering information, so the utility of these assistants depends on the curiosity and persistence of the user.

For example, you can tell Siri how to pronounce a name in your Contacts - something I had attempted and failed to do multiple times. An interviewee showed me how to do it, and the phrasing was specific and non-contextual. I couldn't access these settings from my Contacts or from a menu. It didn't appear to 'remember' how I pronounced it. The feature effectively didn't exist for me.

Discoverability and not dead-ending the user are common conventions for screen-based apps, so why are we not applying them to voice interfaces?