To explore alternatives to the way we currently communicate with voice assistants, I took a four-step approach.
Smart home assistants like Alexa or Google Assistant are appearing all over the world. Their functionality and integrations are often localised, but even so, users in the global South have a disproportionately worse experience with them.
These devices are not perfect and reflect the nature and biases of their human makers. One solution is to provide endless amounts of unbiased training data that is localised and relevant - a daunting task. Could there be another way?
A quick comparison of onboarding for existing voice assistants versus a conventional screen-based app reveals a heavy reliance on the idea that conversation is natural to people, whereas screen interfaces are seen as more alien, or potentially more complex, and therefore require explanation.
However, as Google's Conversational Design documentation explains, this way of speaking is already unnatural. Given only limited guidance, users will almost certainly try to talk to a voice assistant the way they talk to the only other things in the world they converse with: other humans.
When voice assistants are held to these high standards and fail repeatedly, users lose trust that these devices are reliable. How can they be expected to purchase shoes from a device that has trouble turning the lights on?
Voice assistants can be 'confidently wrong' about the answers they provide, and offer no follow-up explanation. This can leave users baffled at how something they consider 'so obvious' is beyond the understanding of this intelligent device. Or, perhaps more commonly, the answer is simply 'I didn't get that'.
The perceived stubbornness and curtness can fuel a user's (often verbal) aggression towards the voice assistant.
Voice assistants encourage a 'learn-by-doing' approach to their use, via examples of brief commands you can give when you speak to them, or on their companion apps.
However, this puts the onus on the user to keep posing the correct questions and wait for the answer - it can be fun, but it isn't necessarily efficient. Voice is a low-bandwidth medium for delivering information, so the utility of these assistants depends on the curiosity and persistence of the user.
For example, you can tell Siri how to pronounce a name in your Contacts - something I had attempted and failed to do multiple times. An interviewee showed me how to do it, and the phrasing was specific and non-contextual. I couldn't access these settings from my Contacts or from a menu. It didn't appear to 'remember' how I pronounced it. The feature effectively didn't exist for me.
Discoverability and not dead-ending the user are common conventions for screen-based apps, so why are we not applying them to voice interfaces?