This story was first published at The Aleph Report. If you want to read the latest reports, please subscribe to our newsletter and our Twitter.
During this past year, we’ve seen significant shifts in where and how people consume information. One of the most important is the rise of voice interfaces and voice platforms like Alexa.
Thanks to Amazon’s Alexa, voice-controlled devices and voice-only applications are becoming more and more ubiquitous.
At first, these intelligent assistants only lived in our phones. It took a while to get people to use them, but used they were.
Every single major technology company is developing its own voice-enabled personal assistant, from Microsoft’s Cortana, Google’s Google Assistant and Apple’s Siri to Yandex’s Alice and Baidu’s Duer.
Going beyond the smartphone
But usage on the phone was still limited. Voice commands are useful only under certain circumstances. As happened with the infamous Google Glass, voice interfaces make the most sense when your hands aren’t free.
So, using the phone while trying to liberate your hands isn’t the best-case scenario. What if we could free your hands by having a dedicated microphone device that would take care of that? And what if we placed that device in an environment where we know you’ll have your hands busy and need to multitask?
Say hello to Amazon’s smart invention, the Echo. This opened the gates to the Google Home, Apple’s HomePod and Baidu’s Xiaoyu (Little Fish). All this in less than a year.
Not happy with putting far-field microphones (seven, to be precise) in our homes, Amazon just partnered up with Intel. They are offering an Alexa development kit for third-party devices. Say hello to the Alexafication of your home!
Where do voice interfaces make sense?
While voice interfaces seem trendy, keep in mind they won’t fit every product. Some environments are better suited than others.
Home automation is the perfect environment to deploy voice commands. Many of the daily operations we do in our homes find us with something in our hands. Switching the lights, playing music, getting the news, you name it.
Our incessant need to multitask everything is driving much of this behavior. We want to be able to do more in less time. We want to be able to get the shower running while in bed. To listen to the news while we’re showering. To get the coffee ready, while we’re dressing up.
We’re going to see an explosion of devices that will support one of the major voice platforms. Be ready for a voice-driven oven!
Lifestyle & sports
Sports or lifestyle activities like running or biking are another space that will benefit from voice-only apps. It’s another situation where our hands are busy, and we might need an extra one.
As much as I despised the original Apple Watch, Apple finally nailed it with the Series 3. Among other improvements, they managed to remove the dependency on the iPhone (kudos!) and made it a stand-alone device.
The implications of this might not be obvious, but no iPhone means no text-based input interface. It means that the Apple Watch will need to depend on voice commands and Siri. And Apple isn’t alone. Samsung is already deploying their Voice Assistant, Bixby.
Another obvious choice is transportation. Driving a car will soon (I hope) be a thing of the past. Meanwhile, we still need to keep our hands on the wheel. This fact makes voice commands perfect for the new paradigm.
No wonder BMW, Ford, Nissan and even Garmin are deploying Alexa-powered devices.
We could generalize to other scenarios. We could substitute any task that a human currently does with their voice.
For example, we could remove waiters and have devices take your orders at a restaurant. The same can be said of any shop, like BestBuy.
Robotics is another field that’s going to thrive thanks to this technology. Imagine we could train robots with our voice. Ask for advice, help, etc.
Customer service and support will also receive an upgrade. We’ll finally be able to kill those inefficient dial-3-to-get-someone-that-knows-shit-in-this-company systems.
A field that is already using something similar is the healthcare industry, more precisely, surgeons in the operating rooms. Several interesting solutions employ Augmented Reality. What if we could add voice-control on top of that?
“Alexa, cross-reference this photo I’m taking of the liver with similar infections in the database.”
What’s the next frontier?
One of the most exciting spaces for voice interfaces is going to be Virtual Reality (VR) and Augmented Reality (AR). (Or Mixed Reality, if you work at Microsoft.)
Typing anything in VR/AR is painful. Either you use one of the available controllers, which isn’t fun, or you type in the air with your hands and pray the depth sensor picks it up.
Oculus already deployed something called Oculus Voice. But, as expected, it lags behind what Alexa, Google or Siri can do. It’s rather limited to a fixed set of commands (four so far).
I expect this field to expand and provide more powerful experiences for voice commands.
This space is growing, and growing fast. Still, there are significant challenges. For starters, consumer retention is a struggle.
“When a voice application acquires a user, there is only a 3% chance that user will be active in the second week. Outliers are yielding greater than 20% second-week retention.”
Also, monetization isn’t there yet. Most applications are using voice interfaces as just that: another gateway to access their service. Purely voice-only apps aren’t making money, and this is in part due to the lack of payment options within the current platforms.
Another major challenge is the design of voice interfaces. When it comes to a voice-only application, each word matters. Asking the right questions will help nail the user’s intention and avoid embarrassing ambiguities.
“Pay attention and design with your eyes closed.”
Designers will need to rethink how to display things like search results. On a voice interface, you can’t read out all the results. This will force designers to deploy some neat dialog tree-driven interactions. This facet alone will open the doors to completely new experiences and a push towards more personalized results.
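To make the dialog-tree idea concrete, here is a minimal sketch of how a voice app might read search results a few at a time instead of all at once. Everything in it is an assumption for illustration: the ResultsDialog class, the page size of three and the exact wording of the prompts are hypothetical and don’t come from any particular voice platform.

```python
from dataclasses import dataclass


@dataclass
class ResultsDialog:
    """Reads search results aloud a few at a time (hypothetical helper)."""
    results: list
    page_size: int = 3  # assumption: a listener can hold about three options in mind
    offset: int = 0

    def prompt(self) -> str:
        """Build the spoken prompt for the current page of results."""
        page = self.results[self.offset:self.offset + self.page_size]
        if not page:
            return "That's everything I found."
        spoken = ", ".join(f"{i}: {title}" for i, title in enumerate(page, start=1))
        remaining = len(self.results) - (self.offset + len(page))
        follow_up = " Say a number, or say 'more'." if remaining > 0 else " Say a number."
        return f"I found {spoken}.{follow_up}"

    def handle(self, utterance: str) -> str:
        """Move through the dialog tree based on what the user said."""
        word = utterance.strip().lower()
        if word == "more":
            self.offset += self.page_size
            return self.prompt()
        if word.isdecimal():
            index = self.offset + int(word) - 1
            page_end = min(self.offset + self.page_size, len(self.results))
            if self.offset <= index < page_end:
                return f"Opening {self.results[index]}."
        # Anything else: apologize and repeat the current page.
        return "Sorry, I didn't catch that. " + self.prompt()
```

A session would start with dialog.prompt() and then feed each recognized utterance into dialog.handle(). The point of the design is that the user only ever hears a short page plus a clear next step.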
Security is another big concern. Voice platforms store your credentials to other applications. Voice authentication doesn’t rank high on most makers’ usability scale, so few enable it. This means that right now, anyone within earshot can trigger your system. You will detect some of these attempts, but others will be stealthier. Doing access control with voice-enabled apps and not securing the authentication is going to spell disaster. Watch and see.
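As a rough illustration of what securing the authentication could look like, the sketch below only lets sensitive requests through after a spoken PIN check. The intent names and the authorize function are hypothetical, and a spoken PIN is itself a weak factor since it can be overheard, so treat this as a minimal sketch of the gating idea, not a recommended design.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical intent names that should never run without extra verification.
SENSITIVE_INTENTS = {"unlock_door", "make_purchase"}


def hash_pin(pin: str, salt: bytes) -> bytes:
    """Derive a hash of the PIN so the PIN itself is never stored."""
    return hashlib.pbkdf2_hmac("sha256", pin.encode("utf-8"), salt, 100_000)


def authorize(intent: str, spoken_pin: Optional[str], salt: bytes, stored_hash: bytes) -> bool:
    """Allow harmless intents freely; require a matching PIN for sensitive ones."""
    if intent not in SENSITIVE_INTENTS:
        return True
    if spoken_pin is None:
        return False
    # Constant-time comparison avoids leaking how much of the hash matched.
    return hmac.compare_digest(hash_pin(spoken_pin, salt), stored_hash)
```

With this gate in place, "play some music" still works for anyone in the room, while "unlock the door" is refused unless the caller also supplies the right PIN.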
Getting the design of a voice-only application right isn’t easy. We’ll see the rise of the VX designers, and it will take time, analytics and expertise to nail the details.
The industry is currently virgin territory for anyone to take. This is especially true of specific categories, which aren’t seeing much competition. If an application gets enough traction, it will become one of those “editor’s choice” picks and aggregate traffic in the long term.
As a final thought, I would like to point out how everyone that’s integrating voice-based interfaces is doing it with Alexa. Amazon’s bet is panning out, and they’re beating Apple and Google at their own game. Meanwhile, Microsoft is taking Cortana’s ecosystem very seriously and it’s also speeding up.
Building voice platforms like Alexa or Cortana is mighty hard. On the one hand, you need specific hardware (far-field microphones and specific chipsets). On top of that, you need an Automatic Speech Recognition (ASR) system, the corresponding Natural Language Processing (NLP) layer and an ecosystem to build on.
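The layers listed above can be pictured as a pipeline: audio goes into ASR, the transcript goes into a natural-language step that extracts an intent, and the intent is dispatched to something in the ecosystem. The sketch below is a toy version under heavy assumptions: recognize_speech just decodes bytes instead of running a real speech model, the intents are matched with regular expressions, and all names are hypothetical.

```python
import re
from typing import Callable, Dict, Tuple


def recognize_speech(audio: bytes) -> str:
    """Stand-in for ASR: a real engine would decode audio with trained models."""
    return audio.decode("utf-8")  # assumption: the "audio" is already a transcript


# Toy natural-language step: map a transcript to an intent name plus slots.
INTENT_PATTERNS = {
    "lights": re.compile(r"turn (?P<state>on|off) the (?P<room>\w+) lights?"),
    "music": re.compile(r"play (?P<artist>.+)"),
}


def parse_intent(transcript: str) -> Tuple[str, Dict[str, str]]:
    text = transcript.lower().strip()
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            return intent, match.groupdict()
    return "unknown", {}


# The "ecosystem": third-party handlers registered against intents.
SKILLS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "lights": lambda slots: f"Turning {slots['state']} the {slots['room']} lights.",
    "music": lambda slots: f"Playing {slots['artist']}.",
}


def handle_utterance(audio: bytes) -> str:
    """Run one utterance through ASR, intent parsing and skill dispatch."""
    transcript = recognize_speech(audio)
    intent, slots = parse_intent(transcript)
    skill = SKILLS.get(intent)
    return skill(slots) if skill else "Sorry, I can't help with that yet."
```

Each stage here is trivial, which is exactly the point: in a real platform, the first two stages are where the hardware, the Deep Learning expertise and the voice data come in.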
The fact that Amazon, Baidu or Google are so far into the game should worry everyone.
Building one of these platforms requires Deep Learning expertise and plenty of voice data. Who has access to such data? The big Internet guys. The one exception could be Samsung, which owns the hardware through its smartphone division and has the team of former Siri makers.
A good example is Oculus. Oculus Voice is so limited because it’s not a priority, mainly because they don’t have extensive voice data. They can, and most probably will, pull a “Facebook.” Don’t be surprised if Facebook starts using their WhatsApp voice data to feed an intelligent assistant. This assistant can then be deployed across different Facebook properties like Oculus, Messenger or TBH.
Anyone else will have a hard time competing. Owning the voice input hardware (e.g., an Echo or a smartphone) and developing your own ASR + NLP system will be critical for other organizations that want to move into this space. The rest will have to play by the big guys’ rules.