A key concept at TD Reply is that we consider technology to be an enabler, not a driver. In our previous post – “Alexa, what are my KPIs?” – we outlined three use cases: a morning briefing, KPI feedback during meetings, and push notifications for KPI alerts. If we look at these cases again, there is one noticeable commonality: while all of them are voice-enabled, they are really all about seamless access to key data.
Consider a best-case scenario, which may become reality in 5 to 10 years. Ideally, all necessary data would be presented to you without any need for interaction at all. Imagine sensors and analytics smart enough to anticipate when you require a specific piece of information. There would be no need to utter a question or type anything on your laptop. Zero UI.
While this is not yet possible to the extent described, voice technologies offer support when you are otherwise occupied, e.g. when you have your hands full or are under a high cognitive load. To deliver this kind of integrated assistance in a business context, in the form of a prototype, we reviewed the currently available ecosystems and vendors for voice assistants.
Voice Assistant Landscape
Major players in the field of voice assistants are Amazon, Google, Microsoft, and Apple. Additionally, voice assistants have been launched or announced by HTC (Sense Companion), Nokia (Viki), Orange/Deutsche Telekom (Djingo), Samsung (Bixby/Vega), and Xiaomi (Mi Brain). None of these competitors beyond the Big Four provided hardware or an ecosystem allowing us to implement our use cases.
Apple’s Siri is a seasoned contender in the field, arguably enabling the current rise of voice assistants due to its prominence and availability. Running on iOS devices, macOS, Apple TV and the recently announced HomePod, it primarily focuses on personal assistance. Siri is at its best when creating appointments, sending messages, initiating phone calls, playing music, or starting searches, not to forget its integration with HomeKit to control your smart home. However, developing voice-based services like those envisioned in our use cases is not possible for third parties within Apple’s environment, effectively eliminating this option.
Microsoft has brought Cortana to Xbox and more than 500 million Windows PCs, resulting in over 100 million Cortana users. Although harman/kardon announced their Invoke for Fall 2017, which reportedly contains Cortana as a virtual assistant, there is no dedicated hardware from Microsoft. This means that while Microsoft does offer a high reach and an ecosystem for voice-based services, it would not support our use cases, as all interaction touch points are on systems that are either less relevant in a business context (Xbox) or already covered by our Pulse dashboard (Windows PC).
Amazon and Google both offer dedicated hardware and ecosystems for voice services. Amazon’s hardware lineup includes Echo, Echo Dot, Echo Show, Echo Look, Amazon Tap, Fire TV, phones and tablets – in Germany, the UK and the US. Through the Alexa Voice Service (AVS), it is also possible for any hardware vendor to integrate Alexa into their physical products.
Alexa and its Echo devices were launched in the US in November 2014 and came to Germany two years later. During CES 2017, Alexa received a lot of positive press thanks to a number of cooperation and integration partners. An estimated 11 million units have been sold to US households so far, and home automation, e.g. turning lights on and off, is one of the main drivers for adoption.
Google offers Google Home as a device, which is available in the US, UK, and Canada, and is planned to be released in Germany some time in 2017. Both Amazon and Google offer an ecosystem for voice services: Amazon Alexa and Google’s API.AI.
We chose Amazon Alexa due to hardware availability in Germany, the UK and the US, though we did extend our prototype to API.AI at a later stage as well. Other developers share our sentiment: according to a recent study, Amazon had over 15,000 live skills in June 2017, whereas Google Assistant had fewer than 400 and Cortana fewer than 70 apps.
How a voice service works
Such a prototype – a custom voice service developed by TD Reply – is called a “skill” for Alexa, or an “action” for Google Home. Both are analogous to a native application on a mobile phone, and both need to be activated by the user. The main difference is that no native code is downloaded, installed, or run on the actual Echo and Home devices.
On a high level, processing a user’s request is identical in both ecosystems. To retrieve a KPI through a skill/action, the following steps are needed:
- Users utter a request for information. This needs to be a structured query, triggering a nearby virtual voice assistant – be it Alexa or Google Home – and instructing it to perform some task or retrieve data. For example, a user might say: “Alexa, ask dashboard for my main KPIs in Germany”, or “OK Google, what will be the weather in Berlin tomorrow?”
- After being triggered, the assistant transmits a user’s audio stream to a service platform (Amazon Alexa or Google API.AI). Here, spoken words are transcribed into text with speech-to-text, which is then analyzed using Natural Language Processing.
- As a result, the service platform has determined the intent of the user: which service (skill/action) to query, and what this query should look like. Service platforms can handle only a fixed set of built-in tasks themselves, like setting an alarm, providing weather information, or adding items to a list. With third-party services, such as our prototype Pulse integration, a voice assistant’s feature range is extended by adding custom implementations. For these, a fulfillment layer is needed, where custom business logic resides.
- The service platform now triggers this fulfillment layer assigned to the skill/action addressed by a user, sending the detected intent and additional parameters from the utterance to a backend service. For example, the first sample utterance in the first step indicates that a skill called “dashboard” should be started, with an intent to query “main KPIs”, and a market, “Germany”, as a parameter.
- Our business logic then collects all needed data, formats it accordingly, and returns a response to the service platform.
- The service platform, in turn, sends this response to the voice assistant, which speaks it back to the user via text-to-speech, potentially engaging in a dialog with follow-up questions.
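The hand-over between the service platform and the fulfillment layer in the steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the platforms' actual data model: `IntentRequest`, `fulfill`, the intent name `MainKpis`, and the KPI figures are hypothetical placeholders. Only the skill name, the intent, and the market are taken from the sample utterance.

```python
from dataclasses import dataclass, field


@dataclass
class IntentRequest:
    """What the service platform hands to the fulfillment layer (hypothetical shape)."""
    skill: str
    intent: str
    parameters: dict = field(default_factory=dict)


# Placeholder figures for illustration only; a real backend would query the dashboard.
SAMPLE_KPIS = {
    "Germany": {"revenue": "1.2 million euros", "conversion rate": "3.4 percent"},
}


def fulfill(request: IntentRequest) -> str:
    """Business logic: collect the data and format it as a sentence to be spoken."""
    if request.intent != "MainKpis":
        return "Sorry, I cannot help with that yet."
    market = request.parameters.get("market", "")
    kpis = SAMPLE_KPIS.get(market)
    if kpis is None:
        return f"I have no data for {market or 'that market'}."
    parts = [f"{name} is {value}" for name, value in kpis.items()]
    return f"In {market}, " + " and ".join(parts) + "."


# Assumed result of speech-to-text and NLP for
# "Alexa, ask dashboard for my main KPIs in Germany".
request = IntentRequest(skill="dashboard", intent="MainKpis",
                        parameters={"market": "Germany"})
speech_text = fulfill(request)  # handed back to the platform for text-to-speech
```

The fulfillment function only ever sees structured data. Audio capture, transcription, and intent detection all happen before it is called, which is why no native code needs to run on the device itself.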
Let’s take a closer look at an utterance which can start it all.
Anatomy of an utterance
“Alexa, ask dashboard for main KPIs in Germany”
Here, “Alexa” is a wake word that prompts an Echo to transmit a voice stream to Amazon for processing. “Ask” is one of the possible launch words, and “dashboard” the invocation name of our skill/action. “For” is simply a connecting word to make a sentence sound natural. Finally, “main KPIs in Germany” is the utterance containing a user’s intent and potentially a number of parameters.
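To make these parts tangible, here is a small sketch that splits an already transcribed request into wake word, launch word, invocation name, and utterance. It is only an illustration: the real platforms work on audio and use far richer language models, and the word lists and the `split_request` helper below are assumptions made for this example.

```python
import re
from typing import NamedTuple, Optional, Sequence


class SpokenRequest(NamedTuple):
    wake_word: str
    launch_word: str
    invocation_name: str
    utterance: str


# Assumed, deliberately small vocabularies for illustration.
WAKE_WORDS = ("alexa",)
LAUNCH_WORDS = ("ask", "tell", "open")
CONNECTING_WORDS = ("for", "to", "about")


def _alternatives(words: Sequence[str]) -> str:
    """Build a regex alternation such as 'ask|tell|open'."""
    return "|".join(re.escape(word) for word in words)


def split_request(spoken: str, invocation_names: Sequence[str]) -> Optional[SpokenRequest]:
    """Split a transcribed request into its parts, or return None if it does not fit."""
    pattern = re.compile(
        r"^(?P<wake>" + _alternatives(WAKE_WORDS) + r"),?\s+"
        r"(?P<launch>" + _alternatives(LAUNCH_WORDS) + r")\s+"
        r"(?P<name>" + _alternatives(invocation_names) + r")"
        # The connecting word is optional and is not part of the utterance.
        r"(?:\s+(?:" + _alternatives(CONNECTING_WORDS) + r")\b)?\s+"
        r"(?P<rest>.+)$",
        re.IGNORECASE,
    )
    match = pattern.match(spoken.strip())
    if match is None:
        return None
    return SpokenRequest(match["wake"], match["launch"], match["name"], match["rest"])


parsed = split_request("Alexa, ask dashboard for main KPIs in Germany", ["dashboard"])
```

Here `parsed.invocation_name` selects the skill, while `parsed.utterance` is the part that still has to be mapped to an intent and its parameters.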
Users can express the same intent in many ways, with different utterances. Consider a confirmation at the end of a purchase. Users can simply say “OK”, but also “sure”, “sounds good”, “proceed”, or many more phrases. Mapping an utterance to an intent is the primary task of a voice service.
This is a sample screen of Amazon Alexa, where sample utterances – in the middle – are mapped to intents on the left side of the screen. Sample utterances feed into a machine learning algorithm: the more samples are provided, the better the recognition will be. Amazon recommends around 30 samples per intent, though utterances for different intents should be sufficiently distinct. Through machine learning, additional utterances are recognized even if a user does not speak the exact words defined but, for example, a partial combination of two or more utterances. This fuzziness enables a service to be called with a hopefully correct intent – or, to be precise, with the statistically most likely intent.
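The idea of picking the most likely intent rather than requiring an exact match can be illustrated with a toy example. Note that this is not Amazon's algorithm: plain string similarity from Python's standard library stands in for the machine learning model, and the intent names and sample utterances are made up for this sketch.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple

# Hypothetical intents with a few sample utterances each.
SAMPLE_UTTERANCES = {
    "ConfirmIntent": ["ok", "sure", "sounds good", "proceed"],
    "CancelIntent": ["cancel", "stop that", "never mind"],
}


def most_likely_intent(spoken: str) -> Tuple[Optional[str], float]:
    """Return the intent whose closest sample utterance is most similar to the input."""
    text = spoken.lower().strip()
    best_intent, best_score = None, 0.0
    for intent, samples in SAMPLE_UTTERANCES.items():
        for sample in samples:
            # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
            score = SequenceMatcher(None, text, sample).ratio()
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent, best_score


intent, score = most_likely_intent("sounds good to me")
```

Even though “sounds good to me” is not among the samples, it lands on the confirmation intent because it is closest to “sounds good”. The sketch also shows why samples for different intents should be distinct: overlapping samples would yield similar scores for competing intents.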
Utterances also allow slots for parameters (KPI, MARKET in our screenshot), with each slot having one or more phrases assigned as possible values. If a matching word is spoken at this position in a sentence, it is extracted and sent to the fulfillment layer for further processing.
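Slot extraction can be pictured as template matching. The following sketch assumes a hypothetical template syntax with `{KPI}` and `{MARKET}` placeholders and made-up slot values. It turns a sample utterance into a regular expression and returns the spoken slot values, roughly what would be sent on to the fulfillment layer.

```python
import re
from typing import Dict, Optional

# Hypothetical slot definitions: each slot has a list of accepted phrases.
SLOT_VALUES = {
    "KPI": ["revenue", "conversion rate", "main KPIs"],
    "MARKET": ["Germany", "UK", "US"],
}


def compile_template(template: str):
    """Turn '{KPI} in {MARKET}' into a regex with one named group per slot."""
    # With one capture group, re.split alternates literal text and slot names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)  # literal text between slots
        else:
            # Try longer phrases first so the most specific value wins.
            values = sorted(SLOT_VALUES[part], key=len, reverse=True)
            choices = "|".join(re.escape(value) for value in values)
            pattern += "(?P<" + part + ">" + choices + ")"
    return re.compile("^" + pattern + "$", re.IGNORECASE)


def extract_slots(template: str, spoken: str) -> Optional[Dict[str, str]]:
    """Return the slot values found in the spoken text, or None if it does not match."""
    match = compile_template(template).match(spoken.strip())
    return match.groupdict() if match else None


slots = extract_slots("{KPI} in {MARKET}", "main KPIs in Germany")
```

A phrase that is not listed for a slot, such as an unknown market, makes the template fail to match in this sketch. A production system would typically be more tolerant than that.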
Each fulfillment layer handles a service’s business logic. It can run on any server, in any programming language. All it needs to provide is an API which accepts queries from a voice service and returns data in a defined data structure. Naturally, Amazon offers a tight integration with its own function-as-a-service offering, Lambda. For our Pulse prototype we leveraged Lambda to run our code. By implementing open APIs, we extended this backend to respond to API.AI queries as well as to Amazon Alexa. This allows us to serve not only Alexa but also Google Home and several chatbots (Slack, Telegram, Skype, Facebook Messenger, etc.) with the same code.
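A fulfillment layer that serves both platforms from one code base might look roughly like the sketch below. It is not our production code. The intent name `MainKpis`, the `answer` function, and the reply text are placeholders, and the request and response fields reflect our reading of the Alexa custom skill JSON format and the API.AI (v1) webhook format. It also assumes the JSON body arrives directly as the Lambda `event`, which depends on how the endpoint is wired up.

```python
def answer(intent: str, params: dict) -> str:
    """Platform-neutral business logic (placeholder text instead of real KPI data)."""
    if intent == "MainKpis":
        market = params.get("market", "all markets")
        return f"Here are the main KPIs for {market}."
    return "Sorry, I did not understand that."


def lambda_handler(event: dict, context=None) -> dict:
    """Detect which platform is calling, run the shared logic, and reply in its format."""
    if "request" in event:  # Alexa-style envelope
        intent = event["request"].get("intent", {})
        slots = intent.get("slots") or {}
        params = {name.lower(): slot["value"]
                  for name, slot in slots.items() if slot.get("value")}
        text = answer(intent.get("name", ""), params)
        return {
            "version": "1.0",
            "response": {
                "outputSpeech": {"type": "PlainText", "text": text},
                "shouldEndSession": True,
            },
        }
    if "result" in event:  # API.AI-style webhook body
        result = event["result"]
        params = {key.lower(): value
                  for key, value in result.get("parameters", {}).items() if value}
        text = answer(result.get("metadata", {}).get("intentName", ""), params)
        return {"speech": text, "displayText": text, "source": "pulse-prototype"}
    raise ValueError("Unrecognized request envelope")


# Trimmed-down sample events for both platforms.
alexa_event = {"request": {"type": "IntentRequest", "intent": {
    "name": "MainKpis", "slots": {"MARKET": {"name": "MARKET", "value": "Germany"}}}}}
apiai_event = {"result": {"metadata": {"intentName": "MainKpis"},
                          "parameters": {"market": "Germany"}}}
```

Both sample events end up in the same `answer` call. Only the thin adapters around it know about platform-specific envelopes, which is what makes it cheap to add further channels such as chatbots.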
However, there are differences between chatbots and voice assistants, even if they are powered by the same business logic and fulfillment layer. This brings us to our next section: “Alexa, what are my lessons learned?”. You can also check our previous blog post “Alexa, what are my KPIs?” to catch up on how voice assistants support business decisions in a corporate environment.