Why Far-field Voice Technology is Disrupting the Status Quo for Voice-enabled Product Development
Scott Wiley on the future of far-field voice and audio technology for voice-enabled consumer and industrial products — and why OEMs should take notice.
First, how did ArkX Labs come about?
ArkX Laboratories and the EveryWord™ portfolio were born out of the realization that, when it comes to voice-enabled devices and speech recognition, there is real demand in the marketplace for a better voice experience. We believed that the “good enough” user experience with voice technology was no longer good enough for today’s consumers.
Our hypothesis was validated when we talked to clients and did additional research. The data showed that a significant percentage of consumers were not happy with the performance of the standard built-in solutions currently on the market. The biggest complaint was the inability of many existing products to capture voice commands accurately and clearly. That is, the device doesn’t always hear you (or understand what you say) because you may be too far away, there may be too much background noise (such as streaming music or a movie on the TV), or objects between you and the device may be interfering. Another frustration was that the device hears words that sound similar to the trigger word and becomes confused. “Garbage in, garbage out,” as they say. The bottom line: we learned that people are tired of screaming at their devices.
From an OEM perspective, there are real limitations with the current built-in options that dominate the voice space. A growing number of companies across a wide variety of verticals want a much higher standard of performance, something that can be customized to work seamlessly within their ecosystems and is uniquely “ownable” by their brand. From a business point of view, a better customer experience can increase brand value and translate into higher margins.
Using our acoustic know-how and experience, our audio and voice engineers developed a portfolio of advanced far-field voice solutions, featuring Cirrus and NXP technologies, that outperformed existing solutions in every test. Partnering with Ark Electronics’ manufacturing expertise and capabilities, we have produced a production-ready, Alexa Voice Service-qualified solution that allows existing OEMs and start-ups to save development time and cost, mitigate risk, and accelerate their time-to-market. We also offer a highly integrated vertical software stack developed to facilitate easy far-field voice capture on the audio front end, with Amazon’s Alexa Voice Service and other cloud services on the back end.
The interest from the makers of smart hubs, IoT devices, video conferencing systems, smart speakers, kiosks, robotics, and others has been tremendous.
What makes your EveryWord™ voice technology stand out among other far-field solutions in the marketplace?
Our far-field capture is based on 3-D reverberation science. Using 3-D reverberation delivers more noise reduction, three times the usable range, and more accurate real-world trigger-word performance versus the traditional beamforming technology used by many competitors. Our technology doesn’t rely on geometric constraints to define microphone configuration, placement, or orientation, which frees industrial designers to achieve their visions. Older beamforming technologies often produced false positives and false negatives, or required users to shout repeatedly to be heard accurately. It was clear there needed to be a better way; 3-D reverberation overcomes those problems.
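For readers unfamiliar with the traditional approach being contrasted here: a beamformer steers a microphone array toward the talker by time-aligning each channel and summing. The minimal delay-and-sum sketch below (hypothetical names; NumPy only, not ArkX code) shows why such systems are tied to geometry: the steering delays are computed directly from the microphone positions and an assumed talker direction, which is exactly the geometric constraint the interview says 3-D reverberation avoids.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, source_dir, fs, c=343.0):
    """Classic delay-and-sum beamformer (far-field plane-wave model).

    signals:       (n_mics, n_samples) array of microphone channels
    mic_positions: (n_mics, 3) microphone coordinates in metres
    source_dir:    unit vector pointing from the array toward the talker
    fs:            sample rate in Hz; c is the speed of sound in m/s
    """
    n_mics, n_samples = signals.shape
    # A mic closer to the talker (larger p.u) hears the wavefront earlier,
    # so it must be delayed by (p.u)/c to line up with the other channels.
    delays = mic_positions @ source_dir / c
    delays -= delays.min()                      # keep all delays non-negative
    freqs = np.fft.rfftfreq(n_samples, 1.0 / fs)
    out = np.zeros(n_samples)
    for m in range(n_mics):
        # fractional-sample delay applied as a phase ramp in the frequency domain
        spec = np.fft.rfft(signals[m]) * np.exp(-2j * np.pi * freqs * delays[m])
        out += np.fft.irfft(spec, n_samples)
    return out / n_mics                         # aligned channels add coherently
```

Because the delays depend directly on `mic_positions` and the steering direction, the microphone layout is baked into the algorithm; a mis-steered beam is one source of the false positives and negatives mentioned above.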
In addition, 3-D reverberation technology tolerates fixed and moving obstructions in the audio path, making it well suited to complex living spaces, workspaces, or places with competing talkers.
Another game-changer is our use of 12 independent Acoustic Echo Cancellers (versus the competition’s standard one or two), which provides superior barge-in performance. The difference shows in the results and in a much better user experience.
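An acoustic echo canceller (AEC) subtracts the device’s own playback, as picked up by the microphone, so a spoken command can barge in over music or a movie. Purely as a hedged illustration of what a single canceller does (the function and parameter names below are invented, not ArkX’s implementation), here is a minimal normalized-LMS (NLMS) adaptive filter:

```python
import numpy as np

def nlms_echo_cancel(mic, ref, filt_len=32, mu=0.5, eps=1e-8):
    """Single-channel NLMS acoustic echo canceller.

    mic: microphone signal = near-end speech + echo of `ref`
    ref: far-end reference (e.g. the music the device is playing)
    Returns the echo-suppressed residual (error) signal.
    """
    w = np.zeros(filt_len)              # adaptive FIR estimate of the echo path
    x = np.zeros(filt_len)              # sliding window of recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = np.roll(x, 1)
        x[0] = ref[n]                   # newest reference sample first
        echo_est = w @ x                # predicted echo at this instant
        e = mic[n] - echo_est           # residual = near-end speech + model mismatch
        w += mu * e * x / (x @ x + eps) # normalized LMS weight update
        out[n] = e
    return out
```

Running several independent cancellers, as described above, lets separate echo paths (for example, multiple loudspeaker channels) each be modeled and removed, which is why barge-in can keep working at full playback volume.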
A final point of difference is the fact that EveryWord™ is platform neutral. Thus, our solution is simultaneously compatible with multiple voice services and trigger-word providers, including Alexa, Google, Siri, Cortana, AliGenie, Baidu/Kitt.ai, Tencent, and Sensory. The benefit here is you are not locked into any voice service provider. It is very flexible.
So, you talk about performance. Can you give us a few examples?
When we demonstrate our technology to brand owners and OEMs, it’s really no contest. EveryWord™ works beyond 9 meters, even under the most demanding conditions, and can hear you clearly down a hallway, around the corner, or in another room. In comparison, many other voice devices don’t work beyond 3 meters. Where you really see the performance difference is how EveryWord™ hears and understands the user over loud background noise by filtering out competing audio distractions. In fact, EveryWord’s barge-in performance is so strong that your voice agent understands your commands without having to drop the playback volume of your TV, for example. Try giving your existing Alexa device a command while streaming music in the background. I’ll bet you will have to scream and repeat yourself more than a few times to get it to respond correctly.
The bottom line is that this technology can easily be built into most electronic devices, allowing nearly any standard device to be converted into a superior-performing voice-operated device. Our solutions enable exceptionally enhanced human-to-human and human-to-machine speech recognition compared to anything in the marketplace today.
You mentioned a number of industry verticals and applications. Is this a fit for everyone?
Our flexible solutions work as a horizontal platform across many verticals, and each vertical application is unique. Since our voice solutions can be customized for a company’s ecosystem, they apply to a wide range of products, well beyond smart speakers and digital assistants. We are in the testing phase with multiple global brands on smart hub, smart appliance, TV, and video conferencing products. For Smart Home applications, the modules can be installed in hubs, ceilings, and in-wall. There has been a lot of interest from both industrial and consumer robotics companies. In a hands-free world, we can enable great experiences, from a classroom-in-a-box to lobby check-in (hospitality or healthcare) and even hands-free point-of-sale (POS) products and kiosks. We see endless possibilities.
What does the EveryWord™ product line consist of?
To date, our portfolio features an Audio Front End (AFE) Voice Processing Module, an Integrated Voice Module (SOM + audio board with AFE), and an AVS Development Kit suitable for testing. We also offer a vertical software stack developed to integrate far-field voice capture on the audio front end with Amazon’s Alexa Voice Service and other cloud services on the back end.
Last question: What comes next for ArkX?
On the technical track, we are focused on biometrics, privacy, and Artificial Intelligence (AI) applications for our hardware and software solutions. As we transition out of the COVID-19 crisis, it is easy to envision a future where hands-free solutions utilizing far-field voice interfaces become more common. I believe better voice capture and AI will be the key drivers.
Yukuh Tung, VP at ArkX, recently said he believes that smart voice, the combination of voice-as-an-interface and AI, is still in its infancy. However, as its use explodes in the coming years, the impact on the human experience will grow by orders of magnitude.
I’m completely in alignment with Yukuh. As AI becomes smarter, natural language understanding (NLU) gets better, and machine learning gets faster. As a result, the value of AI devices for the consumer grows exponentially. As human interaction with AI becomes more natural and efficient, the appeal of using smart voice grows. At some point, you could have a healthcare version or a hospitality version of your voice assistant. Or you could have a single type of platform that knows who you are, where you are, or what you are doing. It then understands the context of your actions and adjusts accordingly.
Keep in mind, many devices still use push-to-talk technology: they aren’t listening for commands until you physically push a button. We eliminate the “push” step by using spoken trigger words instead. While this is easier than walking across the room and pressing a button, it’s effectively the same thing, and it’s still an artificial experience to some extent. Everyone knows that “Alexa,” “Hey Google,” and “Siri” are trigger words. Moving forward, I believe voice interaction is going to be so common that we will get away from needing those trigger words. Instead, devices will understand the context of the spoken word and will listen to (or ignore) the person speaking in an appropriate way.
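The shift this passage describes, from pressing a button to an always-listening trigger word, amounts to a small state machine: stay idle until the trigger is detected, then treat what follows as the command. A toy sketch of that flow (hypothetical names; a real detector matches audio features, not text tokens):

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()       # listening only for the trigger word
    LISTENING = auto()  # trigger heard; now capturing the command

def run_session(tokens, trigger="alexa"):
    """Toy always-listening loop: ignore speech until the trigger word
    arrives, then collect the following words as the command."""
    state, command = State.IDLE, []
    for word in tokens:
        if state is State.IDLE:
            if word.lower() == trigger:
                state = State.LISTENING   # "push" replaced by a spoken word
        else:
            if word == "<silence>":       # end-of-utterance marker
                break
            command.append(word)
    return command
```

The context-aware future described above would, in effect, remove the explicit trigger check entirely and decide from context whether speech is addressed to the device.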
We currently see clients approaching us, and they aren’t just the early innovators. A second group is coming in that wants to capitalize on that early brand success by integrating voice into their enterprise and digital solutions. AI, local libraries, and edge computing will enable contextual ecosystems for brands that bypass the current back-end, cloud-based requirements of general-purpose voice. These solutions will still be able to access the cloud when needed, but they won’t be required to. We are at the leading edge of that technology. Integrating into different back ends or existing ecosystems is something we’ve done several times. It’s a micro-specialty, and we are very good at it.
Customers who engage us gain significant value-add because we know how to navigate that software stack and plug into those existing software solutions at the edge or in the cloud. We speed up their time to market, we lower their cost of implementation, and we reduce their project risk. In summary, that’s what we do and that’s what we can do for you, too.