3D Acoustic Design vs. Beamforming Technology for Voice Recognition Performance
Until recently, the bane of human-to-machine voice interaction has been that machines lack our brain’s ability to filter out noise and understand speech under real-world conditions. Nothing destroys the voice-agent experience faster than having to repeat yourself, raise your voice, or move closer before the agent understands you. Most frustrating of all is telling your smart speaker to turn down its playback volume when, ironically, it can’t hear you over that very playback.
At ArkX Laboratories, we’ve developed advanced voice solutions that deliver on the promise of enhanced noise reduction, significantly extended voice-capture range, and more accurate verbal interaction compared with the traditional beamforming technology used by many competitors.
What’s the secret?
Exploiting reverberant energy in all three dimensions. Conventional beamforming grew out of free-field acoustics, which focuses on the direct (i.e., shortest) path between the person speaking and the mic array. If you happen to live in an anechoic chamber, this approach is ideal; in real home and office environments, however, reverberant energy starts exceeding direct-path energy only a short distance away from the talker.
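To see why reverberant energy takes over so quickly, consider the standard critical-distance estimate from room acoustics. This is a textbook calculation, not ArkX-specific; the room volume and RT60 values below are illustrative assumptions:

```python
import math

def critical_distance(volume_m3, rt60_s):
    """Sabine-based estimate of the critical distance: the range at which
    reverberant energy equals direct-path energy (omnidirectional source)."""
    return 0.057 * math.sqrt(volume_m3 / rt60_s)

def direct_to_reverb_ratio_db(distance_m, volume_m3, rt60_s):
    """Direct-to-reverberant ratio (DRR) in dB. Direct energy falls off as
    1/r^2 while reverberant energy is roughly constant throughout the room,
    so DRR = 20 * log10(d_c / r)."""
    d_c = critical_distance(volume_m3, rt60_s)
    return 20.0 * math.log10(d_c / distance_m)

# A typical living room: 60 m^3 with an RT60 of 0.5 s.
d_c = critical_distance(60.0, 0.5)                     # ≈ 0.62 m
drr_at_3m = direct_to_reverb_ratio_db(3.0, 60.0, 0.5)  # ≈ -13.6 dB
```

In this example room, the crossover happens barely 0.6 m from the talker; at 3 m the direct path sits roughly 14 dB below the reverberant field. That reverberant majority is exactly the energy a free-field beamformer treats as interference and a 3-D approach can put to work.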
The bottom line: by focusing on the acoustic energy in all three dimensions, our far-field algorithms better characterize and suppress noise and capture speech, enabling a smart speaker or other voice agent to hear better, especially at longer range, around acoustic obstructions, and in noisier environments.
Acoustic performance is not the only benefit of this approach. Until now, algorithmic requirements often limited the positioning, configuration, and orientation of mic arrays, which in turn constrained the industrial designer’s vision. The restriction arose because conventional algorithms use a planar array to view the world in the 2-D plane of that array. ArkX Labs’ algorithms, by contrast, hear the world in three dimensions by exploiting reverberation. Microphones can therefore be placed in previously forbidden locations, and the product can be mounted on walls, on ceilings, or at odd angles without killing performance.
The growing demand for more accurate human-to-machine voice interaction is only half the story. The COVID crisis has highlighted the need for more natural, intelligible human-to-human communication between people working and living apart. People today are talking to their boss remotely while their dog is barking, the family is watching TV, and the lawn is being mowed outside the window. All the benefits of exploiting reverberation in 3 dimensions apply equally to human-to-human applications.
Our ArkX Labs EveryWord™ solution separates itself from the competition in what many experts consider the most important function of a far-field audio front end: acoustic echo cancellation (AEC). In simple terms, AEC is what enables a smart speaker to hear you over itself during loud playback, and what lets you hold a natural full-duplex conversation without echo. To support reliable barge-in during loud playback, or an echo-free conversation, the AEC must ideally create and maintain a unique model for each acoustic path between each microphone and each loudspeaker. For a 4-microphone stereo solution, that means 4 × 2 = 8 AECs. Most other solutions offer only 1 or 2, while EveryWord™ provides 12.
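The per-path bookkeeping above, and the kind of adaptive filter that sits behind each AEC instance, can be sketched as follows. The canceller shown is a minimal normalized-LMS filter for a single microphone/loudspeaker path, a standard textbook AEC building block, not ArkX’s proprietary implementation:

```python
import numpy as np

def aec_instances(num_mics, num_playback_channels):
    """One adaptive echo-cancellation filter is needed per
    microphone/loudspeaker acoustic path."""
    return num_mics * num_playback_channels

def nlms_aec(mic, ref, taps=128, mu=0.5, eps=1e-8):
    """Normalized-LMS echo canceller for one mic/speaker path.
    `mic` is the microphone signal (near-end speech plus echo), `ref` the
    playback reference; returns the echo-suppressed output."""
    w = np.zeros(taps)
    out = np.zeros_like(mic, dtype=float)
    for n in range(taps, len(mic)):
        x = ref[n - taps + 1 : n + 1][::-1]  # most recent reference samples
        echo_hat = w @ x                     # current echo estimate
        e = mic[n] - echo_hat                # residual = near-end estimate
        w += mu * e * x / (x @ x + eps)      # NLMS weight update
        out[n] = e
    return out

# The 4-microphone stereo product from the text needs 8 such filters.
assert aec_instances(4, 2) == 8
```

Each of those filters tracks its own room echo path independently, which is why the count scales multiplicatively with microphones and playback channels.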
Using this advanced approach to far-field voice capture and speech recognition, our EveryWord™ technology tolerates fixed and moving obstructions in the audio path. This makes it well suited to extremely noisy and reverberant living spaces and workspaces, and to places with competing talkers and noise. For example, our solutions can capture voice commands from across the room (>9 meters) even during loud music or audio playback and in the presence of competing conversations. Permanent obstructions directly in the voice path, such as furniture, architectural columns, and other physical barriers, are likewise not a problem. And in human-to-human communications (conference speakers, for instance), EveryWord™ also provides audio output processing that enhances the fidelity and volume of playback, preserving the naturalness and intelligibility of the voice originating on the other end of the call or from the audio source. Beamforming technologies employed by most OEM devices simply cannot compete.
Best of all, ArkX offers solutions that are platform-neutral. EveryWord™ is simultaneously compatible with multiple voice services and trigger-word providers, including Alexa, Google, Siri, Cortana, AliGenie, Baidu/Kitt.ai, Tencent, and Sensory. Users can simply pick and choose the best skills from each platform and craft a solution that best suits their needs.