The Vuzix M300XL and M300 audio architecture provides, in addition to the expected application audio capture and playback channels, a speech recognition audio capture channel and a low-power DSP-based trigger word detection speech recognizer. The following diagram represents the architecture of the audio system.

M300XL and M300 Audio Subsystem Conceptual Diagram

Audio Playback

Audio playback is a straightforward audio output path, with appropriate overload protection and filtering blocks in the audio DSP. User software cannot configure it beyond the standard Android AudioManager controls.

Audio Capture

Application Audio

Audio capture for Android application audio input uses two microphones: a user microphone and an environment microphone. The DSP audio processor combines their signals to allow "beam forming", or controllable directionality. The microphones may therefore emphasize the user's voice while cancelling environmental sound, emphasize the environment while cancelling user-generated sound (e.g. breathing noise), or operate omni-directionally. Additionally, acoustic echo cancellation removes sound that originates from the audio output and speaker from the captured signal. The captured audio is filtered, noise reduced, and equalized before it becomes the Android microphone input for application audio using MediaRecorder. Most MediaRecorder audio sources use this audio path, adjusting beam forming and audio processing parameters as appropriate.

The Android application audio input is, by the design of the Android OS, a singleton resource. Only one application may "own" the audio capture at any one time, so a foreground activity cannot consume captured audio while a background service simultaneously does the same.

Speech Audio

A second channel for audio capture is implemented as well, intended for use by speech recognition and other software-interpreted audio usage. This channel operates independently of the Android application audio channel.

Speech Audio is optimized for user microphone input; it cancels environmental sound and coupled sound from the output channel. However, it does not implement adaptive filtering, noise suppression, or other algorithms that may introduce artifacts to which machine recognition is vulnerable. This channel is also available to Android application audio when AudioSource.VOICE_RECOGNITION is selected.

This is a background audio stream with a delay of approximately ¼ second. Because of this latency, developers should expect to hear "pre-roll" in the captured audio. To prevent speech from being cut off early, developers should also continue capturing for an additional ¼ second after the user requests the capture to stop.
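The extra ¼ second of capture can be expressed as a byte count to keep reading after the stop request. The following is a minimal sketch, not a Vuzix-provided API: the class and method names are hypothetical, and the 16 kHz, mono, 16-bit PCM format used in the example is an assumption, since the stream format is not stated here.

```java
// Hypothetical helper: how much extra PCM data to read after a stop request
// so that the approximately 1/4 second stream latency does not clip speech.
final class SpeechCaptureTail {
    // Approximate Speech Audio latency described in the text (1/4 second).
    static final int LATENCY_MS = 250;

    // Number of bytes of PCM audio that correspond to the latency window.
    static int latencyBytes(int sampleRateHz, int channels, int bytesPerSample) {
        long frames = (long) sampleRateHz * LATENCY_MS / 1000;
        return (int) (frames * channels * bytesPerSample);
    }

    public static void main(String[] args) {
        // Assumed format: 16 kHz, mono, 16-bit samples.
        int tail = latencyBytes(16000, 1, 2);
        System.out.println("Keep reading " + tail + " more bytes after stop");
    }
}
```

In a capture loop, the application would keep reading from the stream until this many additional bytes have arrived after the stop request, and only then release the recorder.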

Audio Sources

The developer controls the DSP processing by selecting the appropriate AudioSource as defined by MediaRecorder.AudioSource. The following describes the audio processing in M300XL and M300 version 1.5.

  • DEFAULT: Identical to MIC

  • MIC: The noise cancellation emphasizes the user mic and reduces sound from the environment.

  • CAMCORDER: The noise cancellation emphasizes the environment mic and reduces sound from the user.

  • VOICE_COMMUNICATION: Similar to MIC, but adds acoustic echo cancellation so speaker sounds are not heard by the microphone. This is intended for use during Voice over IP (VoIP) and video calls.

  • VOICE_RECOGNITION: The high-latency Speech Audio stream from the DSP.
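As an illustration of how an application might choose among these sources, the sketch below maps a few use cases to the source described for each in the list above. The UseCase enum and the sourceFor method are hypothetical names introduced for this example. The integer constants mirror the values of android.media.MediaRecorder.AudioSource so the sketch can run off-device; on the device, use the MediaRecorder.AudioSource constants directly.

```java
// Hypothetical chooser that mirrors the audio source list above.
final class AudioSourceChooser {
    // Values copied from android.media.MediaRecorder.AudioSource.
    static final int MIC = 1;
    static final int CAMCORDER = 5;
    static final int VOICE_RECOGNITION = 6;
    static final int VOICE_COMMUNICATION = 7;

    // Illustrative use cases; these are not part of any Vuzix or Android API.
    enum UseCase { USER_VOICE_MEMO, SCENE_RECORDING, VOIP_CALL, SPEECH_RECOGNIZER }

    static int sourceFor(UseCase useCase) {
        switch (useCase) {
            case SCENE_RECORDING:
                return CAMCORDER;            // emphasizes the environment mic
            case VOIP_CALL:
                return VOICE_COMMUNICATION;  // like MIC, plus echo cancellation
            case SPEECH_RECOGNIZER:
                return VOICE_RECOGNITION;    // high-latency Speech Audio stream
            case USER_VOICE_MEMO:
            default:
                return MIC;                  // emphasizes the user mic
        }
    }

    public static void main(String[] args) {
        for (UseCase useCase : UseCase.values()) {
            System.out.println(useCase + " -> " + sourceFor(useCase));
        }
    }
}
```

The returned value would be passed as the audio source when constructing an AudioRecord or when calling MediaRecorder.setAudioSource.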

Native Speech Audio Capture

At the native code layer, this second audio capture channel is implemented as a local socket interface. Hence multiple (currently five) clients may "listen" to this channel simultaneously. The internal Vuzix Sensory command and control recognizer may listen, an application using AudioSource.VOICE_RECOGNITION may listen, an external cloud recognizer service may listen, and a user-defined native trigger daemon may listen, all simultaneously, to this capture stream.

Contact Vuzix Technical Support if your application requires a native speech interface.

Hotword Detection

The device is capable of waking on the hotword trigger "Hello Vuzix." Since the hotword detection occurs completely within the very low power audio DSP, the application processor and supporting devices may be in a low-power or power-off state while the DSP is listening for the trigger. Upon recognition of the phrase, the DSP wakes the application processor.

The consequence of this implementation is that, while voice control and recognition are implemented in application processor software and hence may be localized to a user's preferred language, the "Hello Vuzix" trigger phrase is an immutable part of the DSP software and may not be changed.

Note, however, that a user-defined trigger in application software is entirely possible. In such a design, avoid the DSP hotword trigger to prevent confusion, and disable the low-power wake-on-hotword capability.
