Patent: Method of waking a device using spoken voice commands
Publication Number: 20250168567
Publication Date: 2025-05-22
Assignee: Magic Leap
Abstract
Disclosed herein are systems and methods for processing speech signals in mixed reality applications. A method may include receiving an audio signal; determining, via a first one or more processors, whether the audio signal comprises a voice onset event; in accordance with a determination that the audio signal comprises the voice onset event: waking a second one or more processors; determining, via the second one or more processors, whether the audio signal comprises a predetermined trigger signal; in accordance with a determination that the audio signal comprises the predetermined trigger signal: waking a third one or more processors; performing, via the third one or more processors, automatic speech recognition based on the audio signal; and in accordance with a determination that the audio signal does not comprise the predetermined trigger signal: forgoing waking the third one or more processors; and in accordance with a determination that the audio signal does not comprise the voice onset event: forgoing waking the second one or more processors.
Claims
1. A system comprising:
a head-wearable device configured to be worn by a user, the head-wearable device comprising:
a first microphone configured to rest at a first height with respect to the user's face when the head-wearable device is worn by the user;
a second microphone configured to rest at a second height with respect to the user's face when the head-wearable device is worn by the user, the second height different than the first height; and
one or more processors configured to perform a method comprising:
receiving, via the first microphone, a first microphone output based on an audio signal, the audio signal provided by the user;
receiving, via the second microphone, a second microphone output based on the audio signal;
determining, via a first processor of the one or more processors, whether the audio signal comprises a voice onset event;
in accordance with a determination that the audio signal comprises the voice onset event:
waking one or more second processors of the one or more processors;
determining, via the one or more second processors of the one or more processors, whether the audio signal comprises a predetermined trigger signal;
in accordance with a determination that the audio signal comprises the predetermined trigger signal:
performing, via the one or more second processors of the one or more processors, automatic speech recognition based on the audio signal; and
in accordance with a determination that the audio signal does not comprise the predetermined trigger signal:
forgoing performing the automatic speech recognition; and
in accordance with a determination that the audio signal does not comprise the voice onset event:
forgoing waking the one or more second processors of the one or more processors,
wherein said determining whether the audio signal comprises the voice onset event comprises determining a probability of voice activity with respect to the audio signal, wherein said determining the probability of voice activity comprises performing beamforming based on the first microphone output and the second microphone output.
2. The system of claim 1, wherein the method further comprises: in accordance with the determination that the audio signal comprises the predetermined trigger signal: providing, via the one or more second processors, an audio stream based on the audio signal; identifying an endpoint of the audio signal; and ceasing to provide the audio stream in response to identifying the endpoint.
3. The system of claim 1, wherein the first processor of the one or more processors comprises one or more of an application-specific integrated circuit or a digital signal processor, configured to determine whether the audio signal comprises the voice onset event.
4. The system of claim 1, wherein the one or more second processors comprise one or more of a digital signal processor or an application-specific integrated circuit.
5. The system of claim 1, wherein: the head-wearable device comprises the first processor of the one or more processors and a second processor of the one or more second processors, and the system further comprises an auxiliary unit comprising a third processor of the one or more second processors, wherein the auxiliary unit is external to the head-wearable device and configured to communicate with the head-wearable device.
6. The system of claim 1, wherein the predetermined trigger signal comprises a phrase.
7. The system of claim 1, wherein the method further comprises storing the audio signal in a buffer.
8. The system of claim 1, wherein the method further comprises: in accordance with the determination that the audio signal comprises the voice onset event: performing, via the one or more second processors, acoustic echo cancellation based on the audio signal.
9. The system of claim 1, wherein the method further comprises: in accordance with the determination that the audio signal comprises the voice onset event: performing, via the one or more second processors, beamforming based on the audio signal.
10. The system of claim 1, wherein the method further comprises: in accordance with the determination that the audio signal comprises the voice onset event: performing, via the one or more second processors, noise reduction based on the audio signal.
11. The system of claim 1, wherein said determining whether the audio signal comprises the voice onset event further comprises determining an amount of voice activity in the audio signal.
12. The system of claim 11, wherein said determining whether the audio signal comprises the voice onset event further comprises determining whether the amount of voice activity in the audio signal is greater than a voice activity threshold.
13. The system of claim 1, wherein said determining whether the audio signal comprises the voice onset event comprises determining whether the audio signal comprises a voice of the user.
14. The system of claim 1, wherein a difference between the first microphone output and the second microphone output indicates whether the audio signal comprises a voice of the user.
15. The system of claim 1, wherein said determining whether the audio signal comprises the voice onset event further comprises determining a second probability of voice activity with respect to the audio signal based on the first microphone output.
16. A method comprising:
receiving, via a first microphone, a first microphone output based on an audio signal, the audio signal provided by a user;
receiving, via a second microphone, a second microphone output based on the audio signal;
determining, via a first processor of one or more processors, whether the audio signal comprises a voice onset event;
in accordance with a determination that the audio signal comprises the voice onset event:
waking one or more second processors of the one or more processors;
determining, via the one or more second processors of the one or more processors, whether the audio signal comprises a predetermined trigger signal;
in accordance with a determination that the audio signal comprises the predetermined trigger signal:
performing, via the one or more second processors of the one or more processors, automatic speech recognition based on the audio signal; and
in accordance with a determination that the audio signal does not comprise the predetermined trigger signal:
forgoing performing the automatic speech recognition; and
in accordance with a determination that the audio signal does not comprise the voice onset event:
forgoing waking the one or more second processors of the one or more processors,
wherein: a head-wearable device comprises the first microphone and further comprises the second microphone, the first microphone is configured to rest at a first height with respect to the user's face when the head-wearable device is worn by the user, the second microphone is configured to rest at a second height with respect to the user's face when the head-wearable device is worn by the user, the second height different than the first height, and said determining whether the audio signal comprises the voice onset event comprises determining a probability of voice activity with respect to the audio signal, wherein the determining the probability of voice activity comprises performing beamforming based on the first microphone output and the second microphone output.
17. The method of claim 16, wherein the predetermined trigger signal comprises a phrase.
18. The method of claim 16, further comprising storing the audio signal in a buffer.
19. The method of claim 16, further comprising determining, based on a difference between the first microphone output and the second microphone output, whether the audio signal comprises a voice of the user.
20. A non-transitory computer-readable medium storing instructions, which, when executed by one or more processors, cause the one or more processors to perform a method comprising:
receiving, via a first microphone, a first microphone output based on an audio signal, the audio signal provided by a user;
receiving, via a second microphone, a second microphone output based on the audio signal;
determining, via a first processor of one or more processors, whether the audio signal comprises a voice onset event;
in accordance with a determination that the audio signal comprises the voice onset event:
waking one or more second processors of the one or more processors;
determining, via the one or more second processors of the one or more processors, whether the audio signal comprises a predetermined trigger signal;
in accordance with a determination that the audio signal comprises the predetermined trigger signal:
performing, via the one or more second processors of the one or more processors, automatic speech recognition based on the audio signal; and
in accordance with a determination that the audio signal does not comprise the predetermined trigger signal:
forgoing performing the automatic speech recognition; and
in accordance with a determination that the audio signal does not comprise the voice onset event:
forgoing waking the one or more second processors of the one or more processors,
wherein: a head-wearable device comprises the first microphone and further comprises the second microphone, the first microphone is configured to rest at a first height with respect to the user's face when the head-wearable device is worn by the user, the second microphone is configured to rest at a second height with respect to the user's face when the head-wearable device is worn by the user, the second height different than the first height, and said determining whether the audio signal comprises the voice onset event comprises determining a probability of voice activity with respect to the audio signal, wherein the determining the probability of voice activity comprises performing beamforming based on the first microphone output and the second microphone output.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a Continuation of U.S. Non-Provisional application Ser. No. 18/418,131, filed Jan. 19, 2024, which is a Continuation of U.S. Non-Provisional application Ser. No. 17/214,446, filed Mar. 26, 2021, which claims benefit of U.S. Provisional Application No. 63/001,116, filed Mar. 27, 2020, and U.S. Provisional Application No. 63/033,451, filed Jun. 2, 2020, the contents of which are incorporated herein by reference in their entirety.
FIELD
This disclosure relates in general to systems and methods for processing speech signals, and in particular to systems and methods for processing speech signals in a mixed reality environment.
BACKGROUND
Systems for speech recognition are tasked with receiving audio input representing human speech, typically via one or more microphones, and processing the audio input to determine words, logical structures, or other outputs corresponding to that audio input. For example, automatic speech recognition (ASR) systems may generate a text output based on the human speech corresponding to an audio input signal; and natural language processing (NLP) tools may generate logical structures, or computer data, corresponding to the meaning of that human speech. While such systems may contain any number of components, at the heart of such systems is a speech processing engine, which is a component that accepts an audio signal as input, performs some recognition logic on the input, and outputs some text corresponding to that input. (While reference is made herein to speech processing engines, other forms of speech processing besides speech recognition should also be considered within the scope of the disclosure.)
Historically, audio input was provided to speech processing engines in a structured, predictable manner. For example, a user might speak directly into a microphone of a desktop computer in response to a first prompt (e.g., “Begin Speaking Now”); immediately after pressing a first button input (e.g., a “start” or “record” button, or a microphone icon in a software interface); or after a significant period of silence. Similarly, a user might stop providing microphone input in response to a second prompt (e.g., “Stop Speaking”); immediately before pressing a second button input (e.g., a “stop” or “pause” button); or by remaining silent for a period of time. Such structured input sequences left little doubt as to when the user was providing input to a speech processing engine (e.g., between a first prompt and a second prompt, or between pressing a start button and pressing a stop button). Moreover, because such systems typically required deliberate action on the part of the user, it could generally be assumed that a user's speech input was directed to the speech processing engine, and not to some other listener (e.g., a person in an adjacent room). Accordingly, many speech processing engines of the time may not have had any particular need to identify, from microphone input, which portions of the input were directed to the speech processing engine and were intended to provide speech recognition input, and conversely, which portions were not.
The ways in which users provide speech recognition input have changed as speech processing engines have become more pervasive and more fully integrated into users' everyday lives. For example, some automated voice assistants are now housed in or otherwise integrated with household appliances, automotive dashboards, smart phones, wearable devices, “living room” devices (e.g., devices with integrated “smart” voice assistants), and other environments far removed from the conventional desktop computer. In many cases, speech processing engines are made more broadly usable by this level of integration into everyday life. However, these systems would be made cumbersome by system prompts, button inputs, and other conventional mechanisms for demarcating microphone input to the speech processing engine. Instead, some such systems place one or more microphones in an “always on” state, in which the microphones listen for a “wake-up word” (e.g., the “name” of the device or any other predetermined word or phrase) that denotes the beginning of a speech recognition input sequence. Upon detecting the wake-up word, the speech processing engine can process the following sequence of microphone input as input to the speech processing engine.
While the wake-up word system replaces the need for discrete prompts or button inputs for speech processing engines, it can be desirable to minimize the amount of time the wake-up word system is required to be active. For example, mobile devices operating on battery power benefit from both power efficiency and the ability to invoke a speech processing engine (e.g., invoking a smart voice assistant via a wake-up word). For mobile devices, constantly running the wake-up word system to detect the wake-up word may undesirably reduce the device's power efficiency. Ambient noises or speech other than the wake-up word may be continually processed and transcribed, thereby continually consuming power. However, processing and transcribing ambient noises or speech other than the wake-up word may not justify the required power consumption. It therefore can be desirable to minimize the amount of time the wake-up word system is required to be active without compromising the device's ability to invoke a speech processing engine.
In addition to reducing power consumption, it is also desirable to improve the accuracy of speech recognition systems. For example, a user who wishes to invoke a smart voice assistant may become frustrated if the smart voice assistant does not accurately respond to the wake-up word. The smart voice assistant may respond to an acoustic event that is not the wake-up word (i.e., a false positive), fail to respond to the wake-up word (i.e., a false negative), or respond too slowly to the wake-up word (i.e., lag). Inaccurate responses such as these may frustrate the user, degrade the user experience, and erode trust in the reliability of the product's speech processing engine interface. It therefore can be desirable to develop a speech recognition system that accurately responds to user input.
BRIEF SUMMARY
Examples of the disclosure describe systems and methods for processing speech signals in mixed reality applications. According to examples of the disclosure, a method may include receiving, via a microphone, an audio signal; determining, via a first one or more processors, whether the audio signal comprises a voice onset event; in accordance with a determination that the audio signal comprises the voice onset event: waking a second one or more processors; determining, via the second one or more processors, whether the audio signal comprises a predetermined trigger signal; in accordance with a determination that the audio signal comprises the predetermined trigger signal: waking a third one or more processors; performing, via the third one or more processors, automatic speech recognition based on the audio signal; and in accordance with a determination that the audio signal does not comprise the predetermined trigger signal: forgoing waking the third one or more processors; and in accordance with a determination that the audio signal does not comprise the voice onset event: forgoing waking the second one or more processors.
In some embodiments, a method comprises: receiving, via a first microphone, an audio signal; determining, via a first one or more processors, whether the audio signal comprises a voice onset event; in accordance with a determination that the audio signal comprises the voice onset event: waking a second one or more processors; determining, via the second one or more processors, whether the audio signal comprises a predetermined trigger signal; in accordance with a determination that the audio signal comprises the predetermined trigger signal: waking a third one or more processors; performing, via the third one or more processors, automatic speech recognition based on the audio signal; and in accordance with a determination that the audio signal does not comprise the predetermined trigger signal: forgoing waking the third one or more processors; and in accordance with a determination that the audio signal does not comprise the voice onset event: forgoing waking the second one or more processors.
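For readers tracing the control flow, a minimal sketch of this staged wake logic follows. The `Processor` handle and the `onset_detector`, `trigger_detector`, and `asr` callables are hypothetical stand-ins for the first, second, and third one or more processors; this illustrates only the gating order, not the claimed implementation.

```python
# Illustrative sketch of the staged wake pipeline (assumptions noted above).
from dataclasses import dataclass

@dataclass
class Processor:
    name: str
    awake: bool = False

    def wake(self):
        # Waking a stage transitions it from a low-power state to active duty.
        self.awake = True

def handle_audio(audio, onset_detector, trigger_detector, asr,
                 dsp_stage: Processor, asr_stage: Processor):
    """Gate each successively more power-hungry stage on the cheaper one."""
    # Stage 1: always-on, low-power voice onset detection (e.g., ASIC/DSP).
    if not onset_detector(audio):
        return None  # forgo waking the second stage; remain in low power

    dsp_stage.wake()  # second one or more processors

    # Stage 2: predetermined trigger (wake-word) detection on the woken DSP.
    if not trigger_detector(audio):
        return None  # forgo waking the third stage

    asr_stage.wake()  # third one or more processors

    # Stage 3: full automatic speech recognition on the general-purpose CPU.
    return asr(audio)
```

Because stage 1 runs continuously while stages 2 and 3 sleep, average power tracks the cheapest detector rather than the full recognizer.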
In some embodiments, the method further comprises: in accordance with the determination that the audio signal comprises the predetermined trigger signal: providing, via the second one or more processors, an audio stream to the third one or more processors based on the audio signal; identifying an endpoint corresponding to the audio signal; and ceasing to provide the audio stream to the third one or more processors in response to identifying the endpoint.
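The endpoint-bounded streaming of this embodiment might look like the following sketch, where `frames`, `is_endpoint`, and `send` are assumed placeholders (e.g., a trailing-silence test for the endpoint detector).

```python
def stream_until_endpoint(frames, is_endpoint, send):
    """Forward audio frames to the ASR stage until an endpoint is identified."""
    for frame in frames:
        if is_endpoint(frame):
            break      # cease providing the audio stream
        send(frame)    # provide the audio stream based on the audio signal
```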
In some embodiments, the first one or more processors comprises an application-specific integrated circuit or a digital signal processor configured to determine whether the audio signal comprises the voice onset event.
In some embodiments, the second one or more processors comprises a digital signal processor or an application-specific integrated circuit.
In some embodiments, the third one or more processors comprises a general purpose processor.
In some embodiments, the first one or more processors and the second one or more processors are implemented on an application-specific integrated circuit.
In some embodiments, a head-wearable device comprises the first one or more processors and the second one or more processors, wherein an auxiliary unit comprises the third one or more processors, and wherein the auxiliary unit is external to the head-wearable device and configured to communicate with the head-wearable device.
In some embodiments, the predetermined trigger signal comprises a phrase.
In some embodiments, the method further comprises storing the audio signal in a buffer.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, acoustic echo cancellation based on the audio signal.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, beamforming based on the audio signal.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, noise reduction based on the audio signal.
In some embodiments, the method further comprises receiving, via a second microphone, the audio signal, wherein the determination whether the audio signal comprises the voice onset event is based on an output of the first microphone and further based on an output of the second microphone.
In some embodiments, the audio signal includes a signal generated by a voice source of a user of a device comprising the first microphone and further comprising the second microphone, and a first distance from the first microphone to the voice source and a second distance from the second microphone to the voice source are different.
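One way such unequal microphone placement can be exploited, as the claims suggest, is to treat a large inter-microphone level difference as evidence of the user's own near-field voice: the mouth is much closer to one microphone than the other, so the user's speech produces a larger level imbalance than a far-field source does. The sketch below is a heuristic illustration only; the RMS comparison and the 6 dB threshold are assumptions, not values from the disclosure.

```python
import numpy as np

def likely_user_voice(mic_near, mic_far, level_diff_db=6.0):
    """Heuristic self-voice check from two unequally placed microphones."""
    # Mean power of each microphone output (epsilon avoids divide-by-zero).
    p_near = np.mean(np.asarray(mic_near, dtype=float) ** 2) + 1e-12
    p_far = np.mean(np.asarray(mic_far, dtype=float) ** 2) + 1e-12
    # A near-field (user) source shows a larger level difference in dB.
    return 10.0 * np.log10(p_near / p_far) > level_diff_db
```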
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining an amount of voice activity in the audio signal.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining whether the amount of voice activity in the audio signal is greater than a voice activity threshold.
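A minimal sketch of comparing an amount of voice activity against a voice activity threshold over a sliding window follows; the window length and threshold are illustrative tuning parameters, not values from the disclosure.

```python
from collections import deque

class OnsetDetector:
    """Declare a voice onset when enough recent frames contain voice."""

    def __init__(self, window=50, threshold=0.7):
        self.flags = deque(maxlen=window)  # per-frame voice-activity flags
        self.threshold = threshold

    def update(self, frame_has_voice: bool) -> bool:
        self.flags.append(frame_has_voice)
        if len(self.flags) < self.flags.maxlen:
            return False  # not enough history to judge yet
        # Amount of voice activity = fraction of recent frames flagged as voice.
        activity = sum(self.flags) / len(self.flags)
        return activity > self.threshold  # onset event?
```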
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining a first probability of voice activity with respect to the audio signal.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises: determining a second probability of voice activity with respect to the audio signal; and determining a combined probability of voice activity based on the first probability of voice activity and further based on the second probability of voice activity.
In some embodiments, the first probability is associated with a single channel signal of the audio signal, and the second probability is associated with a beamforming signal of the audio signal.
In some embodiments, determining the first probability of voice activity with respect to the audio signal comprises computing a ratio associated with the audio signal and calculating the first probability by inputting the ratio to a voice activity probability function.
In some embodiments, the voice activity probability function is tuned using at least one of user input, voice activity statistics, neural network input, and machine learning.
In some embodiments, the audio signal includes a single channel signal, and the ratio is an input power to noise power ratio.
In some embodiments, the audio signal includes a beamforming signal, and the ratio is a ratio of a summation signal of the audio signal and a normalized difference signal of the audio signal.
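Putting the preceding embodiments together, the sketch below computes a single-channel input-power-to-noise-power ratio and a beamforming summation-to-normalized-difference ratio, maps each through a tunable voice activity probability function, and combines the two probabilities. The logistic form of the probability function, the peak normalization of the difference signal, and the max-combination rule are all assumptions made for illustration; the disclosure does not prescribe them.

```python
import numpy as np

def voice_activity_probability(ratio, k=1.0, x0=3.0):
    """Map a ratio to a probability via a tunable logistic function (assumed shape)."""
    return 1.0 / (1.0 + np.exp(-k * (ratio - x0)))

def combined_voice_probability(single_ch, mic_a, mic_b, noise_power):
    # Single-channel ratio: input power to noise power.
    input_power = np.mean(np.asarray(single_ch, dtype=float) ** 2)
    p_single = voice_activity_probability(input_power / (noise_power + 1e-12))

    # Beamforming ratio: summation signal over a normalized difference signal.
    summation = np.asarray(mic_a, dtype=float) + np.asarray(mic_b, dtype=float)
    difference = np.asarray(mic_a, dtype=float) - np.asarray(mic_b, dtype=float)
    norm_diff = difference / (np.abs(difference).max() + 1e-12)  # assumed normalization
    ratio_bf = np.mean(summation ** 2) / (np.mean(norm_diff ** 2) + 1e-12)
    p_beam = voice_activity_probability(ratio_bf)

    # One simple way to combine the two estimates into a combined probability.
    return max(p_single, p_beam)
```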
In some embodiments, in accordance with the determination that the audio signal comprises the voice onset event, the method further comprises evaluating an amount of voice activity prior to a time associated with the determination that the audio signal comprises the voice onset event.
In some embodiments, in accordance with the determination that the audio signal comprises the voice onset event, the method further comprises pausing determining whether a second audio signal comprises a voice onset event for an amount of time.
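A simple hold-off gate of the kind this embodiment describes might be sketched as follows; the 0.5 s pause is an illustrative value for the "amount of time."

```python
import time

class OnsetGate:
    """Suppress repeated onset determinations for a hold-off period."""

    def __init__(self, holdoff_s=0.5):
        self.holdoff_s = holdoff_s
        self.paused_until = 0.0

    def report_onset(self, now=None):
        # After an onset, pause further onset determinations for holdoff_s.
        now = time.monotonic() if now is None else now
        self.paused_until = now + self.holdoff_s

    def detection_enabled(self, now=None):
        now = time.monotonic() if now is None else now
        return now >= self.paused_until
```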
In some embodiments, a system comprises: a first microphone; one or more processors configured to execute a method comprising: receiving, via the first microphone, an audio signal; determining, via a first one or more processors, whether the audio signal comprises a voice onset event; in accordance with a determination that the audio signal comprises the voice onset event: waking a second one or more processors; determining, via the second one or more processors, whether the audio signal comprises a predetermined trigger signal; in accordance with a determination that the audio signal comprises the predetermined trigger signal: waking a third one or more processors; performing, via the third one or more processors, automatic speech recognition based on the audio signal; and in accordance with a determination that the audio signal does not comprise the predetermined trigger signal: forgoing waking the third one or more processors; and in accordance with a determination that the audio signal does not comprise the voice onset event: forgoing waking the second one or more processors.
In some embodiments, the method further comprises: in accordance with the determination that the audio signal comprises the predetermined trigger signal: providing, via the second one or more processors, an audio stream to the third one or more processors based on the audio signal; identifying an endpoint corresponding to the audio signal; and ceasing to provide the audio stream to the third one or more processors in response to identifying the endpoint.
In some embodiments, the first one or more processors comprises an application-specific integrated circuit or a digital signal processor configured to determine whether the audio signal comprises the voice onset event.
In some embodiments, the second one or more processors comprises a digital signal processor or an application-specific integrated circuit.
In some embodiments, the third one or more processors comprises a general purpose processor.
In some embodiments, the first one or more processors and the second one or more processors are implemented on an application-specific integrated circuit.
In some embodiments, the system further comprises: a head-wearable device comprising the first one or more processors and the second one or more processors, and an auxiliary unit comprising the third one or more processors, wherein the auxiliary unit is external to the head-wearable device and configured to communicate with the head-wearable device.
In some embodiments, the predetermined trigger signal comprises a phrase.
In some embodiments, the method further comprises storing the audio signal in a buffer.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, acoustic echo cancellation based on the audio signal.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, beamforming based on the audio signal.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, noise reduction based on the audio signal.
In some embodiments, the method further comprises receiving, via a second microphone, the audio signal, wherein the determination whether the audio signal comprises the voice onset event is based on an output of the first microphone and further based on an output of the second microphone.
In some embodiments, the audio signal includes a signal generated by a voice source of a user of a device comprising the first microphone and further comprising the second microphone, and a first distance from the first microphone to the voice source and a second distance from the second microphone to the voice source are different.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining an amount of voice activity in the audio signal.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining whether the amount of voice activity in the audio signal is greater than a voice activity threshold.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining a first probability of voice activity with respect to the audio signal.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises: determining a second probability of voice activity with respect to the audio signal; and determining a combined probability of voice activity based on the first probability of voice activity and further based on the second probability of voice activity.
In some embodiments, the first probability is associated with a single channel signal of the audio signal, and the second probability is associated with a beamforming signal of the audio signal.
In some embodiments, determining the first probability of voice activity with respect to the audio signal comprises computing a ratio associated with the audio signal and calculating the first probability by inputting the ratio to a voice activity probability function.
In some embodiments, the voice activity probability function is tuned using at least one of user input, voice activity statistics, neural network input, and machine learning.
In some embodiments, the audio signal includes a single channel signal, and the ratio is an input power to noise power ratio.
In some embodiments, the audio signal includes a beamforming signal, and the ratio is a ratio of a summation signal of the audio signal and a normalized difference signal of the audio signal.
In some embodiments, in accordance with the determination that the audio signal comprises the voice onset event, the method further comprises evaluating an amount of voice activity prior to a time associated with the determination that the audio signal comprises the voice onset event.
In some embodiments, in accordance with the determination that the audio signal comprises the voice onset event, the method further comprises pausing determining whether a second audio signal comprises a voice onset event for an amount of time.
In some embodiments, a non-transitory computer-readable medium stores one or more instructions, which, when executed by one or more processors of an electronic device, cause the device to perform a method comprising: receiving, via a first microphone, an audio signal; determining, via a first one or more processors, whether the audio signal comprises a voice onset event; in accordance with a determination that the audio signal comprises the voice onset event: waking a second one or more processors; determining, via the second one or more processors, whether the audio signal comprises a predetermined trigger signal; in accordance with a determination that the audio signal comprises the predetermined trigger signal: waking a third one or more processors; performing, via the third one or more processors, automatic speech recognition based on the audio signal; and in accordance with a determination that the audio signal does not comprise the predetermined trigger signal: forgoing waking the third one or more processors; and in accordance with a determination that the audio signal does not comprise the voice onset event: forgoing waking the second one or more processors.
In some embodiments, the method further comprises: in accordance with the determination that the audio signal comprises the predetermined trigger signal: providing, via the second one or more processors, an audio stream to the third one or more processors based on the audio signal; identifying an endpoint corresponding to the audio signal; and ceasing to provide the audio stream to the third one or more processors in response to identifying the endpoint.
In some embodiments, the first one or more processors comprises an application-specific integrated circuit or a digital signal processor configured to determine whether the audio signal comprises the voice onset event.
In some embodiments, the second one or more processors comprises a digital signal processor or an application-specific integrated circuit.
In some embodiments, the third one or more processors comprises a general purpose processor.
In some embodiments, the first one or more processors and the second one or more processors are implemented on an application-specific integrated circuit.
In some embodiments, a head-wearable device comprises the first one or more processors and the second one or more processors, wherein an auxiliary unit comprises the third one or more processors, and wherein the auxiliary unit is external to the head-wearable device and configured to communicate with the head-wearable device.
In some embodiments, the predetermined trigger signal comprises a phrase.
In some embodiments, the method further comprises storing the audio signal in a buffer.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, acoustic echo cancellation based on the audio signal.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, beamforming based on the audio signal.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, noise reduction based on the audio signal.
In some embodiments, the method further comprises receiving, via a second microphone, the audio signal, wherein the determination whether the audio signal comprises the voice onset event is based on an output of the first microphone and further based on an output of the second microphone.
In some embodiments, the audio signal includes a signal generated by a voice source of a user of a device comprising the first microphone and further comprising the second microphone, and a first distance from the first microphone to the voice source and a second distance from the second microphone to the voice source are different.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining an amount of voice activity in the audio signal.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining whether the amount of voice activity in the audio signal is greater than a voice activity threshold.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining a first probability of voice activity with respect to the audio signal.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises: determining a second probability of voice activity with respect to the audio signal; and determining a combined probability of voice activity based on the first probability of voice activity and further based on the second probability of voice activity.
In some embodiments, the first probability is associated with a single channel signal of the audio signal, and the second probability is associated with a beamforming signal of the audio signal.
In some embodiments, determining the first probability of voice activity with respect to the audio signal comprises computing a ratio associated with the audio signal and calculating the first probability by inputting the ratio to a voice activity probability function.
In some embodiments, the voice activity probability function is tuned using at least one of user input, voice activity statistics, neural network input, and machine learning.
In some embodiments, the audio signal includes a single channel signal, and the ratio is an input power to noise power ratio.
In some embodiments, the audio signal includes a beamforming signal, and the ratio is a ratio of a summation signal of the audio signal and a normalized difference signal of the audio signal.
In some embodiments, in accordance with the determination that the audio signal comprises the voice onset event, the method further comprises evaluating an amount of voice activity prior to a time associated with the determination that the audio signal comprises the voice onset event.
In some embodiments, in accordance with the determination that the audio signal comprises the voice onset event, the method further comprises pausing determining whether a second audio signal comprises a voice onset event for an amount of time.
BRIEF DESCRIPTION OF THE DRAWINGS
FIGS. 1A-1B illustrate example wearable systems according to some embodiments of the disclosure.
FIG. 2 illustrates an example handheld controller that can be used in conjunction with an example wearable system according to some embodiments of the disclosure.
FIG. 3 illustrates an example auxiliary unit that can be used in conjunction with an example wearable system according to some embodiments of the disclosure.
FIGS. 4A-4B illustrate example functional block diagrams for an example wearable system according to some embodiments of the disclosure.
FIG. 5 illustrates a flow chart of an example system for determining an onset of voice activity according to some embodiments of the disclosure.
FIGS. 6A-6C illustrate examples of processing input audio signals according to some embodiments of the disclosure.
FIGS. 7A-7E illustrate examples of processing input audio signals according to some embodiments of the disclosure.
FIG. 8 illustrates an example of determining an onset of voice activity according to some embodiments of the disclosure.
FIG. 9 illustrates an example of determining an onset of voice activity according to some embodiments of the disclosure.
FIG. 10 illustrates an example MR system, according to some embodiments of the disclosure.
FIGS. 11A-11C illustrate example signal processing steps, according to some embodiments of the disclosure.
FIG. 12 illustrates an example MR computing architecture, according to some embodiments of the disclosure.
Publication Number: 20250168567
Publication Date: 2025-05-22
Assignee: Magic Leap
Abstract
Disclosed herein are systems and methods for processing speech signals in mixed reality applications. A method may include receiving an audio signal; determining, via first processors, whether the audio signal comprises a voice onset event; in accordance with a determination that the audio signal comprises the voice onset event: waking a second one or more processors; determining, via the second processors, that the audio signal comprises a predetermined trigger signal; in accordance with a determination that the audio signal comprises the predetermined trigger signal: waking third processors; performing, via the third processors, automatic speech recognition based on the audio signal; and in accordance with a determination that the audio signal does not comprise the predetermined trigger signal: forgoing waking the third processors; and in accordance with a determination that the audio signal does not comprise the voice onset event: forgoing waking the second processors.
Claims
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a Continuation of U.S. Non-Provisional application Ser. No. 18/418,131, filed Jan. 19, 2024, which is a Continuation of U.S. Non-Provisional application Ser. No. 17/214,446, filed Mar. 26, 2021, which claims benefit of U.S. Provisional Application No. 63/001,116, filed Mar. 27, 2020, and U.S. Provisional Application No. 63/033,451, filed Jun. 2, 2020, the contents of which are incorporated herein by reference in their entirety.
FIELD
This disclosure relates in general to systems and methods for processing speech signals, and in particular to systems and methods for processing speech signals in a mixed reality environment.
BACKGROUND
Systems for speech recognition are tasked with receiving audio input representing human speech, typically via one or more microphones, and processing the audio input to determine words, logical structures, or other outputs corresponding to that audio input. For example, automatic speech recognition (ASR) systems may generate a text output based on the human speech corresponding to an audio input signal; and natural language processing (NLP) tools may generate logical structures, or computer data, corresponding to the meaning of that human speech. While such systems may contain any number of components, at the heart of such systems is a speech processing engine, which is a component that accepts an audio signal as input, performs some recognition logic on the input, and outputs some text corresponding to that input. (While reference is made herein to speech processing engines, other forms of speech processing besides speech recognition should also be considered within the scope of the disclosure.)
Historically, audio input was provided to speech processing engines in a structured, predictable manner. For example, a user might speak directly into a microphone of a desktop computer in response to a first prompt (e.g., “Begin Speaking Now”); immediately after pressing a first button input (e.g., a “start” or “record” button, or a microphone icon in a software interface); or after a significant period of silence. Similarly, a user might stop providing microphone input in response to a second prompt (e.g., “Stop Speaking”); immediately before pressing a second button input (e.g., a “stop” or “pause” button); or by remaining silent for a period of time. Such structured input sequences left little doubt as to when the user was providing input to a speech processing engine (e.g., between a first prompt and a second prompt, or between pressing a start button and pressing a stop button). Moreover, because such systems typically required deliberate action on the part of the user, it could generally be assumed that a user's speech input was directed to the speech processing engine, and not to some other listener (e.g., a person in an adjacent room). Accordingly, many speech processing engines of the time may not have had any particular need to identify, from microphone input, which portions of the input were directed to the speech processing engine and were intended to provide speech recognition input, and conversely, which portions were not.
The ways in which users provide speech recognition input has changed as speech processing engines have become more pervasive and more fully integrated into users' everyday lives. For example, some automated voice assistants are now housed in or otherwise integrated with household appliances, automotive dashboards, smart phones, wearable devices, “living room” devices (e.g., devices with integrated “smart” voice assistants), and other environments far removed from the conventional desktop computer. In many cases, speech processing engines are made more broadly usable by this level of integration into everyday life. However, these systems would be made cumbersome by system prompts, button inputs, and other conventional mechanisms for demarcating microphone input to the speech processing engine. Instead, some such systems place one or more microphones in an “always on” state, in which the microphones listen for a “wake-up word” (e.g., the “name” of the device or any other predetermined word or phrase) that denotes the beginning of a speech recognition input sequence. Upon detecting the wake-up word, the speech processing engine can process the following sequence of microphone input as input to the speech processing engine.
While the wake-up word system replaces the need for discrete prompts or button inputs for speech processing engines, it can be desirable to minimize the amount of time the wake-up word system is required to be active. For example, mobile devices operating on battery power benefit from both power efficiency and the ability to invoke a speech processing engine (e.g., invoking a smart voice assistant via a wake-up word). For mobile devices, constantly running the wake-up word system to detect the wake-up word may undesirably reduce the device's power efficiency. Ambient noises or speech other than the wake-up word may be continually processed and transcribed, thereby continually consuming power. However, processing and transcribing ambient noises or speech other than the wake-up word may not justify the required power consumption. It therefore can be desirable to minimize the amount of time the wake-up word system is required to be active without compromising the device's ability to invoke a speech processing engine.
In addition to reducing power consumption, it is also desirable to improve the accuracy of speech recognition systems. For example, a user who wishes to invoke a smart voice assistant may become frustrated if the smart voice assistant does not accurately respond to the wake-up word. The smart voice assistant may respond to an acoustic event that is not the wake-up word (i.e., false positives), the assistant may fail to respond to the wake-up word (i.e., false negatives), or the assistant may respond too slowly to the wake-up word (i.e., lag). Inaccurate responses to the wake-up word like the above examples may frustrate the user, leading to a degraded user experience. The user may further lose trust in the reliability of the product's speech processing engine interface. It therefore can be desirable to develop a speech recognition system that accurately responds to user input.
BRIEF SUMMARY
Examples of the disclosure describe systems and methods for processing speech signals in mixed reality applications. According to examples of the disclosure, a method may include receiving, via a microphone, an audio signal; determining, via a first one or more processors, whether the audio signal comprises a voice onset event; in accordance with a determination that the audio signal comprises the voice onset event: waking a second one or more processors; determining, via the second one or more processors, that the audio signal comprises a predetermined trigger signal; in accordance with a determination that the audio signal comprises the predetermined trigger signal: waking a third one or more processors; performing, via the third one or more processors, automatic speech recognition based on the audio signal; and in accordance with a determination that the audio signal does not comprise the predetermined trigger signal: forgoing waking the third one or more processors; and in accordance with a determination that the audio signal does not comprise the voice onset event: forgoing waking the second one or more processors.
In some embodiments, a method comprises: receiving, via a first microphone, an audio signal; determining, via a first one or more processors, whether the audio signal comprises a voice onset event; in accordance with a determination that the audio signal comprises the voice onset event: waking a second one or more processors; determining, via the second one or more processors, whether the audio signal comprises a predetermined trigger signal; in accordance with a determination that the audio signal comprises the predetermined trigger signal: waking a third one or more processors; performing, via the third one or more processors, automatic speech recognition based on the audio signal; and in accordance with a determination that the audio signal does not comprise the predetermined trigger signal: forgoing waking the third one or more processors; and in accordance with a determination that the audio signal does not comprise the voice onset event: forgoing waking the second one or more processors.
In some embodiments, the method further comprises: in accordance with the determination that the audio signal comprises the predetermined trigger signal: providing, via the second one or more processors, an audio stream to the third one or more processors based on the audio signal; identifying an endpoint corresponding to the audio signal; and ceasing to provide the audio stream to the third one or more processors in response to identifying the endpoint.
In some embodiments, the first one or more processors comprises an application-specific integrated circuit or a digital signal processor configured to determine whether the audio signal comprises the voice onset event.
In some embodiments, the second one or more processors comprises a digital signal processor or an application-specific integrated circuit.
In some embodiments, the third one or more processors comprises a general purpose processor.
In some embodiments, the first one or more processors and the second one or more processors are implemented on an application-specific integrated circuit.
In some embodiments, a head-wearable device comprises the first one or more processors and the second one or more processors, wherein an auxiliary unit comprises the third one or more processors, and wherein the auxiliary unit is external to the head-wearable device and configured to communicate with the head-wearable device.
In some embodiments, the predetermined trigger signal comprises a phrase.
In some embodiments, the method further comprises storing the audio signal in a buffer.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, acoustic echo cancellation based on the audio signal.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, beamforming based on the audio signal.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, noise reduction based on the audio signal.
In some embodiments, the method further comprises receiving, via a second microphone, the audio signal, wherein the determination whether the audio signal comprises the voice onset event is based on an output of the first microphone and further based on an output of the second microphone.
In some embodiments, the audio signal includes a signal generated by a voice source of a user of a device comprising the first microphone and further comprising the second microphone, and a first distance from the first microphone to the voice source and a second distance from the second microphone to the voice source are different.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining an amount of voice activity in the audio signal.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining whether the amount of voice activity in the audio signal is greater than a voice activity threshold.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining a first probability of voice activity with respect to the audio signal.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises: determining a second probability of voice activity with respect to the audio signal; and determining a combined probability of voice activity based on the first probability of voice activity and further based on the second probability of voice activity.
In some embodiments, the first probability is associated with a single channel signal of the audio signal, and the second probability is associated with a beamforming signal of the audio signal.
In some embodiments, determining the first probability of voice activity with respect to the audio signal comprises computing a ratio associated with the audio signal and calculating the first probability by inputting the ratio to a voice activity probability function.
In some embodiments, the voice activity probability function is tuned using at least one of user input, voice activity statistics, neural network input, and machine learning.
In some embodiments, the audio signal includes a single channel signal, and the ratio is an input power to noise power ratio.
In some embodiments, the audio signal includes a beamforming signal, and the ratio is a ratio of a summation signal of the audio signal and a normalized different signal of the audio signal.
In some embodiments, in accordance with the determination that the audio signal comprises the voice onset event, the method further comprises evaluating an amount of voice activity prior to a time associated with the determination that the audio signal comprises the voice onset event.
In some embodiments, in accordance with the determination that the audio signal comprises the voice onset event, the method further comprises pausing determining whether a second audio signal comprises a voice onset event for an amount of time.
In some embodiments, a system comprises: a first microphone; one or more processors configured to execute a method comprising: receiving, via the first microphone, an audio signal; determining, via a first one or more processors, whether the audio signal comprises a voice onset event; in accordance with a determination that the audio signal comprises the voice onset event: waking a second one or more processors; determining, via the second one or more processors, whether the audio signal comprises a predetermined trigger signal; in accordance with a determination that the audio signal comprises the predetermined trigger signal: waking a third one or more processors; performing, via the third one or more processors, automatic speech recognition based on the audio signal; and in accordance with a determination that the audio signal does not comprise the predetermined trigger signal: forgoing waking the third one or more processors; and in accordance with a determination that the audio signal does not comprise the voice onset event: forgoing waking the second one or more processors.
In some embodiments, the method further comprises: in accordance with the determination that the audio signal comprises the predetermined trigger signal: providing, via the second one or more processors, an audio stream to the third one or more processors based on the audio signal; identifying an endpoint corresponding to the audio signal; and ceasing to provide the audio stream to the third one or more processors in response to identifying the endpoint.
In some embodiments, the first one or more processors comprises an application-specific integrated circuit or a digital signal processor configured to determine whether the audio signal comprises the voice onset event.
In some embodiments, the second one or more processors comprises a digital signal processor or an application-specific integrated circuit.
In some embodiments, the third one or more processors comprises a general purpose processor.
In some embodiments, the first one or more processors and the second one or more processors are implemented on an application-specific integrated circuit.
In some embodiments, the system further comprises: a head-wearable device comprising the first one or more processors and the second one or more processors, and an auxiliary unit comprising the third one or more processors, wherein the auxiliary unit is external to the head-wearable device and configured to communicate with the head-wearable device.
In some embodiments, the predetermined trigger signal comprises a phrase.
In some embodiments, the method further comprises storing the audio signal in a buffer.
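A circular (ring) buffer is one common way to store the audio signal so that, when a later stage wakes, it can read samples captured before the wake decision; the sketch below assumes single-channel float samples.

```python
import numpy as np

class PreRollBuffer:
    """Fixed-capacity circular buffer of recent audio samples."""

    def __init__(self, capacity_samples):
        self.buf = np.zeros(capacity_samples, dtype=np.float32)
        self.write = 0
        self.filled = 0

    def push(self, samples):
        for s in np.asarray(samples, dtype=np.float32):
            self.buf[self.write] = s
            self.write = (self.write + 1) % len(self.buf)
            self.filled = min(self.filled + 1, len(self.buf))

    def read_all(self):
        """Return the buffered samples in chronological order."""
        if self.filled < len(self.buf):
            return self.buf[:self.write].copy()
        return np.concatenate((self.buf[self.write:], self.buf[:self.write]))
```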
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, acoustic echo cancellation based on the audio signal.
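Acoustic echo cancellation is commonly realized with a normalized-LMS adaptive filter; the sketch below is one such realization and is not taken from the disclosure. It assumes the microphone signal and the far-end reference (e.g., the device's speaker feed) are equal-length, time-aligned arrays.

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=256, mu=0.5, eps=1e-6):
    """Estimate the echo of 'ref' present in 'mic' with an adaptive FIR
    filter and subtract it, returning the echo-reduced signal."""
    w = np.zeros(taps)          # adaptive filter coefficients
    x = np.zeros(taps)          # most recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = np.roll(x, 1)
        x[0] = ref[n]
        e = mic[n] - np.dot(w, x)               # error = mic - estimated echo
        w += mu * e * x / (np.dot(x, x) + eps)  # NLMS coefficient update
        out[n] = e
    return out
```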
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, beamforming based on the audio signal.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, noise reduction based on the audio signal.
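The disclosure does not fix a particular noise-reduction method; spectral subtraction is one simple stand-in, sketched per frame below (noise_mag is an estimated noise magnitude spectrum of matching length, and overlap-add resynthesis is omitted for brevity).

```python
import numpy as np

def spectral_subtraction(frame, noise_mag, floor=0.05):
    """Subtract an estimated noise magnitude spectrum from one windowed
    frame, apply a spectral floor to limit musical noise, and resynthesize
    with the original phase."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))
```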
In some embodiments, the method further comprises receiving, via a second microphone, the audio signal, wherein the determination whether the audio signal comprises the voice onset event is based on an output of the first microphone and further based on an output of the second microphone.
In some embodiments, the audio signal includes a signal generated by a voice source of a user of a device comprising the first microphone and further comprising the second microphone, and a first distance from the first microphone to the voice source and a second distance from the second microphone to the voice source are different.
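Because the two microphones sit at different distances from the user's mouth, the user's voice adds coherently in the summation signal while far-field noise remains prominent in the difference signal; the power ratio of the two is the beamforming statistic referred to above. The sketch below omits the time alignment and normalization a real implementation would apply.

```python
import numpy as np

def beamform_ratio_db(mic_a, mic_b, eps=1e-12):
    """Power ratio (dB) of the summation signal to the difference signal
    of two microphone frames; alignment/normalization omitted."""
    p_sum = np.mean((mic_a + mic_b) ** 2)
    p_diff = np.mean((mic_a - mic_b) ** 2)
    return 10.0 * np.log10((p_sum + eps) / (p_diff + eps))
```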
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining an amount of voice activity in the audio signal.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining whether the amount of voice activity in the audio signal is greater than a voice activity threshold.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining a first probability of voice activity with respect to the audio signal.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises: determining a second probability of voice activity with respect to the audio signal; and determining a combined probability of voice activity based on the first probability of voice activity and further based on the second probability of voice activity.
In some embodiments, the first probability is associated with a single channel signal of the audio signal, and the second probability is associated with a beamforming signal of the audio signal.
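How the single-channel and beamforming probabilities are combined is left open above; a convex combination is one trivial possibility, with the weight w an assumption of this sketch (max, product, or a learned combiner would fit the same interface).

```python
def combined_probability(p_single, p_beam, w=0.5):
    """Fuse the single-channel and beamforming voice-activity
    probabilities; w is a tunable, illustrative weight."""
    return w * p_single + (1.0 - w) * p_beam
```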
In some embodiments, determining the first probability of voice activity with respect to the audio signal comprises computing a ratio associated with the audio signal and calculating the first probability by inputting the ratio to a voice activity probability function.
In some embodiments, the voice activity probability function is tuned using at least one of user input, voice activity statistics, neural network input, and machine learning.
In some embodiments, the audio signal includes a single channel signal, and the ratio is an input power to noise power ratio.
In some embodiments, the audio signal includes a beamforming signal, and the ratio is a ratio of a summation signal of the audio signal and a normalized difference signal of the audio signal.
In some embodiments, in accordance with the determination that the audio signal comprises the voice onset event, the method further comprises evaluating an amount of voice activity prior to a time associated with the determination that the audio signal comprises the voice onset event.
In some embodiments, in accordance with the determination that the audio signal comprises the voice onset event, the method further comprises pausing determining whether a second audio signal comprises a voice onset event for an amount of time.
In some embodiments, a non-transitory computer-readable medium stores one or more instructions, which, when executed by one or more processors of an electronic device, cause the device to perform a method comprising: receiving, via a first microphone, an audio signal; determining, via a first one or more processors, whether the audio signal comprises a voice onset event; in accordance with a determination that the audio signal comprises the voice onset event: waking a second one or more processors; determining, via the second one or more processors, whether the audio signal comprises a predetermined trigger signal; in accordance with a determination that the audio signal comprises the predetermined trigger signal: waking a third one or more processors; performing, via the third one or more processors, automatic speech recognition based on the audio signal; and in accordance with a determination that the audio signal does not comprise the predetermined trigger signal: forgoing waking the third one or more processors; and in accordance with a determination that the audio signal does not comprise the voice onset event: forgoing waking the second one or more processors.
In some embodiments, the method further comprises: in accordance with the determination that the audio signal comprises the predetermined trigger signal: providing, via the second one or more processors, an audio stream to the third one or more processors based on the audio signal; identifying an endpoint corresponding to the audio signal; and ceasing to provide the audio stream to the third one or more processors in response to identifying the endpoint.
In some embodiments, the first one or more processors comprises an application-specific integrated circuit or a digital signal processor configured to determine whether the audio signal comprises the voice onset event.
In some embodiments, the second one or more processors comprises a digital signal processor or an application-specific integrated circuit.
In some embodiments, the third one or more processors comprises a general purpose processor.
In some embodiments, the first one or more processors and the second one or more processors are implemented on an application-specific integrated circuit.
In some embodiments, a head-wearable device comprises the first one or more processors and the second one or more processors, wherein an auxiliary unit comprises the third one or more processors, and wherein the auxiliary unit is external to the head-wearable device and configured to communicate with the head-wearable device.
In some embodiments, the predetermined trigger signal comprises a phrase.
In some embodiments, the method further comprises storing the audio signal in a buffer.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, acoustic echo cancellation based on the audio signal.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, beamforming based on the audio signal.
In some embodiments, the method further comprises: in accordance with a determination that the audio signal comprises the voice onset event: performing, via the second one or more processors, noise reduction based on the audio signal.
In some embodiments, the method further comprises receiving, via a second microphone, the audio signal, wherein the determination whether the audio signal comprises the voice onset event is based on an output of the first microphone and further based on an output of the second microphone.
In some embodiments, the audio signal includes a signal generated by a voice source of a user of a device comprising the first microphone and further comprising the second microphone, and a first distance from the first microphone to the voice source and a second distance from the second microphone to the voice source are different.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining an amount of voice activity in the audio signal.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining whether the amount of voice activity in the audio signal is greater than a voice activity threshold.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises determining a first probability of voice activity with respect to the audio signal.
In some embodiments, the determining whether the audio signal comprises the voice onset event further comprises: determining a second probability of voice activity with respect to the audio signal; and determining a combined probability of voice activity based on the first probability of voice activity and further based on the second probability of voice activity.
In some embodiments, the first probability is associated with a single channel signal of the audio signal, and the second probability is associated with a beamforming signal of the audio signal.
In some embodiments, determining the first probability of voice activity with respect to the audio signal comprises computing a ratio associated with the audio signal and calculating the first probability by inputting the ratio to a voice activity probability function.
In some embodiments, the voice activity probability function is tuned using at least one of user input, voice activity statistics, neural network input, and machine learning.
In some embodiments, the audio signal includes a single channel signal, and the ratio is an input power to noise power ratio.
In some embodiments, the audio signal includes a beamforming signal, and the ratio is a ratio of a summation signal of the audio signal and a normalized difference signal of the audio signal.
In some embodiments, in accordance with the determination that the audio signal comprises the voice onset event, the method further comprises evaluating an amount of voice activity prior to a time associated with the determination that the audio signal comprises the voice onset event.
In some embodiments, in accordance with the determination that the audio signal comprises the voice onset event, the method further comprises pausing determining whether a second audio signal comprises a voice onset event for an amount of time.
BRIEF DESCRIPTION OF THE DRAWINGS
FIGS. 1A-1B illustrate example wearable systems according to some embodiments of the disclosure.
FIG. 2 illustrates an example handheld controller that can be used in conjunction with an example wearable system according to some embodiments of the disclosure.
FIG. 3 illustrates an example auxiliary unit that can be used in conjunction with an example wearable system according to some embodiments of the disclosure.
FIGS. 4A-4B illustrate example functional block diagrams for an example wearable system according to some embodiments of the disclosure.
FIG. 5 illustrates a flow chart of an example system for determining an onset of voice activity according to some embodiments of the disclosure.
FIGS. 6A-6C illustrate examples of processing input audio signals according to some embodiments of the disclosure.
FIGS. 7A-7E illustrate examples of processing input audio signals according to some embodiments of the disclosure.
FIG. 8 illustrates an example of determining an onset of voice activity according to some embodiments of the disclosure.
FIG. 9 illustrates an example of determining an onset of voice activity according to some embodiments of the disclosure.
FIG. 10 illustrates an example MR system, according to some embodiments of the disclosure.
FIGS. 11A-11C illustrate example signal processing steps, according to some embodiments of the disclosure.
FIG. 12 illustrates an example MR computing architecture, according to some embodiments of the disclosure.