Sony Patent | Alert system and method for virtual reality headset

Patent: Alert system and method for virtual reality headset


Publication Number: 20230019847

Publication Date: 2023-01-19

Assignee: Sony Interactive Entertainment Inc

Abstract

An alert method for a head mounted display includes: identifying the current user of the head mounted display, retrieving a speaker recognition profile for the current user, detecting audio using one or more microphones, estimating whether the detected audio comprises speech corresponding to that of the current user based on the retrieved speaker recognition profile, and if not, then relaying the detected audio comprising the speech to the current user of the head mounted display.

Claims

1. An alert method for a head mounted display, comprising the steps of: identifying the current user of the head mounted display; retrieving a speaker recognition profile for the current user; detecting audio using one or more microphones; estimating whether the detected audio comprises speech corresponding to that of the current user, based on the retrieved speaker recognition profile; and if not, relaying the detected audio comprising the speech to the current user of the head mounted display.

2. The alert method of claim 1, in which the step of identifying the current user comprises visual recognition of the user; and the step of retrieving a speaker recognition profile comprises retrieving a speaker recognition profile associated with that user.

3. The alert method of claim 1, in which the step of identifying the current user comprises obtaining a sample of the current user’s speech and identifying a speaker recognition profile that best matches the sample; and the step of retrieving a speaker recognition profile comprises keeping the speaker recognition profile identified as best matching the sample.

4. The alert method of claim 3, in which the step of obtaining a sample of the current user’s speech comprises one or more of: i. obtaining a sample of speech from a microphone that is proximate to the mouth of the current user of the HMD; ii. obtaining a sample of speech from a directional microphone or microphone array pointing substantially towards the mouth of the current user of the HMD; and iii. obtaining a sample of speech from a plurality of microphones, the samples of the speech from respective microphones having a pattern of relative delay characteristic of being spoken by the current user at the HMD.

5. The alert method of claim 1, comprising the step of: estimating whether the detected audio comprises speech corresponding to a different speaker, based upon one or more additional speaker recognition profiles; and if so, identifying the different speaker.

6. The method of claim 5, comprising the step of indicating the identity of the different speaker to the user of the head mounted display.

7. The method of claim 5, comprising the step of: comparing the identity of the different speaker with a list of muted speakers, and if the different speaker is listed as a muted speaker, then not relaying the detected audio comprising the speech to the current user of the head mounted display.

8. The method of claim 1, comprising the step of: estimating whether the detected audio comprises a predetermined key word or phrase; and if so, relaying the detected audio to the user of the mounted display at least for a predetermined period of time.

9. The method of claim 1, comprising the step of: estimating whether the detected audio comprises audio generated for content being presented to the head mounted display; and if so, not relaying the detected audio comprising the audio generated for the content to the user of the head mounted display.

10. The method of claim 9, in which: the step of estimating whether the detected audio comprises audio generated for the content comprises the steps of: retaining the audio generated for the content in a buffer for a predetermined period of time; and comparing the retained audio in the buffer with the detected audio to detect an offset match.

11. The method of claim 10, in which the step of not relaying the detected audio comprises subtracting the retained audio in the buffer, at an offset corresponding to the offset match, from the detected audio.

12. A non-transitory, computer readable storage medium containing a computer program comprising computer executable instructions, which when executed by a computer system, causes the computer system to perform an alert method for a head mounted display, comprising the steps of: identifying the current user of the head mounted display; retrieving a speaker recognition profile for the current user; detecting audio using one or more microphones; estimating whether the detected audio comprises speech corresponding to that of the current user, based on the retrieved speaker recognition profile; and if not, relaying the detected audio comprising the speech to the current user of the head mounted display.

13. An alert system for a head mounted display, comprising: a user identification processor configured to identify the current user of the head mounted display; a retrieval processor configured to retrieve a speaker recognition profile for the current user from storage; one or more microphones for detecting audio; an audio processor configured to estimate whether the detected audio comprises speech corresponding to that of the current user, based on the retrieved speaker recognition profile; and if not, to relay the detected audio comprising the speech to the current user of the head mounted display.

14. The alert system of claim 13 in which: the audio processor is configured to estimate whether the detected audio comprises speech corresponding to a different speaker, based upon one or more additional speaker recognition profiles, and if so, to identify the different speaker; and the audio processor being configured to compare the identity of the different speaker with a list of muted speakers, and if the different speaker is listed as a muted speaker, then to not relay the detected audio comprising the speech to the current user of the head mounted display.

15. The alert system of claim 13 in which: the audio processor is configured to estimate whether the detected audio comprises a predetermined key word or phrase, and if so, to relay the detected audio to the user of the mounted display at least for a predetermined period of time.

Description

BACKGROUND OF THE INVENTION

Field of the Invention

The present invention relates to an alert system and method for a virtual reality headset.

Description of the Prior Art

The “background” description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description which may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present invention.

A particular benefit of virtual reality is the ability of the user to immerse themselves within the experience provided. Typically this involves the user wearing a virtual reality headset otherwise known as a head mounted display, which provides a stereoscopic display of the virtual environment (typically generated by an entertainment device such as a videogame console, personal computer or the like, or by the head mounted display itself) in place of the user’s normal field-of-view. It also typically involves the user wearing headphones such as stereoscopic or binaural headphones to provide audio immersion that complements the video immersion provided by the display.

As a result the user can be to a large or complete degree shut off from the real world, at least with respect to sound and vision.

However, it may be desirable for the user to still have some situational awareness of the real world around them whilst using virtual reality in this manner.

The present invention seeks to address this need.

SUMMARY OF THE INVENTION

Various aspects and features of the present invention are defined in the appended claims and within the text of the accompanying description.

In a first aspect, an alert method for a head mounted display is provided in accordance with claim 1.

In another aspect, an alert system for a head mounted display is provided in accordance with claim 13.

It is to be understood that both the foregoing general description of the invention and the following detailed description are exemplary, but are not restrictive, of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

A more complete appreciation of the disclosure and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:

FIG. 1 is a schematic diagram of an alert system for a head mounted display in accordance with embodiments of the present description.

FIG. 2 is a flow diagram of an alert method for a head mounted display in accordance with embodiments of the present description.

DESCRIPTION OF THE EMBODIMENTS

An alert system and method for a virtual reality headset are disclosed. In the following description, a number of specific details are presented in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to a person skilled in the art that these specific details need not be employed to practice the present invention. Conversely, specific details known to the person skilled in the art are omitted for the purposes of clarity where appropriate.

Referring now to the drawings, wherein like reference numerals designate identical or corresponding parts throughout the several views, FIG. 1 shows an example of an entertainment system 10, which is a computer or console such as the Sony® PlayStation 5® (PS5).

In an example embodiment of the present description, the entertainment system 10 comprises a central processor 20. This may be a single or multi core processor, for example comprising eight cores as in the PS5. The entertainment system also comprises a graphical processing unit or GPU 30. The GPU can be physically separate to the CPU, or integrated with the CPU as a system on a chip (SoC) as in the PS5.

The entertainment device also comprises RAM 40, and may either have separate RAM for each of the CPU and GPU, or shared RAM as in the PS5. The or each RAM can be physically separate, or integrated as part of an SoC as in the PS5. Further storage is provided by a disk 50, either as an external or internal hard drive, or as an external solid state drive, or an internal solid state drive as in the PS5.

The entertainment device may transmit or receive data via one or more data ports 60, such as a USB port, Ethernet® port, WiFi® port, Bluetooth® port or similar, as appropriate. It may also optionally receive data via an optical drive 70.

Interaction with the system is typically provided using one or more handheld controllers 80, such as the DualSense® controller in the case of the PS5.

Audio/visual outputs from the entertainment device are typically provided through one or more A/V ports 90, or through one or more of the wired or wireless data ports 60.

Where components are not integrated, they may be connected as appropriate either by a dedicated data link or via a bus 100.

An example of a device for displaying images output by the entertainment system is a head mounted display ‘HMD’ 802, worn by a user 800.

As noted elsewhere herein, such an HMD typically provides a stereoscopic display to the user 800, typically from respective image display units (not shown) such as OLED or backlit LCD displays placed in front of the user’s eyes and typically modified by suitable optics (again not shown) interspersed between the displays and the user’s eyes, or by other suitable technologies such as light guides from a display source.

The HMD may incorporate headphones 810, which may be integral to the HMD (as shown in FIG. 1), or may be first party or third party headphones re-attachable to the HMD (for example via a headphone jack or USB port on the HMD providing suitable audio signals), or may be entirely separate first party or third party headphones independently wearable but during use receiving suitable audio signals for the virtual reality experience, typically from the entertainment device or the HMD, for example via Bluetooth or a wired connection.

In addition, the HMD 802, the handheld controller 80, entertainment system 10, and/or the headphones 810 may comprise one or more microphones 90 for detecting ambient audio signals.

The signals may then be relayed to an audio processor, typically either in the entertainment device 10 or the head mounted display 802, or having its function shared between these devices. For example, the audio processor may be included in the system on a chip comprising the CPU 20, or the CPU and/or GPU operating under suitable software instruction may implement some or all of this processor’s function.

The signals are typically relayed via a wired or wireless connection, such as a Bluetooth connection. In a case where the microphone is located in the device comprising the audio processor, the signal may be relayed via a bus, such as bus 100 in the entertainment device 10.

In an embodiment of the description, the audio processor implements a voice recognition scheme, or more specifically a speaker recognition scheme.

A typical voice or speaker recognition scheme associates formant structure within voiced speech with an individual. The formant structure is reflected in the spectral envelope of a user’s speech. Variations in formant structure are caused by resonances in different parts of the vocal cavity, and hence are characteristic of the relative shape and size of mouth, throat, and nasal cavities of a person. Typically within small groups of people this is sufficiently unique to identify an individual.

The formant structure, or more generally the spectral envelope or mean spectral envelope, can be detected for example by a Fourier transform of received audio signals comprising speech, optionally transformed into a relatively small number of wide frequency range bins, which serves to reduce computational complexity. Hence for example instead of dividing a frequency range for example from 0 to 8 kHz into 256 bins as may typically occur in a spectrogram, the range may be divided into 16 or 32 bins instead.
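The coarse binning described above can be illustrated with a minimal sketch. It assumes a 16 kHz sample rate (so the spectrum spans 0 to 8 kHz), a 256-sample frame, and a naive discrete Fourier transform written out by hand; the function name `band_energies` and all parameter values are illustrative and do not come from the description.

```python
import cmath
import math


def band_energies(frame, num_bands=16):
    """Average DFT magnitude in `num_bands` equal-width frequency bands.

    Only the bins below the Nyquist frequency are used. Any bins left over
    when the bin count is not a multiple of `num_bands` are ignored.
    """
    n = len(frame)
    half = n // 2
    width = half // num_bands
    if width == 0:
        raise ValueError("frame is too short for the requested number of bands")
    # Naive DFT: O(n^2), adequate for a short illustrative frame.
    magnitudes = []
    for k in range(half):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        magnitudes.append(abs(acc))
    return [
        sum(magnitudes[b * width:(b + 1) * width]) / width
        for b in range(num_bands)
    ]


# Example: a 1 kHz tone sampled at 16 kHz, so each of the 16 bands is 500 Hz wide.
sample_rate = 16000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
bands = band_energies(frame, num_bands=16)
loudest = max(range(len(bands)), key=bands.__getitem__)
```

In this example the tone falls in the band covering 1000 to 1500 Hz (index 2). A practical implementation would more likely use a windowed fast Fourier transform than the direct sum shown here.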

Optionally the resulting spectrum, or a more conventional narrower bin spectrum (for example if such a spectrum is available using existing hardware), can be subjected to a further Fourier transform to create a so-called cepstrum. Again variations in formant structure are reflected in such a cepstrum. An advantage of this second step is that the spectrum can be transformed into yet fewer bins of a cepstrum. The cepstrum typically also does not retain or reflect the fundamental voice frequency of the speech, thereby removing this source of variability from the recognition process.
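As a hedged illustration of this second transform, the sketch below takes a list of band energies and produces a small number of cepstral coefficients. Two assumptions go beyond the description: a logarithm is applied to the band energies first, as is conventional for cepstra, and a discrete cosine transform (DCT-II) stands in for the further Fourier transform. The function name `cepstrum_from_bands` is hypothetical.

```python
import math


def cepstrum_from_bands(band_energies, num_coeffs=8, floor=1e-10):
    """DCT-II of the log band energies, keeping the first `num_coeffs` terms."""
    # Floor the energies so that silent bands do not produce log(0).
    log_bands = [math.log(max(e, floor)) for e in band_energies]
    n = len(log_bands)
    return [
        sum(log_bands[b] * math.cos(math.pi * k * (b + 0.5) / n) for b in range(n))
        for k in range(num_coeffs)
    ]


# Example: the same spectral shape at two loudness levels.
quiet = [1.0, 4.0, 9.0, 4.0, 1.0, 0.5, 0.25, 0.125]
loud = [10.0 * e for e in quiet]
c_quiet = cepstrum_from_bands(quiet, num_coeffs=4)
c_loud = cepstrum_from_bands(loud, num_coeffs=4)
```

With the logarithm in place, a change in overall level only shifts coefficient 0, so `c_quiet[1:]` and `c_loud[1:]` agree. This is one reason the higher coefficients are commonly used as speaker features.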

Whilst a spectrum or cepstrum may be used, any other suitable variants may also be considered such as outputs from a filter bank instead of a Fourier transform spectrum, or use of a so-called mel-cepstrum (a perceptually weighted cepstrum), or augmenting the outputs of any of these with first or second derivatives of the outputs to model dynamic changes.

In any event, the resulting signal properties indicative of a user, such as those reflecting the mean spectral envelope, spectral envelope, and/or formant structure of a user can be detected and associated with that particular user. This may be done for example using a discriminator such as a hidden Markov model, neural network, or other machine learning system, or parametrically by determining means and variances for ratios between different formant peaks or corresponding features of the data.
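The parametric option mentioned above can be sketched as follows, assuming each utterance has already been reduced to a fixed-length feature vector (for example band energies or cepstral coefficients). The profile layout, the variance floor, the distance measure and the acceptance threshold are all illustrative assumptions rather than details given in the description.

```python
from statistics import mean, pvariance


def build_profile(vectors, var_floor=1e-6):
    """Per-dimension mean and variance over a speaker's enrolment vectors."""
    dims = list(zip(*vectors))
    return {
        "mean": [mean(d) for d in dims],
        "var": [max(pvariance(d), var_floor) for d in dims],
    }


def distance(profile, vector):
    """Mean squared deviation from the profile mean, scaled by the variance."""
    terms = [
        (x - m) ** 2 / v
        for x, m, v in zip(vector, profile["mean"], profile["var"])
    ]
    return sum(terms) / len(terms)


def best_match(profiles, vector, threshold=9.0):
    """Name of the closest profile, or None if nothing is close enough."""
    name, score = min(
        ((n, distance(p, vector)) for n, p in profiles.items()),
        key=lambda item: item[1],
    )
    return name if score <= threshold else None


# Example with two enrolled household members (made-up feature values).
profiles = {
    "alice": build_profile([[1.0, 2.0], [1.2, 2.2], [0.8, 1.8]]),
    "bob": build_profile([[5.0, 1.0], [5.2, 0.8], [4.8, 1.2]]),
}
known = best_match(profiles, [1.1, 2.1])
unknown = best_match(profiles, [3.0, 3.0])
```

Here `known` is `"alice"`, while the second vector is far from both profiles and is reported as `None`, which corresponds to an unrecognised speaker.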

Whilst speaker recognition schemes are more robust when a keyword or phrase is used, because it enables a more consistent example set of signal properties to be learned by the discriminator, this is not essential, particularly when only a handful of speakers need to be distinguished.

This would be the typical scenario within a shared household or family. Within a family in particular, there is likely to be a relatively large difference in user voice characteristics between members of the household, making distinguishing between users robust even when keywords or phrases are not used or uttered.

The audio processor can then identify the voice of the current wearer of the HMD, if the voice of the current wearer has previously been identified (e.g. learned by the discriminator).

The association of the identified voice with the current wearer of the HMD can be achieved in a number of ways, including audio only, a mix of audio and video, and video only.

Using audio only, it may be achieved using one or more microphones 90 mounted on the HMD and/or on the controller, which will be most proximate to the current user and/or optionally arranged to selectively receive utterances from the user, for example due to cardioid directionality of the microphone, or the use of beamforming via a microphone array, or simply proximity.

Alternatively or in addition, the timing of signals from one or more microphones on the HMD and/or the controller may be used to determine the relative distance of these microphones from the source of the voice; it will be appreciated that the relative distance to a user’s mouth between an HMD and a handheld controller will have a characteristic disparity in timing in normal use, with the HMD receiving the sound much earlier than the handheld controller due to their relative proximities. Meanwhile the voice of a separate person in the room is more likely to reach both microphones at substantially the same time, or in any event at relative times unlikely to replicate those characteristic of the wearer of the HMD.
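The timing comparison described above can be sketched with a plain cross-correlation between the HMD and controller microphone signals. The sketch assumes both signals are sampled on a common clock; the function names and the expected lag window are hypothetical, and a real window would depend on the sample rate and on how the controller is typically held.

```python
def best_lag(reference, other, max_lag):
    """Lag in samples at which `other` best aligns with `reference`.

    A positive result means the sound reached `other` later than `reference`.
    """
    def score(lag):
        return sum(
            reference[i] * other[i + lag]
            for i in range(len(reference))
            if 0 <= i + lag < len(other)
        )
    return max(range(-max_lag, max_lag + 1), key=score)


def looks_like_wearer(lag, min_lag, max_lag):
    """True if the HMD-to-controller delay falls inside the expected window."""
    return min_lag <= lag <= max_lag


# Example: the controller microphone hears the same burst 4 samples after the HMD.
hmd_mic = [0.0, 0.0, 1.0, 3.0, -2.0, 1.0] + [0.0] * 10
controller_mic = [0.0] * 4 + hmd_mic[:-4]
lag = best_lag(hmd_mic, controller_mic, max_lag=8)
from_wearer = looks_like_wearer(lag, min_lag=2, max_lag=6)
```

A voice arriving at both microphones at nearly the same time would give a lag close to zero and fall outside the window, matching the case of another person in the room.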

Using a mix of audio and video, then alternatively or in addition, a correlation between voiced utterances and an image of the user’s mouth moving may be used, for example based upon a camera associated with the entertainment device or the HMD for tracking the user whilst wearing the HMD.

Similarly alternatively or in addition, a correlation between the relative timing of signals received by one or more microphones (for example a stereo pair of microphones associated with a camera tracking the user) and the apparent position of the user within an image captured by that camera can be used to identify the individual as the wearer of the HMD.

Using video only, it will be appreciated that an image of a user may also be associated with the identified voice; consequently the user may be visually recognised by comparison with this reference image or suitable abstraction thereof (for example using eigenfaces), for example when putting the HMD on. This may be achieved by a camera associated with the entertainment device or the HMD.

In any event, the current wearer of the HMD may thus be identified and consequently also the voice of the current wearer of the HMD is known to the audio processor.

Consequently, the audio processor can implement an alert system for the user whereby any voice that is not the voice of the user currently wearing the HMD, which is picked up by one or more of the microphones 90, can be relayed to the user via the headphones 810.

Advantageously, this enables other people in the real world environment to still communicate with a user immersed in the virtual reality environment as necessary.

Further advantageously, optionally the user immersed in the virtual reality environment can selectively mute other individuals; for example if the user does not want to hear from a particular member of the household, they can indicate this via a user interface (for example pressing a particular button on the controller, or selecting the option from an interface displayed to them).

The user interface may for example comprise a pop-up dialogue box that appears as the audio is presented to the user, allowing the user for example to select to pause the game, mute the individual being heard, turn off the HMD, and/or any other suitable options.

Subsequently when that same member of the household talks again, their voice is not relayed to the current user of the HMD, for example for a predetermined period of time or until the current virtual reality session ends.

Alternatively or in addition, rather than allowing people to talk to the current user by default and then adding people to a temporary blacklist, the system may not allow people to talk to the current user by default, and people can then be added to (or removed from) a white list, optionally permanently or on a per session basis.

In any event, optionally this muting function can be overridden by an emergency word or phrase that is relayed to the current user of the HMD regardless of who says it. For example after the emergency word or phrase has been uttered, any ambient sound, and/or any speech, detected by the or each microphone is relayed to the user for a predetermined period of time.

As noted above, the system can relay speech to the user of the HMD, and optionally can selectively mute or allow speech from certain individuals.

Typically this comprises buffering detected speech and analysing it to determine whether the voice is recognised; if so then the identity of the speaker is compared with the list of speakers who are currently muted or allowed, and the buffered speech is then relayed to the user or not accordingly.
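The muting, white-listing and emergency override behaviour can be summarised in a small decision object. This is a sketch under stated assumptions: speakers are represented by name strings, an unrecognised speaker is `None`, time is passed in as seconds, and the class name `RelayPolicy`, its fields and the 30 second override period are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RelayPolicy:
    current_user: str
    muted: set = field(default_factory=set)
    allowed: set = field(default_factory=set)
    allow_by_default: bool = True      # False selects white-list behaviour
    override_seconds: float = 30.0     # illustrative override period
    _override_until: float = float("-inf")

    def trigger_override(self, now):
        """Call when the emergency word or phrase has been detected."""
        self._override_until = now + self.override_seconds

    def should_relay(self, speaker, now):
        """Decide whether buffered speech from `speaker` is passed through."""
        if now < self._override_until:
            return True                # override relays everything for a while
        if speaker == self.current_user:
            return False               # never relay the wearer's own voice
        if self.allow_by_default:
            return speaker not in self.muted
        return speaker in self.allowed


# Example: user A is wearing the HMD and has muted speaker C.
policy = RelayPolicy(current_user="A", muted={"C"})
relay_b = policy.should_relay("B", now=0.0)
relay_c = policy.should_relay("C", now=0.0)
policy.trigger_override(now=10.0)
relay_c_during_override = policy.should_relay("C", now=20.0)
```

In the default mode an unrecognised speaker is relayed, whereas with `allow_by_default=False` only speakers in `allowed` are heard.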

Alternatively or in addition, where a camera associated with the entertainment device or the HMD can see another person in the room, then if the uttered speech is not that of the user, it can be assumed to be spoken by the other person. Consequently visual recognition of the other person may be used to identify them or to confirm their identity as determined from their voice.

Where a person other than the current user of the HMD speaks and they are not recognised, then their vocal characteristics, as described elsewhere herein, can be learned so that they can be recognised, and they can be assigned as a currently unknown but recognised third party for the purposes of being blacklisted or white listed for audio pass-through to the user of the HMD. The current user of the HMD can then optionally provide an identity (e.g. a name) for that third party for future reference.

It will be appreciated that typically the user of the HMD is listening to audio from the virtual environment via headphones, and hence audio from the virtual environment will not be audible within the real-world environment.

However, it is possible that audio from the virtual environment is played audibly for example via a television or surround sound system, for example for the benefit of the or each additional person in the room (for example in a so-called social screen mode, where the viewpoint of the user of the HMD, or a separate viewpoint of the virtual environment, is provided on a television for the benefit of other people).

In this case, it is possible that audio including speech originating from the virtual environment may be picked up by the or each microphone being used to provide signals to the audio processor for audio pass-through to the user of the HMD.

Consequently, optionally the audio processor compares received microphone signals with the audio signal being output (typically by the audio processor itself) and discounts such audio or subtracts it from the overall received signal to detect any residual real-world speech. Typically there will be a short delay between the generation of audio by the audio processor and its output and subsequent reception by microphones, and so the audio processor or entertainment device may maintain a buffer of output audio to enable a comparison with a delayed version. Typically any such delay will be relatively stable once determined; the delay due to any output path of the entertainment device, and audio processing of the television or a surround sound system will typically be fixed or only have small variability, and the propagation delay from the speakers to the microphones will typically remain similar unless the user moves around the room to a large extent, which is generally unlikely when they are wearing an HMD. Hence once a representative delay has been determined, for example using correlation between buffered output sound and sound from the microphones, then any evolving change in the delay can be efficiently tracked from the current delay estimate, and the delayed version of the output audio from the buffer can then be used to remove or compensate for that audio as detected by the microphones.
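The buffer comparison and subtraction can be sketched as below. The sketch assumes the played-back audio reaches the microphone at unit gain and without room colouration, which is a simplification; real systems would typically need to estimate gain and filtering as well. The function names are hypothetical.

```python
def find_offset(played, captured, max_offset):
    """Delay in samples at which the buffered output best matches the capture."""
    def score(offset):
        return sum(
            played[i] * captured[i + offset]
            for i in range(len(played))
            if i + offset < len(captured)
        )
    return max(range(max_offset + 1), key=score)


def subtract_playback(played, captured, offset, gain=1.0):
    """Remove the delayed output audio from the microphone signal."""
    residual = list(captured)
    for i, sample in enumerate(played):
        j = i + offset
        if j < len(residual):
            residual[j] -= gain * sample
    return residual


# Example: game audio arrives 3 samples late, mixed with a little real speech.
played = [0.5, -1.0, 2.0, -0.5, 1.5]
speech = [0.0, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, -0.2, 0.0]
captured = list(speech)
for i, sample in enumerate(played):
    captured[i + 3] += sample

offset = find_offset(played, captured, max_offset=5)
residual = subtract_playback(played, captured, offset)
```

Once an offset has been found, later frames could search only a narrow range around it, in line with the observation that the delay tends to be stable.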

In this way, both self-utterances from the current user of the HMD and audio from the virtual environment are not relayed to the user, whilst speech from other people that is detected by the or each microphone supplying the audio processor can be relayed to the user, and optionally one or more such recognised people may be muted or white listed. Meanwhile optionally an override word or phrase may be used to force speech and/or all sound (optionally with the exception of audio from the virtual environment) to be relayed to the user of the HMD.

A typical use case for this arrangement is when two people share an HMD during a playing session; for a period of time user A is wearing the HMD, and so their own voice is not relayed to them whilst the voice of user B is. Then the users swap the HMD, and now the voice of user B is not relayed as they are the current user, but the voice of user A is as they have become the third party.

In this way, by identifying the current user and only relaying other people’s voices (optionally selectively, as described elsewhere herein), the current user of the HMD maintains situational awareness, even if the HMD is shared between a group of people within the same room so that the current user changes with time.

Referring now to FIG. 2, in a summary embodiment of the present description an alert method for a head mounted display comprises the following steps.

A first step s210 comprises identifying the current user of the head mounted display, as described elsewhere herein.

A second step s220 comprises retrieving a speaker recognition profile for the current user, as described elsewhere herein.

A third step s230 comprises detecting audio using one or more microphones 90, as described elsewhere herein.

It will be appreciated that these steps may occur in any suitable order; for example where identification of a user is based on audio, the order may comprise the third and second steps before the first step.

A fourth step s240 comprises estimating whether the detected audio comprises speech corresponding to that of the current user, based on the retrieved speaker recognition profile, as described elsewhere herein.

Then if the detected audio does not comprise speech corresponding to that of the current user, a fifth step s250 comprises relaying the detected audio comprising the speech to the current user of the head mounted display, as described elsewhere herein.
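Steps s210 to s250 can be tied together in a short sketch in which each step is supplied as a callable. The function `alert_step` and its parameter names are hypothetical, and the stub implementations in the example stand in for the identification, recognition and audio components described elsewhere herein.

```python
def alert_step(identify_user, load_profile, capture_audio, matches_profile, relay):
    """Run one pass of the alert method; return True if audio was relayed."""
    user = identify_user()                 # s210: identify the current user
    profile = load_profile(user)           # s220: retrieve their speaker profile
    audio = capture_audio()                # s230: detect audio
    if audio is None:
        return False                       # nothing was detected
    if matches_profile(audio, profile):    # s240: is it the wearer's own speech?
        return False
    relay(audio)                           # s250: pass it through to the wearer
    return True


# Example with stub components: audio is tagged with the name of its speaker.
relayed = []
profiles = {"A": "profile-A", "B": "profile-B"}


def run(wearer, speaker):
    return alert_step(
        identify_user=lambda: wearer,
        load_profile=profiles.get,
        capture_audio=lambda: {"speaker": speaker},
        matches_profile=lambda audio, profile: profiles[audio["speaker"]] == profile,
        relay=relayed.append,
    )


a_hears_b = run(wearer="A", speaker="B")
a_hears_self = run(wearer="A", speaker="A")
b_hears_a = run(wearer="B", speaker="A")
```

The last two calls mirror the shared-headset case: after the users swap, the voice of A is relayed to B, while each wearer's own voice is suppressed.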

It will be apparent to a person skilled in the art that variations in the above method corresponding to operation of the various embodiments of the apparatus as described and claimed herein are considered within the scope of the present invention, including but not limited to that:

the step of identifying the current user comprises visual recognition of the user; and the step of retrieving a speaker recognition profile comprises retrieving a speaker recognition profile associated with that user, as described elsewhere herein;

the step of identifying the current user comprises obtaining a sample of the current user’s speech and identifying a speaker recognition profile that best matches the sample; and the step of retrieving a speaker recognition profile comprises keeping the speaker recognition profile identified as best matching the sample, as described elsewhere herein; in this case, optionally the step of obtaining a sample of the current user’s speech comprises one or more selected from the list consisting of obtaining a sample of speech from a microphone that is proximate to the mouth of the current user of the HMD, obtaining a sample of speech from a directional microphone or microphone array pointing substantially towards the mouth of the current user of the HMD, and obtaining a sample of speech from a plurality of microphones, the samples of the speech from respective microphones having a pattern of relative delay characteristic of being spoken by the current user at the HMD, as described elsewhere herein;

the method comprises the step of estimating whether the detected audio comprises speech corresponding to a different speaker, based upon one or more additional speaker recognition profiles, and if so, identifying the different speaker, as described elsewhere herein; in this case, optionally the method comprises the step of indicating the identity of the different speaker to the user of the head mounted display, as described elsewhere herein;

in this case, similarly optionally the method comprises the step of comparing the identity of the different speaker with a list of muted speakers, and if the different speaker is listed as a muted speaker (for example either due to being black listed or not being white listed), then not relaying the detected audio comprising the speech to the current user of the head mounted display, as described elsewhere herein;

the method comprises the step of estimating whether the detected audio comprises a predetermined key word or phrase, and if so, relaying the detected audio to the user of the mounted display at least for a predetermined period of time, as described elsewhere herein; and

the method comprises the step of estimating whether the detected audio comprises audio generated for content being presented to the head mounted display, and if so, not relaying the detected audio comprising the audio generated for the content to the user of the head mounted display, as described elsewhere herein; in this case, optionally the step of estimating whether the detected audio comprises audio generated for the content comprises the steps of retaining the audio generated for the content in a buffer for a predetermined period of time, and comparing the retained audio in the buffer with the detected audio to detect an offset match, as described elsewhere herein; in this case, optionally the step of not relaying the detected audio comprises subtracting the retained audio in the buffer, at an offset corresponding to the offset match, from the detected audio, as described elsewhere herein.

It will be appreciated that the above methods may be carried out on conventional hardware suitably adapted as applicable by software instruction or by the inclusion or substitution of dedicated hardware.

Thus the required adaptation to existing parts of a conventional equivalent device may be implemented in the form of a computer program product comprising processor implementable instructions stored on a non-transitory machine-readable medium such as a floppy disk, optical disk, hard disk, solid state disk, PROM, RAM, flash memory or any combination of these or other storage media, or realised in hardware as an ASIC (application specific integrated circuit) or an FPGA (field programmable gate array) or other configurable circuit suitable to use in adapting the conventional equivalent device. Separately, such a computer program may be transmitted via data signals on a network such as an Ethernet, a wireless network, the Internet, or any combination of these or other networks.

Accordingly, in a summary embodiment of the present description, an alert system operable to implement the methods and/or techniques described herein (such as for example an entertainment system 10 like a video games console such as for example the PlayStation 5®), for and typically in conjunction with a head mounted display 802, comprises the following.

Firstly, a user identification processor (such as GPU 30 and/or CPU 20) configured (for example by suitable software instruction) to identify the current user of the head mounted display, as described elsewhere herein.

Additionally, a retrieval processor (again such as GPU 30 and/or CPU 20) configured (again for example by suitable software instruction) to retrieve a speaker recognition profile for the current user from storage such as RAM 40 and/or SSD 50, or a remote storage accessible online (not shown), as described elsewhere herein.

Additionally, one or more microphones 90 for detecting audio, as described elsewhere herein.

Additionally, an audio processor (again such as GPU 30 and/or CPU 20) configured (again for example by suitable software instruction) to estimate whether the detected audio comprises speech corresponding to that of the current user, based on the retrieved speaker recognition profile, and if not, to relay the detected audio comprising the speech to the current user of the head mounted display, as described elsewhere herein.

Instances of this summary embodiment implementing the methods and techniques described herein (for example by use of suitable software instruction) are envisaged within the scope of the application, including but not limited to that:

the audio processor is configured to estimate whether the detected audio comprises speech corresponding to a different speaker, based upon one or more additional speaker recognition profiles, and if so, to identify the different speaker, and the audio processor being configured to compare the identity of the different speaker with a list of muted speakers, and if the different speaker is listed as a muted speaker, then to not relay the detected audio comprising the speech to the current user of the head mounted display, as described elsewhere herein; and

the audio processor is configured to estimate whether the detected audio comprises a predetermined key word or phrase, and if so, to relay the detected audio to the user of the mounted display at least for a predetermined period of time, as described elsewhere herein.

The foregoing discussion discloses and describes merely exemplary embodiments of the present invention. As will be understood by those skilled in the art, the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting of the scope of the invention, as well as other claims. The disclosure, including any readily discernible variants of the teachings herein, defines, in part, the scope of the foregoing claim terminology such that no inventive subject matter is dedicated to the public.
