Patent: Audio Parallax For Virtual Reality, Augmented Reality, And Mixed Reality
Publication Number: 10659906
Publication Date: May 19, 2020
Applicants: Qualcomm
Abstract
An example audio decoding device includes processing circuitry and a memory device coupled to the processing circuitry. The processing circuitry is configured to receive, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, to receive metadata associated with the bitstream, to obtain, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and to apply the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield. The memory device is configured to store at least a portion of the received bitstream, the received metadata, or the parallax-adjusted audio objects of the 3D soundfield.
TECHNICAL FIELD
The disclosure relates to the encoding and decoding of audio data and, more particularly, audio data coding techniques for virtual reality and augmented reality environments.
BACKGROUND
Various technologies have been developed that allow a person to sense and interact with a computer-generated environment, often through visual and sound effects provided to the person or persons by the devices providing the computer-generated environment. These computer-generated environments are sometimes referred to as “virtual reality” or “VR” environments. For example, a user may access a VR experience using one or more wearable devices, such as a headset. A VR headset may include various output components, such as a display screen that provides visual images to the user, and speakers that output sounds. In some examples, a VR headset may provide additional sensory effects, such as tactile sensations provided by way of movement or vibrations. In some examples, the computer-generated environment may provide audio effects to a user or users through speakers or other devices not necessarily worn by the user, but rather, where the user is positioned within audible range of the speakers. Similarly, head-mounted displays (HMDs) exist that allow a user to see the real world in front of the user (as the lenses are transparent) and to see graphic overlays (e.g., from projectors embedded in the HMD frame), as a form of “augmented reality” or “AR.” Similarly, systems exist that allow a user to experience the real world with the addition of VR elements, as a form of “mixed reality” or “MR.”
VR, MR, and AR systems may incorporate capabilities to render higher-order ambisonics (HOA) signals, which are often represented by a plurality of spherical harmonic coefficients (SHC) or other hierarchical elements. That is, the HOA signals that are rendered by a VR, MR, or AR system may represent a three dimensional (3D) soundfield. The HOA or SHC representation may represent the 3D soundfield in a manner that is independent of the local speaker geometry used to playback a multi-channel audio signal rendered from the SHC signal. The SHC signal may also facilitate backwards compatibility as the SHC signal may be rendered to well-known and highly adopted multi-channel formats, such as a 5.1 audio channel format or a 7.1 audio channel format. The SHC representation may therefore enable a better representation of a soundfield that also accommodates backward compatibility.
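As a rough illustration of this layout independence (not drawn from the disclosure itself), rendering an HOA/SHC signal to a particular loudspeaker configuration can be viewed as a matrix multiplication by a layout-specific rendering matrix. In the sketch below the renderer matrices are random placeholders rather than standardized designs; only the shapes and the matrix-multiply structure are the point:

```python
# Illustrative sketch: one HOA/SHC frame can feed different loudspeaker
# layouts via layout-specific rendering matrices (placeholder values here).
import numpy as np

order = 4
num_coeffs = (order + 1) ** 2                     # 25 HOA coefficients per sample
hoa_frame = np.random.randn(num_coeffs, 1024)     # [coefficients x time samples]

renderer_5_1 = np.random.randn(6, num_coeffs)     # 5.1 layout: six output channels
renderer_7_1 = np.random.randn(8, num_coeffs)     # 7.1 layout: eight output channels

feeds_5_1 = renderer_5_1 @ hoa_frame              # shape (6, 1024)
feeds_7_1 = renderer_7_1 @ hoa_frame              # shape (8, 1024)
```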
SUMMARY
In general, techniques are described by which audio decoding devices and audio encoding devices may leverage video data from a computer-generated environment’s video feed to provide a more accurate representation of the 3D soundfield associated with the computer-generated reality experience. Generally, the techniques of this disclosure may enable various systems to adjust audio objects in the HOA domain to generate a more accurate representation of the energies and directional components of the audio data upon rendering. As one example, the techniques may enable rendering the 3D soundfield to accommodate a six degree-of-freedom (6-DOF) capability of the computer-generated reality system. Moreover, the techniques of this disclosure enable the rendering devices to use data represented in the HOA domain to alter audio data based on characteristics of the video feed being provided for the computer-generated reality experience.
For instance, according to the techniques described herein, the audio rendering device of the computer-generated reality system may adjust foreground audio objects for parallax-related changes that stem from “silent objects” that may attenuate the foreground audio objects. As another example, the techniques of this disclosure may enable the audio rendering device of the computer-generated reality system to determine relative distances between the user and a particular foreground audio object. As another example, the techniques of this disclosure may enable the audio rendering device to apply transmission factors to render the 3D soundfield to provide a more accurate computer-generated reality experience to a user.
In one example, this disclosure is directed to an audio decoding device. The audio decoding device may include processing circuitry and a memory device coupled to the processing circuitry. The processing circuitry is configured to receive, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, to receive metadata associated with the bitstream, to obtain, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and to apply the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield. The memory device is configured to store at least a portion of the received bitstream, the received metadata, or the parallax-adjusted audio objects of the 3D soundfield.
In another example, this disclosure is directed to a method that includes receiving, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, and receiving metadata associated with the bitstream. The method may further include obtaining, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and applying the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield.
In another example, this disclosure is directed to an audio decoding apparatus. The audio decoding apparatus may include means for receiving, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, and means for receiving metadata associated with the bitstream. The audio decoding apparatus may further include means for obtaining, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and means for applying the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield.
In another example, this disclosure is directed to a non-transitory computer-readable storage medium encoded with instructions. The instructions, when executed, cause processing circuitry of an audio decoding device to receive, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, and to receive metadata associated with the bitstream. The instructions, when executed, further cause the processing circuitry of the audio decoding device to obtain, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and to apply the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield.
The details of one or more aspects of the techniques are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of these techniques will be apparent from the description and drawings, and from the claims.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 is a diagram illustrating spherical harmonic basis functions from the zero order (n=0) to the fourth order (n=4).
FIG. 2A is a diagram illustrating a system that may perform various aspects of the techniques described in this disclosure.
FIGS. 2B-2D are diagrams illustrating different examples of the system shown in the example of FIG. 2A.
FIG. 3 is a diagram illustrating a six degree-of-freedom (6-DOF) head movement scheme for VR and/or AR applications.
FIGS. 4A-4D are diagrams illustrating an example of parallax issues that may be presented in a VR scene.
FIGS. 5A and 5B are diagrams illustrating another example of parallax issues that may be presented in a VR scene.
FIGS. 6A-6D are flow diagrams illustrating various encoder-side techniques of this disclosure.
FIG. 7 is a flowchart illustrating a decoding process that an audio decoding device may perform, in accordance with aspects of this disclosure.
FIG. 8 is a diagram illustrating an object classification mechanism that an audio encoding device may implement to categorize silent objects, foreground objects, and background objects, in accordance with aspects of this disclosure.
FIG. 9A is a diagram illustrating an example of stitching of audio/video capture data from multiple microphones and cameras, in accordance with aspects of this disclosure.
FIG. 9B is a flowchart illustrating a process that includes encoder- and decoder-side operations of parallax adjustments with stitching and interpolation, in accordance with aspects of this disclosure.
FIG. 9C is a diagram illustrating the capture of foreground objects and background objects at multiple locations.
FIG. 9D illustrates a mathematical expression of an interpolation technique that an audio decoding device may perform, in accordance with aspects of this disclosure.
FIG. 9E is a diagram illustrating an application of point cloud-based interpolation that an audio decoding device may implement, in accordance with aspects of this disclosure.
FIG. 10 is a diagram illustrating aspects of an HOA domain calculation of attenuation of foreground audio objects that an audio decoding device may perform, in accordance with aspects of this disclosure.
FIG. 11 is a diagram illustrating aspects of transmission factor calculations that an audio encoding device may perform, in accordance with one or more techniques of this disclosure.
FIG. 12 is a diagram illustrating a process that may be performed by an integrated encoding/rendering device, in accordance with aspects of this disclosure.
FIG. 13 is a flowchart illustrating a process that an audio encoding device or an integrated encoding/rendering device may perform, in accordance with aspects of this disclosure.
FIG. 14 is a flowchart illustrating an example process that an audio decoding device or an integrated encoding/decoding/rendering device may perform, in accordance with aspects of this disclosure.
FIG. 15 is a flowchart illustrating an example process that an audio decoding device or an integrated encoding/decoding/rendering device may perform, in accordance with aspects of this disclosure.
FIG. 16 is a flowchart illustrating a process that an audio encoding device or an integrated encoding/rendering device may perform, in accordance with aspects of this disclosure.
FIG. 17 is a flowchart illustrating an example process that an audio decoding device or an integrated encoding/decoding/rendering device may perform, in accordance with aspects of this disclosure.
FIG. 18 is a flowchart illustrating an example process that an audio decoding device or an integrated encoding/decoding/rendering device may perform, in accordance with aspects of this disclosure.
DETAILED DESCRIPTION
In some aspects, this disclosure describes techniques by which audio decoding devices and audio encoding devices may leverage video data from a VR, MR, or AR video feed to provide a more accurate representation of the 3D soundfield associated with the VR/MR/AR experience. For instance, techniques of this disclosure may enable various systems to adjust audio objects in the HOA domain to generate a more accurate representation of the energies and directional components of the audio data upon rendering. As one example, the techniques may enable rendering the 3D soundfield to accommodate a six degree-of-freedom (6-DOF) capability of the VR system.
Moreover, the techniques of this disclosure enable the rendering devices to use HOA domain data to alter audio data based on characteristics of the video feed being provided for the VR experience. For instance, according to the techniques described herein, the audio rendering device of the VR system may adjust foreground audio objects for parallax-related changes that stem from “silent objects” that may attenuate the foreground audio objects. As another example, the techniques of this disclosure may enable the audio rendering device of the VR system to determine relative distances between the user and a particular foreground audio object.
Surround sound technology may be particularly suited to incorporation into VR systems. For instance, the immersive audio experience provided by surround sound technology complements the immersive video and sensory experience provided by other aspects of VR systems. Moreover, augmenting the energy of audio objects with directional characteristics as provided by ambisonics technology provides for a more realistic simulation by the VR environment. For instance, the combination of realistic placement of visual objects in combination with corresponding placement of audio objects via the surround sound speaker array may more accurately simulate the environment that is being replicated.
There are various surround-sound channel-based formats in the market. They range, for example, from the 5.1 home theatre system (which has been the most successful in terms of making inroads into living rooms beyond stereo) to the 22.2 system developed by NHK (Nippon Hoso Kyokai or Japan Broadcasting Corporation). Content creators (e.g., Hollywood studios) would like to produce the soundtrack for a movie once, and not spend effort to remix it for each speaker configuration. The Moving Picture Experts Group (MPEG) has released a standard allowing for soundfields to be represented using a hierarchical set of elements (e.g., Higher-Order Ambisonic (HOA) coefficients) that can be rendered to speaker feeds for most speaker configurations, including the 5.1 and 22.2 configurations, whether the speakers are in locations defined by various standards or in non-uniform locations.
MPEG released the standard as the MPEG-H 3D Audio standard, formally entitled “Information technology–High efficiency coding and media delivery in heterogeneous environments–Part 3: 3D audio,” set forth by ISO/IEC JTC 1/SC 29, with document identifier ISO/IEC DIS 23008-3, and dated Jul. 25, 2014. MPEG also released a second edition of the 3D Audio standard, entitled “Information technology–High efficiency coding and media delivery in heterogeneous environments–Part 3: 3D audio,” set forth by ISO/IEC JTC 1/SC 29, with document identifier ISO/IEC 23008-3:201x(E), and dated Oct. 12, 2016. Reference to the “3D Audio standard” in this disclosure may refer to one or both of the above standards.
As noted above, one example of a hierarchical set of elements is a set of spherical harmonic coefficients (SHC). The following expression demonstrates a description or representation of a soundfield using SHC:
$$p_i(t, r_r, \theta_r, \varphi_r) = \sum_{\omega=0}^{\infty}\left[ 4\pi \sum_{n=0}^{\infty} j_n(k r_r) \sum_{m=-n}^{n} A_n^m(k)\, Y_n^m(\theta_r, \varphi_r) \right] e^{j\omega t}$$

The expression shows that the pressure $p_i$ at any point $\{r_r, \theta_r, \varphi_r\}$ of the soundfield, at time $t$, can be represented uniquely by the SHC, $A_n^m(k)$. Here, $k = \omega/c$, $c$ is the speed of sound (approximately 343 m/s), $\{r_r, \theta_r, \varphi_r\}$ is a point of reference (or observation point), $j_n(\cdot)$ is the spherical Bessel function of order $n$, and $Y_n^m(\theta_r, \varphi_r)$ are the spherical harmonic basis functions (which may also be referred to as spherical basis functions) of order $n$ and suborder $m$. It can be recognized that the term in square brackets is a frequency-domain representation of the signal (i.e., $S(\omega, r_r, \theta_r, \varphi_r)$), which can be approximated by various time-frequency transformations, such as the discrete Fourier transform (DFT), the discrete cosine transform (DCT), or a wavelet transform. Other examples of hierarchical sets include sets of wavelet transform coefficients and other sets of coefficients of multiresolution basis functions.
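The bracketed term can be evaluated numerically once the coefficients $A_n^m(k)$ are known. The following Python sketch is an illustration only; the coefficient values are random placeholders and the variable names are assumptions, and note that SciPy's sph_harm takes the azimuth before the polar angle:

```python
# Minimal sketch: evaluating the spherical-harmonic building blocks
# j_n(k r) and Y_n^m(theta, phi) of the SHC expansion with SciPy.
import numpy as np
from scipy.special import spherical_jn, sph_harm

c = 343.0                      # speed of sound in m/s
freq = 1000.0                  # analysis frequency in Hz
k = 2 * np.pi * freq / c       # wavenumber k = omega / c

# Observation point {r_r, theta_r (polar), phi_r (azimuth)}
r_r, theta_r, phi_r = 1.0, np.pi / 3, np.pi / 4

def pressure_term(n, m, A_nm):
    """One (n, m) term of the bracketed sum: 4*pi * j_n(k r_r) * A_n^m(k) * Y_n^m."""
    Y_nm = sph_harm(m, n, phi_r, theta_r)   # SciPy argument order: (m, n, azimuth, polar)
    return 4 * np.pi * spherical_jn(n, k * r_r) * A_nm * Y_nm

# Sum a fourth-order (N = 4) set of coefficients, here random placeholders.
N = 4
rng = np.random.default_rng(0)
total = sum(
    pressure_term(n, m, rng.standard_normal() + 1j * rng.standard_normal())
    for n in range(N + 1)
    for m in range(-n, n + 1)
)
print(total)  # frequency-domain pressure contribution at the observation point
```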
FIG. 1 is a diagram illustrating spherical harmonic basis functions from the zero order (n=0) to the fourth order (n=4). As can be seen, for each order, there is an expansion of suborders m which are shown but not explicitly noted in the example of FIG. 1 for ease of illustration purposes.
The SHC $A_n^m(k)$ can either be physically acquired (e.g., recorded) by various microphone array configurations or, alternatively, they can be derived from channel-based or object-based descriptions of the soundfield. The SHC (which also may be referred to as higher-order ambisonic, or HOA, coefficients) represent scene-based audio, where the SHC may be input to an audio encoder to obtain encoded SHC that may promote more efficient transmission or storage. For example, a fourth-order representation involving $(1+4)^2$ (25, and hence fourth order) coefficients may be used.
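For reference, the coefficient count grows quadratically with the order; a small sketch of the $(N+1)^2$ relationship:

```python
# Minimal sketch: an order-N HOA/SHC representation carries (N + 1)^2
# coefficients, e.g. 25 coefficients for a fourth-order representation.
def num_hoa_coefficients(order: int) -> int:
    return (order + 1) ** 2

for order in range(5):
    print(order, num_hoa_coefficients(order))  # 0->1, 1->4, 2->9, 3->16, 4->25
```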
As noted above, the SHC may be derived from a microphone recording using a microphone array. Various examples of how SHC may be derived from microphone arrays are described in Poletti, M., “Three-Dimensional Surround Sound Systems Based on Spherical Harmonics,” J. Audio Eng. Soc., Vol. 53, No. 11, 2005 November, pp. 1004-1025.
To illustrate how the SHCs may be derived from an object-based description, consider the following equation. The coefficients $A_n^m(k)$ for the soundfield corresponding to an individual audio object may be expressed as:

$$A_n^m(k) = g(\omega)\,(-4\pi i k)\, h_n^{(2)}(k r_s)\, Y_n^{m*}(\theta_s, \varphi_s),$$

where $i$ is $\sqrt{-1}$, $h_n^{(2)}(\cdot)$ is the spherical Hankel function (of the second kind) of order $n$, and $\{r_s, \theta_s, \varphi_s\}$ is the location of the object. Knowing the object source energy $g(\omega)$ as a function of frequency (e.g., using time-frequency analysis techniques, such as performing a fast Fourier transform on the PCM stream) allows us to convert each PCM object and the corresponding location into the SHC $A_n^m(k)$. Further, it can be shown (since the above is a linear and orthogonal decomposition) that the $A_n^m(k)$ coefficients for each object are additive. In this manner, a number of PCM objects can be represented by the $A_n^m(k)$ coefficients (e.g., as a sum of the coefficient vectors for the individual objects). Essentially, the coefficients contain information about the soundfield (the pressure as a function of 3D coordinates), and the above represents the transformation from individual objects to a representation of the overall soundfield in the vicinity of the observation point $\{r_r, \theta_r, \varphi_r\}$. The remaining figures are described below in the context of SHC-based audio coding.
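As an illustrative sketch of this object-based conversion (the function names and parameter choices are assumptions, not the disclosure's implementation), the following Python snippet builds $A_n^m(k)$ for a point source and shows that coefficients from multiple objects simply add:

```python
# Minimal sketch: encoding point-source audio objects into fourth-order SHC
# via A_n^m(k) = g(omega) * (-4*pi*i*k) * h_n^(2)(k r_s) * conj(Y_n^m), and
# summing the per-object coefficient vectors (the decomposition is linear).
import numpy as np
from scipy.special import spherical_jn, spherical_yn, sph_harm

def spherical_hankel2(n, z):
    """Spherical Hankel function of the second kind: h_n^(2)(z) = j_n(z) - i*y_n(z)."""
    return spherical_jn(n, z) - 1j * spherical_yn(n, z)

def object_to_shc(g_omega, k, r_s, theta_s, phi_s, order=4):
    """Return the (order + 1)^2 SHC coefficients for one point source."""
    coeffs = []
    for n in range(order + 1):
        h2 = spherical_hankel2(n, k * r_s)
        for m in range(-n, n + 1):
            Y = sph_harm(m, n, phi_s, theta_s)   # SciPy: (m, n, azimuth, polar)
            coeffs.append(g_omega * (-4j * np.pi * k) * h2 * np.conj(Y))
    return np.array(coeffs)

k = 2 * np.pi * 1000.0 / 343.0                   # wavenumber at 1 kHz
obj_a = object_to_shc(1.0, k, r_s=2.0, theta_s=np.pi / 2, phi_s=0.0)
obj_b = object_to_shc(0.5, k, r_s=3.0, theta_s=np.pi / 3, phi_s=np.pi)
soundfield = obj_a + obj_b                       # coefficients are additive
print(soundfield.shape)                          # (25,) for a fourth-order set
```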
FIG. 2A is a diagram illustrating a system 10A that may perform various aspects of the techniques described in this disclosure. As shown in the example of FIG. 2A, the system 10A includes a content creator device 12 and a content consumer device 14. While described in the context of the content creator device 12 and the content consumer device 14, the techniques may be implemented in any context in which SHCs (which may also be referred to as HOA coefficients) or any other hierarchical representation of a soundfield are encoded to form a bitstream representative of the audio data. Moreover, the content creator device 12 may represent any form of computing device capable of implementing the techniques described in this disclosure, including a handset (or cellular phone), a tablet computer, a smart phone, or a desktop computer to provide a few examples. Likewise, the content consumer device 14 may represent any form of computing device capable of implementing the techniques described in this disclosure, including a handset (or cellular phone), a tablet computer, a smart phone, a set-top box, or a desktop computer to provide a few examples.
The content creator device 12 may be operated by a movie studio, game programmer, manufacturers of VR systems, or any other entity that may generate multi-channel audio content for consumption by operators of content consumer devices, such as the content consumer device 14. In some examples, the content creator device 12 may be operated by an individual user who would like to compress HOA coefficients 11. Often, the content creator device 12 generates audio content in conjunction with video content and/or content that can be expressed via tactile or haptic output. For instance, the content creator device 12 may include, be, or be part of a system that generates VR, MR, or AR environment data. The content consumer device 14 may be operated by an individual. The content consumer device 14 may include an audio playback system 16, which may refer to any form of audio playback system capable of rendering SHC for play back as multi-channel audio content.
For instance, the content consumer device 14 may include, be, or be part of a system that provides a VR, MR, or AR environment or experience to a user. As such, the content consumer device 14 may also include components for output of video data, for the output and input of tactile or haptic communications, etc. For ease of illustration purposes only, the content creator device 12 and the content consumer device 14 are illustrated in FIG. 2A using various audio-related components, although it will be appreciated that, in accordance with VR and AR technology, one or both devices may include additional components configured to process non-audio data (e.g., other sensory data), as well.
The content creator device 12 includes an audio editing system 18. The content creator device 12 obtains live recordings 7 in various formats (including directly as HOA coefficients) and audio objects 9, which the content creator device 12 may edit using the audio editing system 18. Two or more microphones or microphone arrays (hereinafter, “microphones 5”) may capture the live recordings 7. The content creator device 12 may, during the editing process, render HOA coefficients 11 from the audio objects 9, listening to the rendered speaker feeds in an attempt to identify various aspects of the soundfield that require further editing. The content creator device 12 may then edit the HOA coefficients 11 (potentially indirectly through manipulation of different ones of the audio objects 9 from which the source HOA coefficients may be derived in the manner described above). The content creator device 12 may employ the audio editing system 18 to generate the HOA coefficients 11. The audio editing system 18 represents any system capable of editing audio data and outputting the audio data as one or more source spherical harmonic coefficients.
When the editing process is complete, the content creator device 12 may generate a bitstream 21 based on the HOA coefficients 11. That is, the content creator device 12 includes an audio encoding device 20 that represents a device configured to encode or otherwise compress HOA coefficients 11 in accordance with various aspects of the techniques described in this disclosure to generate the bitstream 21. The audio encoding device 20 may generate the bitstream 21 for transmission, as one example, across a transmission channel, which may be a wired or wireless channel, a data storage device, or the like. The bitstream 21 may represent an encoded version of the HOA coefficients 11 and may include a primary bitstream and another side bitstream, which may be referred to as side channel information. As shown in FIG. 2A, the audio encoding device 20 may also transmit metadata 23 over the transmission channel. In various examples, the audio encoding device 20 may generate the metadata 23 to include parallax-adjusting information with respect to the audio objects communicated via the bitstream 21. Although the metadata 23 is illustrated as being separate from the bitstream 21, the bitstream 21 may, in some examples, include the metadata 23.
According to techniques of this disclosure, the audio encoding device 20 may include, in the metadata 23, one or more of directional vector information, silent object information, and transmission factors for the HOA coefficients 11. For instance, the audio encoding device 20 may include transmission factors that, when applied, attenuate the energy of one or more of the HOA coefficients 11 communicated via the bitstream 21. In accordance with various aspects of this disclosure, the audio encoding device 20 may derive the transmission factors using object locations in video frames corresponding to the audio frames represented by the particular coefficients of the HOA coefficients 11. For instance, the audio encoding device 20 may determine that a silent object represented in the video data has a location that would interfere with the volume of certain foreground audio objects represented by the HOA coefficients 11, in a real-life scenario. In turn, the audio encoding device 20 may generate transmission factors that, when applied by the audio decoding device 24, would attenuate the energies of the HOA coefficients 11 to more accurately simulate the way the 3D soundfield would be heard by a listener in the corresponding video scene.
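A minimal decoder-side sketch of this idea follows. The metadata layout and field names here are hypothetical, since the disclosure does not prescribe a syntax for the metadata 23; the point is only that a per-object transmission factor scales (attenuates) the corresponding audio object before rendering:

```python
# Minimal sketch (hypothetical structure and names): applying per-object
# transmission factors carried in metadata to attenuate foreground audio
# objects, as an audio decoding device might do before rendering.
from dataclasses import dataclass
import numpy as np

@dataclass
class ParallaxMetadata:
    object_id: int
    transmission_factor: float   # 0.0 = fully blocked, 1.0 = unattenuated

def apply_transmission_factors(audio_objects, metadata_entries):
    """Scale each audio object's signal by its transmission factor (default 1.0)."""
    factors = {m.object_id: m.transmission_factor for m in metadata_entries}
    return {obj_id: factors.get(obj_id, 1.0) * signal
            for obj_id, signal in audio_objects.items()}

# Example: object 0 is partially occluded by a silent object; object 1 is not.
objects = {0: np.ones(4), 1: np.ones(4)}
meta = [ParallaxMetadata(object_id=0, transmission_factor=0.4)]
print(apply_transmission_factors(objects, meta))
```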
According to the techniques of this disclosure, the audio encoding device 20 may classify the audio objects 9, as expressed by the HOA coefficients 11, into foreground objects and background objects. For instance, the audio encoding device 20 may implement aspects of this disclosure to identify a silence object or silent object based on a determination that the object is represented in the video data, but does not correspond to a pre-identified audio object. Although described with respect to the audio encoding device 20 performing the video analysis, a video encoding device (not shown) or a dedicated visual analysis device or unit may perform the classification of the silent object, providing the classification and transmission factors to audio encoding device 20 for purposes of generating the metadata 23.
In the context of captured video and audio, the audio encoding device 20 may determine that an object does not correspond to a pre-identified audio object if the object is not equipped with a sensor. As used herein, the term “equipped with a sensor” may include scenarios where a sensor is attached (permanently or detachably) to an audio source, or placed within earshot of (though not attached to) an audio source. If the sensor is not attached to the audio source but is positioned within earshot, then, in applicable scenarios, multiple audio sources that are within earshot of the sensor are considered to be “equipped” with the sensor. In a synthetic VR environment, the audio encoding device 20 may implement techniques of this disclosure to determine that an object does not correspond to a pre-identified audio object if the object in question does not map to any audio object in a predetermined list. In a combination recorded-synthesized VR or AR environment, the audio encoding device 20 may implement techniques of this disclosure to determine that an object does not correspond to a pre-identified audio object using one or both of the techniques described above.
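A compact sketch of that classification logic, with hypothetical helper names, might look as follows; an object is treated as silent only when it is neither sensor-equipped nor mapped to a pre-identified audio object:

```python
# Minimal sketch (hypothetical helper names): flag a video-detected object as
# "silent" when it has no sensor and does not map to any pre-identified audio
# object, loosely following the classification described above.
def is_silent_object(video_object_id, sensor_equipped_ids, known_audio_object_ids):
    has_sensor = video_object_id in sensor_equipped_ids
    maps_to_audio = video_object_id in known_audio_object_ids
    return not has_sensor and not maps_to_audio

# Example: object 7 appears in the video analysis but carries no sensor and is
# absent from the predetermined audio-object list, so it is classified as silent.
print(is_silent_object(7, sensor_equipped_ids={1, 2}, known_audio_object_ids={1, 2, 3}))
```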
Moreover, the audio encoding device 20 may determine relative foreground location information that reflects a relationship between the location of the listener and the respective locations of the foreground audio objects represented by the HOA coefficients 11 in the bitstream 21. For instance, the audio encoding device 20 may identify the “first person” aspect of the video capture or video synthesis for the VR experience, and may determine the relationship between the location of the “first person” and the respective video object corresponding to each respective foreground audio object of the 3D soundfield.
In some examples, the audio encoding device 20 may also use the relative foreground location information to determine relative location information between the listener location and a silent object that attenuates the energy of the foreground object. For instance, the audio encoding device 20 may apply a scaling factor to the relative foreground location information, to derive the distance between the listener location and the silent object that attenuates the energy of the foreground audio object. The scaling factor may range in value from zero to one, with a zero value indicating that the silent object is co-located or substantially co-located with the listener location, and with the value of one indicating that the silent object is co-located or substantially co-located with the foreground audio object.
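Under these assumptions, deriving the listener-to-silent-object distance is a single scaling operation; a minimal sketch:

```python
# Minimal sketch: derive the listener-to-silent-object distance by scaling the
# listener-to-foreground distance, with the scaling factor in [0, 1]
# (0 ~ silent object at the listener, 1 ~ silent object at the foreground object).
def silent_object_distance(listener_to_foreground_distance, scaling_factor):
    if not 0.0 <= scaling_factor <= 1.0:
        raise ValueError("scaling factor must lie between zero and one")
    return scaling_factor * listener_to_foreground_distance

print(silent_object_distance(10.0, 0.25))  # silent object 2.5 units from the listener
```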