Qualcomm Patent | Audio Parallax For Virtual Reality, Augmented Reality, And Mixed Reality

Patent: Audio Parallax For Virtual Reality, Augmented Reality, And Mixed Reality

Publication Number: 20200260210

Publication Date: 20200813

Applicants: Qualcomm

Abstract

An example audio decoding device includes processing circuitry and a memory device coupled to the processing circuitry. The processing circuitry is configured to receive, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, to receive metadata associated with the bitstream, to obtain, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and to apply the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield. The memory device is configured to store at least a portion of the received bitstream, the received metadata, or the parallax-adjusted audio objects of the 3D soundfield.

[0001] This application is a continuation of U.S. application Ser. No. 15/868,656, filed 11 Jan. 2018, which claims the benefit of U.S. Provisional Application No. 62/446,324, filed 13 Jan. 2017, the entire content of each of which is incorporated by reference herein.

TECHNICAL FIELD

[0002] The disclosure relates to the encoding and decoding of audio data and, more particularly, audio data coding techniques for virtual reality and augmented reality environments.

BACKGROUND

[0003] Various technologies have been developed that allow a person to sense and interact with a computer-generated environment, often through visual and sound effects provided to the person or persons by the devices providing the computer-generated environment. These computer-generated environments are sometimes referred to as “virtual reality” or “VR” environments. For example, a user may access a VR experience using one or more wearable devices, such as a headset. A VR headset may include various output components, such as a display screen that provides visual images to the user, and speakers that output sounds. In some examples, a VR headset may provide additional sensory effects, such as tactile sensations provided by way of movement or vibrations. In some examples, the computer-generated environment may provide audio effects to a user or users through speakers or other devices not necessarily worn by the user, but rather, where the user is positioned within audible range of the speakers. Similarly, head-mounted displays (HMDs) exist that allow a user to see the real world in front of the user (as the lenses are transparent) and to see graphic overlays (e.g., from projectors embedded in the HMD frame), as a form of “augmented reality” or “AR.” Similarly, systems exist that allow a user to experience the real world with the addition of VR elements, as a form of “mixed reality” or “MR.”

[0004] VR, MR, and AR systems may incorporate capabilities to render higher-order ambisonics (HOA) signals, which are often represented by a plurality of spherical harmonic coefficients (SHC) or other hierarchical elements. That is, the HOA signals that are rendered by a VR, MR, or AR system may represent a three-dimensional (3D) soundfield. The HOA or SHC representation may represent the 3D soundfield in a manner that is independent of the local speaker geometry used to play back a multi-channel audio signal rendered from the SHC signal. The SHC signal may also facilitate backwards compatibility as the SHC signal may be rendered to well-known and highly adopted multi-channel formats, such as a 5.1 audio channel format or a 7.1 audio channel format. The SHC representation may therefore enable a better representation of a soundfield that also accommodates backward compatibility.

SUMMARY

[0005] In general, techniques are described by which audio decoding devices and audio encoding devices may leverage video data from a computer-generated environment’s video feed, to provide a more accurate representation of the 3D soundfield associated with the computer-generated reality experience. Generally, the techniques of this disclosure may enable various systems to adjust audio objects in the HOA domain to generate a more accurate representation of the energies and directional components of the audio data upon rendering. As one example, the techniques may enable rendering the 3D soundfield to accommodate a six degree-of-freedom (6-DOF) capability of the computer-generated reality system. Moreover, the techniques of this disclosure enable the rendering devices to use data represented in the HOA domain to alter audio data based on characteristics of the video feed being provided for the computer-generated reality experience.

[0006] For instance, according to the techniques described herein, the audio rendering device of the computer-generated reality system may adjust foreground audio objects for parallax-related changes that stem from “silent objects” that may attenuate the foreground audio objects. As another example, the techniques of this disclosure may enable the audio rendering device of the computer-generated reality system to determine relative distances between the user and a particular foreground audio object. As another example, the techniques of this disclosure may enable the audio rendering device to apply transmission factors to render the 3D soundfield to provide a more accurate computer-generated reality experience to a user.

[0007] In one example, this disclosure is directed to an audio decoding device. The audio decoding device may include processing circuitry and a memory device coupled to the processing circuitry. The processing circuitry is configured to receive, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, to receive metadata associated with the bitstream, to obtain, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and to apply the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield. The memory device is configured to store at least a portion of the received bitstream, the received metadata, or the parallax-adjusted audio objects of the 3D soundfield.

[0008] In another example, this disclosure is directed to a method that includes receiving, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, and receiving metadata associated with the bitstream. The method may further include obtaining, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and applying the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield.

[0009] In another example, this disclosure is directed to an audio decoding apparatus. The audio decoding apparatus may include means for receiving, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, and means for receiving metadata associated with the bitstream. The audio decoding apparatus may further include means for obtaining, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and means for applying the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield.

[0010] In another example, this disclosure is directed to a non-transitory computer-readable storage medium encoded with instructions. The instructions, when executed, cause processing circuitry of an audio decoding device to receive, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, and to receive metadata associated with the bitstream. The instructions, when executed, further cause the processing circuitry of the audio decoding device to obtain, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and to apply the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield.

[0011] The details of one or more aspects of the techniques are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of these techniques will be apparent from the description and drawings, and from the claims.

BRIEF DESCRIPTION OF DRAWINGS

[0012] FIG. 1 is a diagram illustrating spherical harmonic basis functions from the zero order (n=0) to the fourth order (n=4).

[0013] FIG. 2A is a diagram illustrating a system that may perform various aspects of the techniques described in this disclosure.

[0014] FIGS. 2B-2D are diagrams illustrating different examples of the system shown in the example of FIG. 2A.

[0015] FIG. 3 is a diagram illustrating a six degree-of-freedom (6-DOF) head movement scheme for VR and/or AR applications.

[0016] FIGS. 4A-4D are diagrams illustrating an example of parallax issues that may be presented in a VR scene.

[0017] FIGS. 5A and 5B are diagrams illustrating another example of parallax issues that may be presented in a VR scene.

[0018] FIGS. 6A-6D are flow diagrams illustrating various encoder-side techniques of this disclosure.

[0019] FIG. 7 is a flowchart illustrating a decoding process that an audio decoding device may perform, in accordance with aspects of this disclosure.

[0020] FIG. 8 is a diagram illustrating an object classification mechanism that an audio encoding device may implement to categorize silent objects, foreground objects, and background objects, in accordance with aspects of this disclosure.

[0021] FIG. 9A is a diagram illustrating an example of stitching of audio/video capture data from multiple microphones and cameras, in accordance with aspects of this disclosure.

[0022] FIG. 9B is a flowchart illustrating a process that includes encoder- and decoder-side operations of parallax adjustments with stitching and interpolation, in accordance with aspects of this disclosure.

[0023] FIG. 9C is a diagram illustrating the capture of foreground objects and background objects at multiple locations.

[0024] FIG. 9D illustrates a mathematical expression of an interpolation technique that an audio decoding device may perform, in accordance with aspects of this disclosure.

[0025] FIG. 9E is a diagram illustrating an application of point cloud-based interpolation that an audio decoding device may implement, in accordance with aspects of this disclosure.

[0026] FIG. 10 is a diagram illustrating aspects of an HOA domain calculation of attenuation of foreground audio objects that an audio decoding device may perform, in accordance with aspects of this disclosure.

[0027] FIG. 11 is a diagram illustrating aspects of transmission factor calculations that an audio encoding device may perform, in accordance with one or more techniques of this disclosure.

[0028] FIG. 12 is a diagram illustrating a process that may be performed by an integrated encoding/rendering device, in accordance with aspects of this disclosure.

[0029] FIG. 13 is a flowchart illustrating a process that an audio encoding device or an integrated encoding/rendering device may perform, in accordance with aspects of this disclosure.

[0030] FIG. 14 is a flowchart illustrating an example process that an audio decoding device or an integrated encoding/decoding/rendering device may perform, in accordance with aspects of this disclosure.

[0031] FIG. 15 is a flowchart illustrating an example process that an audio decoding device or an integrated encoding/decoding/rendering device may perform, in accordance with aspects of this disclosure.

[0032] FIG. 16 is a flowchart illustrating a process that an audio encoding device or an integrated encoding/rendering device may perform, in accordance with aspects of this disclosure.

[0033] FIG. 17 is a flowchart illustrating an example process that an audio decoding device or an integrated encoding/decoding/rendering device may perform, in accordance with aspects of this disclosure.

[0034] FIG. 18 is a flowchart illustrating an example process that an audio decoding device or an integrated encoding/decoding/rendering device may perform, in accordance with aspects of this disclosure.

DETAILED DESCRIPTION

[0035] In some aspects, this disclosure describes techniques by which audio decoding devices and audio encoding devices may leverage video data from a VR, MR, or AR video feed to provide a more accurate representation of the 3D soundfield associated with the VR/MR/AR experience. For instance, techniques of this disclosure may enable various systems to adjust audio objects in the HOA domain to generate a more accurate representation of the energies and directional components of the audio data upon rendering. As one example, the techniques may enable rendering the 3D soundfield to accommodate a six degree-of-freedom (6-DOF) capability of the VR system.

[0036] Moreover, the techniques of this disclosure enable the rendering devices to use HOA domain data to alter audio data based on characteristics of the video feed being provided for the VR experience. For instance, according to the techniques described herein, the audio rendering device of the VR system may adjust foreground audio objects for parallax-related changes that stem from “silent objects” that may attenuate the foreground audio objects. As another example, the techniques of this disclosure may enable the audio rendering device of the VR system to determine relative distances between the user and a particular foreground audio object.

[0037] Surround sound technology may be particularly suited to incorporation into VR systems. For instance, the immersive audio experience provided by surround sound technology complements the immersive video and sensory experience provided by other aspects of VR systems. Moreover, augmenting the energy of audio objects with directional characteristics as provided by ambisonics technology provides for a more realistic simulation by the VR environment. For instance, the combination of realistic placement of visual objects in combination with corresponding placement of audio objects via the surround sound speaker array may more accurately simulate the environment that is being replicated.

[0038] There are various surround-sound channel-based formats in the market. They range, for example, from the 5.1 home theatre system (which has been the most successful in terms of making inroads into living rooms beyond stereo) to the 22.2 system developed by NHK (Nippon Hoso Kyokai or Japan Broadcasting Corporation). Content creators (e.g., Hollywood studios) would like to produce the soundtrack for a movie once, and not spend effort to remix it for each speaker configuration. The Moving Picture Experts Group (MPEG) has released a standard allowing for soundfields to be represented using a hierarchical set of elements (e.g., Higher-Order Ambisonic (HOA) coefficients) that can be rendered to speaker feeds for most speaker configurations, including 5.1 and 22.2 configurations, whether in locations defined by various standards or in non-uniform locations.

[0039] MPEG released the standard as MPEG-H 3D Audio standard, formally entitled “Information technology–High efficiency coding and media delivery in heterogeneous environments–Part 3: 3D audio,” set forth by ISO/IEC JTC 1/SC 29, with document identifier ISO/IEC DIS 23008-3, and dated Jul. 25, 2014. MPEG also released a second edition of the 3D Audio standard, entitled “Information technology–High efficiency coding and media delivery in heterogeneous environments–Part 3: 3D audio,” set forth by ISO/IEC JTC 1/SC 29, with document identifier ISO/IEC 23008-3:201x(E), and dated Oct. 12, 2016. Reference to the “3D Audio standard” in this disclosure may refer to one or both of the above standards.

[0040] As noted above, one example of a hierarchical set of elements is a set of spherical harmonic coefficients (SHC). The following expression demonstrates a description or representation of a soundfield using SHC:

$$p_i(t, r_r, \theta_r, \varphi_r) = \sum_{\omega=0}^{\infty}\left[4\pi \sum_{n=0}^{\infty} j_n(k r_r) \sum_{m=-n}^{n} A_n^m(k)\, Y_n^m(\theta_r, \varphi_r)\right] e^{j\omega t},$$

[0041] The expression shows that the pressure $p_i$ at any point $\{r_r, \theta_r, \varphi_r\}$ of the soundfield, at time $t$, can be represented uniquely by the SHC, $A_n^m(k)$. Here, $k = \omega / c$, $c$ is the speed of sound (~343 m/s), $\{r_r, \theta_r, \varphi_r\}$ is a point of reference (or observation point), $j_n(\cdot)$ is the spherical Bessel function of order $n$, and $Y_n^m(\theta_r, \varphi_r)$ are the spherical harmonic basis functions (which may also be referred to as a spherical basis function) of order $n$ and suborder $m$. It can be recognized that the term in square brackets is a frequency-domain representation of the signal (i.e., $S(\omega, r_r, \theta_r, \varphi_r)$) which can be approximated by various time-frequency transformations, such as the discrete Fourier transform (DFT), the discrete cosine transform (DCT), or a wavelet transform. Other examples of hierarchical sets include sets of wavelet transform coefficients and other sets of coefficients of multiresolution basis functions.

[0042] FIG. 1 is a diagram illustrating spherical harmonic basis functions from the zero order (n=0) to the fourth order (n=4). As can be seen, for each order, there is an expansion of suborders m which are shown but not explicitly noted in the example of FIG. 1 for ease of illustration purposes.

[0043] The SHC $A_n^m(k)$ can either be physically acquired (e.g., recorded) by various microphone array configurations or, alternatively, they can be derived from channel-based or object-based descriptions of the soundfield. The SHC (which also may be referred to as higher-order ambisonic (HOA) coefficients) represent scene-based audio, where the SHC may be input to an audio encoder to obtain encoded SHC that may promote more efficient transmission or storage. For example, a fourth-order representation involving $(1+4)^2 = 25$ coefficients may be used.
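The coefficient count in the fourth-order example follows from each order n contributing 2n + 1 suborders (m = -n, ..., n). The short Python sketch below counts them; the function name `num_shc` is a hypothetical helper introduced only for illustration.

```python
def num_shc(order: int) -> int:
    """Count spherical harmonic coefficients for orders 0..order inclusive."""
    if order < 0:
        raise ValueError("order must be non-negative")
    # Each order n contributes the suborders m = -n..n, i.e. 2n + 1 terms.
    return sum(2 * n + 1 for n in range(order + 1))


# The sum of the first N + 1 odd numbers equals (N + 1) ** 2.
for order in range(5):
    assert num_shc(order) == (order + 1) ** 2
    print(order, num_shc(order))
```

A first-order representation therefore carries 4 coefficients, and a fourth-order representation carries 25.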

[0044] As noted above, the SHC may be derived from a microphone recording using a microphone array. Various examples of how SHC may be derived from microphone arrays are described in Poletti, M., “Three-Dimensional Surround Sound Systems Based on Spherical Harmonics,” J. Audio Eng. Soc., Vol. 53, No. 11, 2005 November, pp. 1004-1025.

[0045] To illustrate how the SHCs may be derived from an object-based description, consider the following equation. The coefficients $A_n^m(k)$ for the soundfield corresponding to an individual audio object may be expressed as:

$$A_n^m(k) = g(\omega)\,(-4\pi i k)\, h_n^{(2)}(k r_s)\, Y_n^{m*}(\theta_s, \varphi_s),$$

where $i$ is $\sqrt{-1}$, $h_n^{(2)}(\cdot)$ is the spherical Hankel function (of the second kind) of order $n$, and $\{r_s, \theta_s, \varphi_s\}$ is the location of the object. Knowing the object source energy $g(\omega)$ as a function of frequency (e.g., using time-frequency analysis techniques, such as performing a fast Fourier transform on the PCM stream) allows us to convert each PCM object and the corresponding location into the SHC $A_n^m(k)$. Further, it can be shown (since the above is a linear and orthogonal decomposition) that the $A_n^m(k)$ coefficients for each object are additive. In this manner, a number of PCM objects can be represented by the $A_n^m(k)$ coefficients (e.g., as a sum of the coefficient vectors for the individual objects). Essentially, the coefficients contain information about the soundfield (the pressure as a function of 3D coordinates), and the above represents the transformation from individual objects to a representation of the overall soundfield, in the vicinity of the observation point $\{r_r, \theta_r, \varphi_r\}$. The remaining figures are described below in the context of SHC-based audio coding.
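As a rough illustration of the object-to-SHC conversion and of the additivity property, the following Python sketch evaluates the expression above for orders 0 and 1 at a single frequency. Several details are assumptions not stated in the text: theta is taken as the polar angle and phi as the azimuth, the spherical harmonics use the Condon-Shortley phase, and the helper names (`hankel2`, `sph_harm`, `object_to_shc`, `add_shc`) and the example object values are hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, the approximate value given in the text


def hankel2(n: int, x: float) -> complex:
    """Spherical Hankel function of the second kind, closed forms for n = 0, 1."""
    if n == 0:
        return 1j * cmath.exp(-1j * x) / x
    if n == 1:
        return -cmath.exp(-1j * x) * (x - 1j) / (x * x)
    raise ValueError("this sketch only covers orders 0 and 1")


def sph_harm(n: int, m: int, theta: float, phi: float) -> complex:
    """Complex spherical harmonic Y_n^m for n <= 1 (assumed convention)."""
    if (n, m) == (0, 0):
        return complex(0.5 / math.sqrt(math.pi))
    if (n, m) == (1, 0):
        return complex(math.sqrt(3.0 / (4.0 * math.pi)) * math.cos(theta))
    if n == 1 and abs(m) == 1:
        sign = -1.0 if m == 1 else 1.0  # Condon-Shortley phase
        radial = math.sqrt(3.0 / (8.0 * math.pi)) * math.sin(theta)
        return sign * radial * cmath.exp(1j * m * phi)
    raise ValueError("this sketch only covers orders 0 and 1")


def object_to_shc(g, freq_hz, r_s, theta_s, phi_s):
    """A_n^m(k) = g * (-4*pi*i*k) * h_n^(2)(k*r_s) * conj(Y_n^m(theta_s, phi_s))."""
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND  # k = omega / c
    shc = {}
    for n in (0, 1):
        for m in range(-n, n + 1):
            y_conj = sph_harm(n, m, theta_s, phi_s).conjugate()
            shc[(n, m)] = g * (-4.0 * math.pi * 1j * k) * hankel2(n, k * r_s) * y_conj
    return shc


def add_shc(a, b):
    """Coefficients of separate objects are additive."""
    return {key: a[key] + b[key] for key in a}


# Two hypothetical objects at 1 kHz, combined into one soundfield description.
dog = object_to_shc(1.0, 1000.0, 2.0, math.pi / 2, 0.0)
car = object_to_shc(0.5, 1000.0, 5.0, math.pi / 2, math.pi)
scene = add_shc(dog, car)
print(len(scene))
```

Under these assumptions the zeroth-order magnitude reduces to g·sqrt(4π)/r_s, which reflects the expected 1/r decay with source distance.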

[0046] FIG. 2A is a diagram illustrating a system 10A that may perform various aspects of the techniques described in this disclosure. As shown in the example of FIG. 2A, the system 10A includes a content creator device 12 and a content consumer device 14. While described in the context of the content creator device 12 and the content consumer device 14, the techniques may be implemented in any context in which SHCs (which may also be referred to as HOA coefficients) or any other hierarchical representation of a soundfield are encoded to form a bitstream representative of the audio data. Moreover, the content creator device 12 may represent any form of computing device capable of implementing the techniques described in this disclosure, including a handset (or cellular phone), a tablet computer, a smart phone, or a desktop computer to provide a few examples. Likewise, the content consumer device 14 may represent any form of computing device capable of implementing the techniques described in this disclosure, including a handset (or cellular phone), a tablet computer, a smart phone, a set-top box, or a desktop computer to provide a few examples.

[0047] The content creator device 12 may be operated by a movie studio, game programmer, manufacturers of VR systems, or any other entity that may generate multi-channel audio content for consumption by operators of content consumer devices, such as the content consumer device 14. In some examples, the content creator device 12 may be operated by an individual user who would like to compress HOA coefficients 11. Often, the content creator device 12 generates audio content in conjunction with video content and/or content that can be expressed via tactile or haptic output. For instance, the content creator device 12 may include, be, or be part of a system that generates VR, MR, or AR environment data. The content consumer device 14 may be operated by an individual. The content consumer device 14 may include an audio playback system 16, which may refer to any form of audio playback system capable of rendering SHC for playback as multi-channel audio content.

[0048] For instance, the content consumer device 14 may include, be, or be part of a system that provides a VR, MR, or AR environment or experience to a user. As such, the content consumer device 14 may also include components for output of video data, for the output and input of tactile or haptic communications, etc. For ease of illustration purposes only, the content creator device 12 and the content consumer device 14 are illustrated in FIG. 2A using various audio-related components, although it will be appreciated that, in accordance with VR and AR technology, one or both devices may include additional components configured to process non-audio data (e.g., other sensory data), as well.

[0049] The content creator device 12 includes an audio editing system 18. The content creator device 12 may obtain live recordings 7 in various formats (including directly as HOA coefficients) and audio objects 9, which the content creator device 12 may edit using the audio editing system 18. Two or more microphones or microphone arrays (hereinafter, “microphones 5”) may capture the live recordings 7. The content creator device 12 may, during the editing process, render HOA coefficients 11 from audio objects 9, listening to the rendered speaker feeds in an attempt to identify various aspects of the soundfield that require further editing. The content creator device 12 may then edit the HOA coefficients 11 (potentially indirectly through manipulation of different ones of the audio objects 9 from which the source HOA coefficients may be derived in the manner described above). The content creator device 12 may employ the audio editing system 18 to generate the HOA coefficients 11. The audio editing system 18 represents any system capable of editing audio data and outputting the audio data as one or more source spherical harmonic coefficients.

[0050] When the editing process is complete, the content creator device 12 may generate a bitstream 21 based on the HOA coefficients 11. That is, the content creator device 12 includes an audio encoding device 20 that represents a device configured to encode or otherwise compress HOA coefficients 11 in accordance with various aspects of the techniques described in this disclosure to generate the bitstream 21. The audio encoding device 20 may generate the bitstream 21 for transmission, as one example, across a transmission channel, which may be a wired or wireless channel, a data storage device, or the like. The bitstream 21 may represent an encoded version of the HOA coefficients 11 and may include a primary bitstream and another side bitstream, which may be referred to as side channel information. As shown in FIG. 2A, the audio encoding device 20 may also transmit metadata 23 over the transmission channel. In various examples, the audio encoding device 20 may generate the metadata 23 to include parallax-adjusting information with respect to the audio objects communicated via the bitstream 21. Although the metadata 23 is illustrated as being separate from the bitstream 21, the bitstream 21 may, in some examples, include the metadata 23.

[0051] According to techniques of this disclosure, the audio encoding device 20 may include, in the metadata 23, one or more of directional vector information, silent object information, and transmission factors for the HOA coefficients 11. For instance, the audio encoding device 20 may include transmission factors that, when applied, attenuate the energy of one or more of the HOA coefficients 11 communicated via the bitstream 21. In accordance with various aspects of this disclosure, the audio encoding device 20 may derive the transmission factors using object locations in video frames corresponding to the audio frames represented by the particular coefficients of the HOA coefficients 11. For instance, the audio encoding device 20 may determine that a silent object represented in the video data has a location that would interfere with the volume of certain foreground audio objects represented by the HOA coefficients 11, in a real-life scenario. In turn, the audio encoding device 20 may generate transmission factors that, when applied by the audio decoding device 24, would attenuate the energies of the HOA coefficients 11 to more accurately simulate the way the 3D soundfield would be heard by a listener in the corresponding video scene.
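The decoder-side use of the transmission factors can be sketched as a per-object scaling step. This Python sketch assumes, purely for illustration, that each transmission factor is a scalar amplitude gain between 0 and 1 keyed by an object identifier; the text says only that the factors attenuate energy, so the data layout, the gain interpretation, and the name `apply_transmission_factors` are all hypothetical.

```python
def apply_transmission_factors(objects, factors):
    """Scale each object's HOA coefficients by its transmission factor.

    objects: dict mapping object id -> list of HOA coefficients (assumed layout)
    factors: dict mapping object id -> gain in [0, 1] taken from the metadata
    Objects without a factor are passed through unchanged.
    """
    adjusted = {}
    for obj_id, coeffs in objects.items():
        gain = factors.get(obj_id, 1.0)
        if not 0.0 <= gain <= 1.0:
            raise ValueError(f"transmission factor out of range for {obj_id!r}")
        adjusted[obj_id] = [gain * c for c in coeffs]
    return adjusted


# Hypothetical foreground objects; a silent object occludes only the "dog".
foreground = {
    "dog": [1.0, -0.5, 0.25, 0.0],
    "car": [0.5, 0.5, 0.0, -0.25],
}
metadata_factors = {"dog": 0.5}
parallax_adjusted = apply_transmission_factors(foreground, metadata_factors)
print(parallax_adjusted["dog"])
```

Because the mapping from object to HOA coefficients is linear, scaling the coefficients scales the contribution of that object to the rendered soundfield by the same gain.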

[0052] According to the techniques of this disclosure, the audio encoding device 20 may classify the audio objects 9, as expressed by the HOA coefficients 11, into foreground objects and background objects. For instance, the audio encoding device 20 may implement aspects of this disclosure to identify a silent object (also referred to as a silence object) based on a determination that the object is represented in the video data, but does not correspond to a pre-identified audio object. Although described with respect to the audio encoding device 20 performing the video analysis, a video encoding device (not shown) or a dedicated visual analysis device or unit may perform the classification of the silent object, providing the classification and transmission factors to the audio encoding device 20 for purposes of generating the metadata 23.

[0053] In the context of captured video and audio, the audio encoding device 20 may determine that an object does not correspond to a pre-identified audio object if the object is not equipped with a sensor. As used herein, the term “equipped with a sensor” may include scenarios where a sensor is attached (permanently or detachably) to an audio source, or placed within earshot (though not attached to) an audio source. If the sensor is not attached to the audio source but is positioned within earshot, then, in applicable scenarios, multiple audio sources that are within earshot of the sensor are considered to be “equipped” with the sensor. In a synthetic VR environment, the audio encoding device 20 may implement techniques of this disclosure to determine that an object does not correspond to a pre-identified audio object if the object in question does not map to any audio object in a predetermined list. In a combination recorded-synthesized VR or AR environment, the audio encoding device 20 may implement techniques of this disclosure to determine that an object does not correspond to a pre-identified audio object using one or both of the techniques described above.
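The two tests described above (sensor presence for captured content, list membership for synthetic content) can be sketched as follows. The `VideoObject` record, its fields, and the `is_silent` function are hypothetical names chosen for illustration; the disclosure does not define a data model.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class VideoObject:
    name: str
    has_sensor: bool = False        # captured content: sensor attached or within earshot
    audio_id: Optional[str] = None  # synthetic content: id of a mapped audio object, if any


def is_silent(obj: VideoObject, known_audio_ids: Set[str], captured: bool) -> bool:
    """Return True when the object does not correspond to a pre-identified audio object."""
    if captured:
        # Captured scene: an object with no sensor is treated as silent.
        return not obj.has_sensor
    # Synthetic scene: the object must map to an entry in the predetermined list.
    return obj.audio_id not in known_audio_ids


wall = VideoObject(name="wall")
singer = VideoObject(name="singer", has_sensor=True, audio_id="vocals")
known = {"vocals", "guitar"}

print(is_silent(wall, known, captured=True))
print(is_silent(singer, known, captured=False))
```

For a combined recorded-synthesized scene, one option (an assumption, since the text allows either or both tests) would be to treat an object as silent only when both tests agree.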

[0054] Moreover, the audio encoding device 20 may determine relative foreground location information that reflects a relationship between the location of the listener and the respective locations of the foreground audio objects represented by the HOA coefficients 11 in the bitstream 21. For instance, the audio encoding device 20 may determine the location of the “first person” aspect of the video capture or video synthesis for the VR experience, and may determine the relationship between the location of the “first person” and the respective video object corresponding to each respective foreground audio object of the 3D soundfield.

[0055] In some examples, the audio encoding device 20 may also use the relative foreground location information to determine relative location information between the listener location and a silent object that attenuates the energy of the foreground object. For instance, the audio encoding device 20 may apply a scaling factor to the relative foreground location information, to derive the distance between the listener location and the silent object that attenuates the energy of the foreground audio object. The scaling factor may range in value from zero to one, with a zero value indicating that the silent object is co-located or substantially co-located with the listener location, and with the value of one indicating that the silent object is co-located or substantially co-located with the foreground audio object.
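The scaling-factor relationship can be illustrated with a one-line calculation. This Python sketch assumes the relative foreground location information has already been reduced to a scalar listener-to-foreground distance; the function name `silent_object_distance` and the example values are hypothetical.

```python
def silent_object_distance(listener_to_foreground: float, scale: float) -> float:
    """Distance from the listener to the silent object.

    scale = 0 places the silent object at the listener location;
    scale = 1 places it at the foreground audio object.
    """
    if listener_to_foreground < 0.0:
        raise ValueError("distance must be non-negative")
    if not 0.0 <= scale <= 1.0:
        raise ValueError("scaling factor must lie in [0, 1]")
    return scale * listener_to_foreground


# A foreground object 10 m away with a silent object a quarter of the way along.
print(silent_object_distance(10.0, 0.25))  # 2.5
```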

[0056] In some instances, the audio encoding device 20 may signal the relative foreground location information and/or the listener location-to-silent object distance information to the audio decoding device 24. In other examples, the audio encoding device 20 may signal the listener location information and the foreground audio object location information to the audio decoding device 24, thereby enabling the audio decoding device 24 to derive the relative foreground location information and/or the distance from the listener location to the silent object that attenuates the energy/directional data of the foreground audio object. While the metadata 23 and the bitstream 21 are illustrated in FIG. 2A as being signaled separately by the audio encoding device 20 as an example, it will be appreciated that, in some examples, the bitstream 21 may include portions or an entirety of the metadata 23. One or both of the audio encoding device 20 or the audio decoding device 24 may conform to a 3D audio standard, such as “Information technology–High efficiency coding and media delivery in heterogeneous environments” (ISO/IEC JTC 1/SC 29) or simply, the “MPEG-H” standard.

[0057] While shown in FIG. 2A as being directly transmitted to the content consumer device 14, the content creator device 12 may output the bitstream 21 to an intermediate device positioned between the content creator device 12 and the content consumer device 14. The intermediate device may store the bitstream 21 for later delivery to the content consumer device 14, which may request the bitstream. The intermediate device may comprise a file server, a web server, a desktop computer, a laptop computer, a tablet computer, a mobile phone, a smart phone, or any other device capable of storing the bitstream 21 for later retrieval by an audio decoder. The intermediate device may reside in a content delivery network capable of streaming the bitstream 21 (and possibly in conjunction with transmitting a corresponding video data bitstream) to subscribers, such as the content consumer device 14, requesting the bitstream 21.

[0058] Alternatively, the content creator device 12 may store the bitstream 21 to a storage medium, such as a compact disc, a digital video disc, a high definition video disc or other storage media, most of which are capable of being read by a computer and therefore may be referred to as computer-readable storage media or non-transitory computer-readable storage media. In this context, the transmission channel may refer to the channels by which content stored to the media is transmitted (and may include retail stores and other store-based delivery mechanisms). In any event, the techniques of this disclosure should not therefore be limited in this respect to the example of FIG. 2A.

[0059] As further shown in the example of FIG. 2A, the content consumer device 14 includes the audio playback system 16. The audio playback system 16 may represent any audio playback system capable of playing back multi-channel audio data. The audio playback system 16 may include a number of different renderers 22. The renderers 22 may each provide for a different form of rendering, where the different forms of rendering may include one or more of the various ways of performing vector-base amplitude panning (VBAP), and/or one or more of the various ways of performing soundfield synthesis. As used herein, “A and/or B” means “A or B”, or both “A and B”.

[0060] The audio playback system 16 may further include an audio decoding device 24. The audio decoding device 24 may represent a device configured to decode HOA coefficients 11’ from the bitstream 21, where the HOA coefficients 11’ may be similar to the HOA coefficients 11 but differ due to lossy operations (e.g., quantization) and/or transmission via the transmission channel. The audio playback system 16 may, after decoding the bitstream 21 to obtain the HOA coefficients 11’, render the HOA coefficients 11’ to output loudspeaker feeds 25. The loudspeaker feeds 25 may drive one or more loudspeakers (which are not shown in the example of FIG. 2A for ease of illustration purposes).

[0061] While described with respect to loudspeaker feeds 25, the audio playback system 16 may render headphone feeds from either the loudspeaker feeds 25 or directly from the HOA coefficients 11’, outputting the headphone feeds to headphone speakers. The headphone feeds may represent binaural audio speaker feeds, which the audio playback system 16 renders using a binaural audio renderer.

[0062] To select the appropriate renderer or, in some instances, generate an appropriate renderer, the audio playback system 16 may obtain loudspeaker information 13 indicative of a number of loudspeakers and/or a spatial geometry of the loudspeakers. In some instances, the audio playback system 16 may obtain the loudspeaker information 13 using a reference microphone and driving the loudspeakers in such a manner as to dynamically determine the loudspeaker information 13. In other instances or in conjunction with the dynamic determination of the loudspeaker information 13, the audio playback system 16 may prompt a user to interface with the audio playback system 16 and input the loudspeaker information 13.

[0063] The audio playback system 16 may then select one of the audio renderers 22 based on the loudspeaker information 13. In some instances, the audio playback system 16 may, when none of the audio renderers 22 are within some threshold similarity measure (in terms of the loudspeaker geometry) to the loudspeaker geometry specified in the loudspeaker information 13, generate one of the audio renderers 22 based on the loudspeaker information 13. The audio playback system 16 may, in some instances, generate one of the audio renderers 22 based on the loudspeaker information 13 without first attempting to select an existing one of the audio renderers 22. One or more speakers 3 may then play back the rendered loudspeaker feeds 25.
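The select-or-generate decision described above can be pictured with the following minimal sketch. It is an illustration only: the layout format (lists of azimuth/elevation pairs in degrees), the distance measure, and the names `geometry_distance` and `select_renderer` are assumptions made for this example and are not taken from the disclosure. A return value of `None` stands for the case in which no stored renderer is within the threshold, so that a new renderer would be generated instead.

```python
import math


def geometry_distance(layout_a, layout_b):
    """Mean per-speaker angular offset between two layouts (hypothetical measure).

    Each layout is a list of (azimuth_deg, elevation_deg) tuples in matching
    order. Azimuth wrap-around is ignored to keep the sketch short.
    """
    if not layout_a or len(layout_a) != len(layout_b):
        return math.inf
    total = sum(
        math.hypot(a[0] - b[0], a[1] - b[1]) for a, b in zip(layout_a, layout_b)
    )
    return total / len(layout_a)


def select_renderer(renderers, measured_layout, threshold):
    """Return the name of the closest stored renderer, or None if none is
    within the threshold (meaning a renderer would have to be generated)."""
    best_name, best_distance = None, math.inf
    for name, reference_layout in renderers.items():
        distance = geometry_distance(reference_layout, measured_layout)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name if best_distance <= threshold else None


renderers = {
    "stereo": [(-30, 0), (30, 0)],
    "quad": [(-45, 0), (45, 0), (-135, 0), (135, 0)],
}
close_match = select_renderer(renderers, [(-28, 0), (31, 0)], threshold=5.0)
no_match = select_renderer(renderers, [(-90, 0), (90, 0)], threshold=5.0)
```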

[0064] The audio decoding device 24 may implement various techniques of this disclosure to perform parallax-based adjustments for the encoded representations of the audio objects received via the bitstream 21. For instance, the audio decoding device 24 may apply transmission factors included in the metadata 23 to one or more audio objects conveyed as encoded representations in the bitstream 21. In various examples, the audio decoding device 24 may attenuate the energies and/or adjust directional information with respect to the foreground audio objects, based on the transmission factors. In some examples, the audio decoding device 24 may also use the metadata 23 to obtain silent object location information and/or relative foreground location information that relates a listener’s location to the foreground audio objects’ respective locations. By attenuating the energy of the foreground audio objects and/or adjusting the directional information of the foreground audio objects using the transmission factors, the audio decoding device 24 may enable the content consumer device 14 to render audio data over the speakers 3 that provides a more realistic auditory experience as part of a VR experience that also provides video data and, optionally, other sensory data as well.

[0065] In some examples, the audio decoding device 24 may locally derive the relative foreground location information using information included in the metadata 23. For instance, the audio decoding device 24 may receive listener location information and foreground audio object locations in the metadata 23. In turn, the audio decoding device 24 may derive the relative foreground location information, such as by calculating a displacement between the listener location and the foreground audio location.

[0066] For example, the audio decoding device 24 may use a coordinate system to calculate the relative foreground location information, by using the coordinates of the listener location and the foreground audio locations as operands in a distance calculation function. In some examples, the audio decoding device 24 may also receive, as part of the metadata 23, a scaling factor that is applicable to the relative foreground location information. In some such examples, the audio decoding device 24 may apply the scaling factor to the relative foreground location information to calculate the distance between the listener location and a silent object that attenuates the energy or alters the directional information of the foreground audio object(s). While the metadata 23 and the bitstream 21 are illustrated in FIG. 2A as being received separately at the audio decoding device 24 as an example, it will be appreciated that, in some examples, the bitstream 21 may include portions or an entirety of the metadata 23.
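As a non-limiting illustration of the derivation described above, the following sketch computes a displacement and a distance from Cartesian coordinates and then applies a scaling factor. The Cartesian coordinate system, the Euclidean distance function, the multiplicative use of the scaling factor, and the function names are all assumptions made for this example; the disclosure does not fix any of them.

```python
import math


def relative_foreground_location(listener, source):
    """Return (displacement, distance) from the listener to a foreground source.

    Both positions are (x, y, z) tuples in a shared Cartesian frame (assumed).
    """
    displacement = tuple(s - l for s, l in zip(source, listener))
    distance = math.sqrt(sum(d * d for d in displacement))
    return displacement, distance


def silent_object_distance(listener, source, scaling_factor):
    """Distance from the listener to the silent object.

    Assumes the scaling factor from the metadata simply multiplies the
    listener-to-source distance (hypothetical interpretation).
    """
    _, distance = relative_foreground_location(listener, source)
    return scaling_factor * distance


displacement, distance = relative_foreground_location((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
so_distance = silent_object_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), 0.4)
```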

[0067] The system 10B shown in FIG. 2B is similar to the system 10A shown in FIG. 2A, except that an automobile 460 includes the microphones 5. As such, some of the techniques set forth in this disclosure may be performed in the context of automobiles.

[0068] The system 10C shown in FIG. 2C is similar to the system 10A shown in FIG. 2A, except that a remotely-piloted and/or autonomously controlled flying device 462 includes the microphones 5. The flying device 462 may for example represent a quadcopter, a helicopter, or any other type of drone. As such, the techniques set forth in this disclosure may be performed in the context of drones.

[0069] The system 10D shown in FIG. 2D is similar to the system 10A shown in FIG. 2A, except that a robotic device 464 includes the microphones 5. The robotic device 464 may for example represent a device that operates using artificial intelligence, or other types of robots. In some examples, the robotic device 464 may represent a flying device, such as a drone. In other examples, the robotic device 464 may represent other types of devices, including those that do not necessarily fly. As such, the techniques set forth in this disclosure may be performed in the context of robots.

[0070] FIG. 3 is a diagram illustrating a six degree-of-freedom (6-DOF) head movement scheme for VR and/or AR applications. Aspects of this disclosure address the rendering of 3D audio content in scenarios in which a listener receives 3D audio content, and in which the listener moves within the 6-DOF confines illustrated in FIG. 3. In various examples, the listener may receive the 3D audio content by way of a device, such as in situations where the 3D audio content has been recorded and/or transmitted to a VR headset or AR HMD worn by the listener. In the example of FIG. 3, the listener may move his/her head according to rotation (e.g., as expressed by the pitch, yaw, and roll axes). The audio decoding device 24 illustrated in FIG. 2A may implement conventional HOA rendering to address head rotation along the pitch, yaw, and roll axes.

[0071] As shown in FIG. 3 however, the 6-DOF scheme includes three additional movement lines. More specifically, the 6-DOF scheme of FIG. 3 includes, in addition to the rotation axes discussed above, three lines along which the user’s head position may translationally move, or actuate. The three translational directions are left-right (L/R), up-down (U/D), and forward-backward (F/B). The audio encoding device 20 and/or the audio decoding device 24 may use various techniques of this disclosure to implement parallax handling, to address the three translational directions. For instance, the audio decoding device 24 may apply one or more transmission factors to adjust the energies and/or directional information of various foreground audio objects to implement parallax adjustments based on the 6-DOF range of motion of a VR/AR user.

[0072] FIGS. 4A-4D are diagrams illustrating an example of parallax issues that may be presented in a VR scene 30. In the example of VR scene 30A of FIG. 4A, the listener’s virtual position moves according to the first person account captured at or synthesized with respect to positions A, B, and C. At each of virtual positions A, B, and C, the listener may hear foreground audio objects associated with sounds emanating from the lion depicted at the right of FIG. 4A. Additionally, at each of virtual positions A, B, and C, the listener may hear foreground audio objects associated with sounds emanating from the running person depicted in the middle of FIG. 4A. Moreover, in a corresponding real-life situation, at each of virtual positions A, B, and C, the listener may hear a different soundfield, due to different directional information and different occlusion or masking characteristics.

[0073] The different occlusion/masking characteristics at each of virtual positions A, B, and C are illustrated in the left column of FIG. 4A. At virtual position A, the lion is roaring (e.g., producing foreground audio objects) behind and to the left of the running person. The audio encoding device 20 may perform beamforming to encode the aspects of the 3D soundfield experienced at virtual position A due to the interference of foreground audio objects (e.g., yelling) emanating from the position of the running person with the foreground audio objects (e.g., roaring) emanating from the position of the lion.

[0074] At virtual position B, the lion is roaring directly behind the running person. That is, the foreground audio objects related to the lion’s roar are masked, to some degree, by the occlusion caused by the running person as well as by the masking caused by the yelling of the running person. The audio encoding device 20 may perform the masking based on the relative position of the listener (at the virtual position B) and the lion, as well as the distance between the running person and the listener (at the virtual position B).

[0075] For instance, the closer the running person is to the lion, the lesser the masking that the audio encoding device 20 may apply to the foreground audio objects of the lion’s roar. The closer the running person is to the virtual position B where the listener is positioned, the greater the masking that the audio encoding device 20 may apply to the foreground audio objects of the lion’s roar. The audio encoding device 20 may cease the masking to allow for some predetermined minimum energy with respect to the foreground audio objects of the lion’s roar. That is, techniques of this disclosure enable the audio encoding device 20 to assign at least a minimum energy to the foreground audio objects of the lion’s roar, regardless of how close the running person is to virtual position B, to accommodate some level of the lion’s roar that will be heard at virtual position B.
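The distance-dependent masking with a minimum-energy floor described above may be pictured with the following sketch. The ratio-based formula, the default floor value, and the name `transmission_factor` are assumptions chosen only to reproduce the qualitative behavior (more masking as the occluding object nears the listener, less as it nears the source, never below a predetermined minimum); the disclosure does not specify a formula.

```python
def transmission_factor(dist_listener_to_occluder, dist_listener_to_source,
                        min_factor=0.1):
    """Hypothetical gain in [min_factor, 1] applied to an occluded source.

    A value near 1 means little masking (occluder close to the source);
    a value near min_factor means heavy masking (occluder close to the
    listener). The floor keeps some of the source audible in every case.
    """
    if dist_listener_to_source <= 0:
        raise ValueError("source distance must be positive")
    ratio = dist_listener_to_occluder / dist_listener_to_source
    ratio = min(max(ratio, 0.0), 1.0)  # clamp to the valid range
    return max(min_factor, ratio)


near_source = transmission_factor(9.0, 10.0)    # occluder next to the source
near_listener = transmission_factor(0.5, 10.0)  # occluder next to the listener
```

In this sketch the floor plays the role of the predetermined minimum energy: however close the running person comes to virtual position B, the factor applied to the lion's roar never drops below `min_factor`.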

[0076] FIG. 4B illustrates the foreground audio objects’ paths from the respective sources to virtual position A. Virtual scene 30B of FIG. 4B illustrates that the listener, at virtual position A, hears the lion’s roar coming from behind and to the left of the running person.

[0077] FIG. 4C illustrates the foreground audio objects’ paths from the respective sources to virtual position C. Virtual scene 30C of FIG. 4C illustrates that the listener, at virtual position C, hears the lion’s roar coming from behind and to the right of the running person.

[0078] FIG. 4D illustrates the foreground audio objects’ paths from the respective sources to virtual position B. Virtual scene 30D of FIG. 4D illustrates that the listener, at virtual position B, hears the lion’s roar coming from directly behind the running person. In the case of virtual scene 30D illustrated in FIG. 4D, the audio encoding device 20 may implement masking based on all three of the listener’s virtual position, the running person’s position, and the lion’s position being co-linear. For instance, the audio encoding device may adjust the loudness of the running person’s yelling as well as the lion’s roar based on the respective distances between every two of the three illustrated objects. For instance, the lion’s roar may be masked by the sound of the running person’s yell, as well as by the occlusion or physical blocking of the running person’s body. The audio encoding device 20 may form various transmission factors based on the criteria discussed above, and may signal the transmission factors to the audio decoding device 24 within the metadata 23.

[0079] In turn, the audio decoding device 24 may apply the transmission factors in rendering the foreground audio objects associated with the lion’s roar, to attenuate the loudness of the lion’s roar based on the audio masking and physical occlusion caused by the running person. Additionally, the audio decoding device 24 may adjust the directional data of the foreground audio objects of the lion’s roar, to account for the occlusion. For instance, the audio decoding device 24 may adjust the foreground audio objects of the lion’s roar to simulate an experience at virtual position B in which the lion’s roar is heard, at an attenuated loudness, from above and around the position of the running person’s body.

[0080] FIGS. 5A and 5B are diagrams illustrating another example of parallax issues that may be presented in a VR scene 40. In the example of VR scene 40A of FIG. 5A, the foreground audio objects of the lion’s roar are, at some virtual positions, further occluded by the presence of a wall. In the example of FIG. 5A, the dimensions (e.g., width) of the wall prevent the wall from occluding the foreground audio objects of the lion’s roar at virtual position A. However, the dimensions of the wall cause occlusion of the foreground audio objects of the lion’s roar at virtual position B. In the left panel of FIG. 5A, the 3D soundfield effect at virtual position B is illustrated with a minimal display of the lion, to illustrate that some minimum energy is assigned to the foreground audio objects of the lion’s roar, because some volume of the lion’s roar can be heard at virtual position B, due to sound waves traveling over and (in some cases) around the wall.

[0081] The wall represents a “silent object” in the context of the techniques of this disclosure. As such, the presence of the wall is not directly indicated by audio objects captured by the microphones 5. Instead, the audio encoding device 20 may infer the locations of occlusion caused by the wall by leveraging video data captured by one or more cameras of (or coupled to) the content creator device 12. For instance, the audio encoding device 20 may translate the video scene position of the wall to audio position data, to represent the silent object (“SO”) using HOA coefficients. Using the positional information of the SO derived in this fashion, the audio encoding device may form transmission factors with respect to the foreground audio objects of the lion’s roar, with respect to the virtual position B.

[0082] Moreover, based on the relative positioning of the running person to the virtual position B and the SO, the audio encoding device 20 may not form transmission factors with respect to foreground audio objects of the yell of the running person. As shown, the SO is not positioned in such a way as to occlude the foreground audio objects of the running person with respect to the virtual position B. The audio encoding device 20 may signal the transmission factors (with respect to the foreground audio objects of the lion’s roar) in the metadata 23 to the audio decoding device 24.

[0083] In turn, the audio decoding device 24 may apply the transmission factors received in the metadata 23 to the foreground audio objects associated with the lion’s roar, with respect to a “sweet spot” position at virtual position B. By applying the transmission factors to the foreground audio objects of the lion’s roar at the virtual position B, the audio decoding device 24 may attenuate the energy assigned to the foreground audio objects of the lion’s roar, thereby simulating the occlusion caused by the presence of the SO. In this manner, the audio decoding device 24 may implement the techniques of this disclosure to apply transmission factors to render the 3D soundfield to provide a more accurate VR experience to a user of the content consumer device 14.

[0084] FIG. 5B illustrates virtual scene 40B, which includes the various features discussed with respect to the virtual scene 40A with respect to FIG. 5A, with additional details. For instance, the virtual scene 40B of FIG. 5B includes a source of background audio objects. In the example illustrated in FIG. 5B, the audio encoding device 20 may classify audio objects into SOs, foreground (FG) audio objects, and background (BG) audio objects. For instance, the audio encoding device 20 may identify a SO as an object that is represented in a video scene, but is not associated with any pre-identified audio object.

[0085] The audio encoding device 20 may identify a FG object as an audio object that is represented by an audio object in an audio frame, and is also associated with a pre-identified audio object. The audio encoding device 20 may identify a BG object as an audio object that is represented by an audio object in an audio frame, but is not associated with any pre-identified audio object. As used herein, an audio object may be associated with a pre-identified audio object if the audio object is associated with an object that is equipped with a sensor (in case of captured audio/video) or maps to an object in a predetermined list (e.g., in case of synthetic audio/video). The BG audio objects may not change or translate based on listener moving between virtual positions A-C. As discussed above, the SO may not generate audio objects of its own, but is used by the audio encoding device 20 to determine transmission factors for the attenuation of the FG objects. As such, the audio encoding device 20 may represent the FG and BG objects separately in the bitstream 21. As discussed above, the audio encoding device 20 may represent the transmission factors derived from the SO in the metadata 23.

[0086] FIGS. 6A-6D are flow diagrams illustrating various encoder-side techniques of this disclosure. FIG. 6A illustrates an encoding process 50A that the audio encoding device 20 may perform in an instance where the audio encoding device 20 processes a live recording, and in which the audio encoding device 20 performs compression and transmission functions. In the example of process 50A, the audio encoding device may process audio data captured via the microphones 5, and may also leverage data extracted from video data captured via one or more cameras. In turn, the audio encoding device 20 may classify the audio objects represented by the HOA coefficients 11 into FG objects, BG objects, and SOs. In turn, the audio encoding device 20 may compress the audio objects (e.g., by removing redundancies from the HOA coefficients 11), and transmit the bitstream 21 to represent the FG objects and BG objects. The audio encoding device 20 may also transmit the metadata 23 to represent transmission factors that the audio encoding device derives using the SOs.

[0087] As shown in the legend 52 of FIG. 6A, the audio encoding device may transmit the following data:

[0088] F.sub.i: ith FG audio signal (person and lion), where i=1, … , I

[0089] V(r.sub.i, .theta..sub.i, .PHI..sub.i): ith directional vector (from a distance, azimuth, elevation)

[0090] B.sub.j: jth BG audio signal (ambient sound from safari), where j=1, … , J

[0091] S.sub.k: location of a kth SO, where k=1, … , K

[0092] In various examples, the audio encoding device 20 may transmit one or more of the V vector calculation (with its parameters/arguments), and the S.sub.k value in the metadata 23. The audio encoding device may transmit the values of F.sub.i and B.sub.j in the bitstream 21.

[0093] FIG. 6B is a flowchart illustrating an encoding process 50B that the audio encoding device 20 may perform. As in the case of process 50A of FIG. 6A, process 50B represents a process in which the audio encoding device 20 encodes the bitstream 21 and the metadata 23 using live capture data from the microphones 5 and one or more cameras. In contrast to process 50A of FIG. 6A, process 50B represents a process in which the audio encoding device 20 does not perform compression operations before transmitting the bitstream 21 and the metadata 23. Alternatively, process 50B may also represent an example in which the audio encoding device does not perform transmission, but instead, communicates the bitstream 21 and the metadata 23 to decoding components within an integrated VR device that also includes the audio encoding device 20.

[0094] FIG. 6C is a flowchart illustrating an encoding process 50C that the audio encoding device 20 may perform. In contrast to processes 50A and 50B of FIGS. 6A and 6B, process 50C represents a process in which the audio encoding device 20 uses synthetic audio and video data, instead of live-capture data.

[0095] FIG. 6D is a flowchart illustrating an encoding process 50D that the audio encoding device 20 may perform. Process 50D represents a process in which the audio encoding device 20 uses a combination of live-captured and synthetic audio and video data.

[0096] FIG. 7 is a flowchart illustrating a decoding process 70 that the audio decoding device 24 may perform, in accordance with aspects of this disclosure. The audio decoding device 24 may receive the bitstream 21 and the metadata 23 from the audio encoding device 20. In various examples, the audio decoding device 24 may receive the bitstream 21 and the metadata 23 via transmission, or via internal communication if the audio encoding device 20 is included within an integrated VR device that also includes the audio decoding device 24. The audio decoding device 24 may decode the bitstream 21 and the metadata 23 to reconstruct the following data, which are described above with respect to the legend 52 of FIGS. 6A-6D:

[0097] {F.sub.1, … , F.sub.I}

[0098] {V(r.sub.1, .theta..sub.1, .PHI..sub.1), … , V(r.sub.I, .theta..sub.I, .PHI..sub.I)}

[0099] {B.sub.1, … , B.sub.J}

[0100] {S.sub.1, … , S.sub.K}

[0101] In turn, the audio decoding device 24 may combine data indicating the user location estimation with the FG object location and directional vector calculations, the FG object attenuation (via application of the transmission factors), and the BG object translation calculations. In FIG. 7, the formula .rho..sub.i.ident..rho..sub.i(f, F.sub.1, … , F.sub.I, B.sub.1, … , B.sub.J, S.sub.1, … , S.sub.K) represents the attenuation of an i.sup.th FG object, using the transmission factors received in the metadata 23. In turn, the audio decoding device 24 may render an audio scene of the 3D soundfield by solving the following equation:

$$H = \sum_{i=1}^{I} \rho_i F_i V(\bar{r}_i, \bar{\theta}_i, \bar{\phi}_i)^T + \sum_{j=1}^{J} B_j T_j^T$$

[0102] As shown, the audio decoding device 24 may calculate one summation with respect to FG objects, and a second summation with respect to BG objects. With respect to the FG object summation, the audio decoding device 24 may apply the transmission factor .rho. for an i.sup.th object to a product of the FG audio signal for the i.sup.th object and the directional vector calculation for the i.sup.th object. In turn, the audio decoding device 24 may perform a summation of the resulting product values for a series of values of i.

[0103] With respect to the BG objects, the audio decoding device 24 may calculate a product of the j.sup.th BG audio signal and the corresponding translation factor for the j.sup.th BG audio signal. In turn, the audio decoding device 24 may add the FG object-related summation value and the BG object-related summation value to calculate H, for rendering of the 3D soundfield.
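The two summations described above can be sketched as follows. This is a minimal illustration under stated assumptions: each audio signal is treated as a list of samples, each directional vector and each translation factor is treated as a list of per-coefficient weights, and each product of a signal with a transposed vector is taken to be an outer product yielding a samples-by-coefficients matrix. The helper names `outer`, `add_into`, and `render_soundfield` are hypothetical.

```python
def outer(signal, vector):
    """Outer product: one row per sample, one column per coefficient."""
    return [[s * v for v in vector] for s in signal]


def add_into(accumulator, matrix, scale=1.0):
    """Add scale * matrix into accumulator in place."""
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            accumulator[r][c] += scale * value


def render_soundfield(rho, fg_signals, fg_vectors, bg_signals, bg_translations):
    """Compute H as the FG summation plus the BG summation.

    rho[i] is the transmission factor of the i-th FG object,
    fg_vectors[i] its directional vector, and bg_translations[j]
    the translation factor of the j-th BG signal (all assumed to be lists).
    """
    first_signal = fg_signals[0] if fg_signals else bg_signals[0]
    first_vector = fg_vectors[0] if fg_vectors else bg_translations[0]
    H = [[0.0] * len(first_vector) for _ in range(len(first_signal))]
    for rho_i, f_i, v_i in zip(rho, fg_signals, fg_vectors):
        add_into(H, outer(f_i, v_i), rho_i)  # FG term, attenuated by rho_i
    for b_j, t_j in zip(bg_signals, bg_translations):
        add_into(H, outer(b_j, t_j))         # BG term, no attenuation
    return H


H = render_soundfield(
    rho=[0.5],
    fg_signals=[[1.0, 2.0]],
    fg_vectors=[[1.0, 0.0]],
    bg_signals=[[4.0, 4.0]],
    bg_translations=[[0.0, 1.0]],
)
```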

[0104] FIG. 8 is a diagram illustrating an object classification mechanism that the audio encoding device 20 may implement to categorize SOs, FG objects, and BG objects, in accordance with aspects of this disclosure. The particular example of FIG. 8 is directed to an example in which the video data and the audio data are captured live, using the microphones 5 and various cameras. The audio encoding device 20 may classify an object as a SO if the object satisfies two conditions, namely, (i) the object appears only in a video scene (i.e., is not represented in the corresponding audio scene), and (ii) no sensor is attached to the object. In the example illustrated in FIG. 8, the wall is a SO. In the example of FIG. 8, the audio encoding device 20 may classify an object as a FG object if the object satisfies two conditions, namely, (i) the object appears in an audio scene, and (ii) a sensor is attached to the object. In the example of FIG. 8, the audio encoding device 20 may classify an object as a BG object if the object satisfies two conditions, namely, (i) the object appears in an audio scene, and (ii) no sensor is attached to the object.

[0105] Again, the specific example of FIG. 8 is directed to scenarios in which SOs, FG objects, and BG objects are identified using information on whether a sensor is attached to the object. That is, FIG. 8 may be an example of object classification techniques that the audio encoding device 20 may use in cases of live capture of video data and audio data for a VR/MR/AR experience. In other examples, such as if the video and/or audio data are synthetic, as in some aspects of VR/MR/AR experiences, the audio encoding device 20 may classify the SOs, FG objects, and the BG objects based on whether or not the audio objects map to a pre-identified audio object in a list.
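The classification rules described above may be summarized in the following sketch. A single boolean, `pre_identified`, stands in for either criterion (a sensor attached to the object in the live-capture case, or a match in a predetermined list in the synthetic case). The enum, the function name, and the `None` result for objects that fit none of the three categories are assumptions made for this example.

```python
from enum import Enum


class ObjectClass(Enum):
    SILENT = "SO"
    FOREGROUND = "FG"
    BACKGROUND = "BG"


def classify_object(in_audio_scene, in_video_scene, pre_identified):
    """Classify an object as SO, FG, or BG.

    pre_identified: True if a sensor is attached (live capture) or the
    object maps to an entry in a predetermined list (synthetic content).
    Returns None when the object matches none of the three categories.
    """
    if in_audio_scene:
        # Represented in the audio frame: FG if pre-identified, else BG.
        return ObjectClass.FOREGROUND if pre_identified else ObjectClass.BACKGROUND
    if in_video_scene and not pre_identified:
        # Visible but silent and not pre-identified: silent object.
        return ObjectClass.SILENT
    return None


wall = classify_object(in_audio_scene=False, in_video_scene=True, pre_identified=False)
lion = classify_object(in_audio_scene=True, in_video_scene=True, pre_identified=True)
ambience = classify_object(in_audio_scene=True, in_video_scene=False, pre_identified=False)
```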

[0106] FIG. 9A is a diagram illustrating an example of stitching of audio/video capture data from multiple microphones and cameras, in accordance with aspects of this disclosure.

[0107] FIG. 9B is a flowchart illustrating a process 90 that includes encoder- and decoder-side operations of parallax adjustments with stitching and interpolation, in accordance with aspects of this disclosure. The process 90 may generally correspond to a combination of the process 50A of FIG. 6A with respect to the operations of the audio encoding device 20 and the process 70 of FIG. 7 with respect to the operations of the audio decoding device 24. However, as shown in FIG. 9B, the process 90 includes data from multiple locations, such as locations L1 and L2. Moreover, the audio encoding device 20 performs stitching along with joint compression and transmission, and the audio decoding device 24 performs interpolation of multiple audio/video scenes at the listener or user location. For instance, to perform the interpolation, the audio decoding device 24 may use point clouds. In various examples, the audio decoding device 24 may use the point clouds to interpolate the listener location between multiple candidate listener locations. For instance, the audio decoding device 24 may receive various listener location candidates in the bitstream 21.

[0108] FIG. 9C is a diagram illustrating the capture of FG objects and BG objects at multiple locations.

[0109] FIG. 9D illustrates a mathematical expression of an interpolation technique that the audio decoding device 24 may perform, in accordance with aspects of this disclosure. The audio decoding device 24 may perform the interpolation operations of FIG. 9D as a reciprocal operation to stitching operations performed by the audio encoding device 20. For instance, to perform stitching operations of this disclosure, the audio encoding device 20 may rearrange FG objects of the 3D soundfield in such a way that a foreground signal F.sub.i at a location L.sub.1 and a foreground signal F.sub.j at a location L.sub.2 both originate from the same FG object, if i=j. The audio encoding device 20 may implement one or more sound identification and/or image identification algorithms to check or verify the identity of each FG object. Moreover, the audio encoding device 20 may perform the stitching operations not only with respect to the FG objects, but with respect to other parameters, as well.

[0110] As shown in FIG. 9D, the audio decoding device may perform the interpolation operations of this disclosure according to the following equations:

F.sub.i=.alpha.F.sub.i(L.sub.1)+(1-.alpha.)F.sub.i(L.sub.2)

B.sub.i=.alpha.B.sub.i(L.sub.1)+(1-.alpha.)B.sub.i(L.sub.2)

[0111] That is, the equations presented above are applicable to FG and BG object-based calculations, such as the foreground and background signals applicable for a particular location i. In terms of the directional vectors and the silent objects at various locations, the audio decoding device 24 may perform the interpolation operations of this disclosure according to the following equations:

{V(r.sub.1,.theta..sub.1,.PHI..sub.1), … ,V(r.sub.I,.theta..sub.I,.PHI..sub.I)}

{S.sub.1, … ,S.sub.K}

[0112] Aspects of the silent object interpolation may be calculated by the following operations, as illustrated in FIG. 9D:

[(sin .theta..sub.1)/L.sub.1]=[(sin .theta..sub.2)/L.sub.2]=[(sin .theta..sub.3)/L.sub.3]
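The weighted combination given for the FG and BG signals in paragraph [0110] can be sketched as follows. The sketch assumes that .alpha. lies between 0 and 1 and that the two signals have already been matched by stitching and have equal length; how .alpha. is derived from the listener location is not shown, and the function name is hypothetical.

```python
def interpolate_signal(signal_l1, signal_l2, alpha):
    """Sample-wise alpha * x(L1) + (1 - alpha) * x(L2).

    alpha = 1 reproduces the signal captured at L1 and alpha = 0 the
    signal captured at L2 (the range [0, 1] is an assumption).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    if len(signal_l1) != len(signal_l2):
        raise ValueError("signals must have the same number of samples")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(signal_l1, signal_l2)]


# The same helper applies to an FG signal F_i or a BG signal B_i.
blended = interpolate_signal([1.0, 0.0], [0.0, 1.0], alpha=0.25)
```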

[0113] FIG. 9E is a diagram illustrating an application of point cloud-based interpolation that the audio decoding device 24 may implement, in accordance with aspects of this disclosure. The audio decoding device 24 may use the point clouds (denoted by rings in FIG. 9E) to obtain a sampling (e.g., a dense sampling) of 3D space with audio and video signals. For instance, the received bitstream 21 may represent audio and video data captured from multiple locations {L.sub.q}.sub.q=1, … , Q where the audio encoding device 20 has stitched and performed joint compression and interpolation with adjacent data from the user location L*. In the example illustrated in FIG. 9E, the audio decoding device 24 may use data of four capture locations (positioned within the rectangle with rounded corners), to generate or reconstruct the virtually captured data at the user location L*.

[0114] FIG. 10 is a diagram illustrating aspects of an HOA domain calculation of attenuation of foreground audio objects that the audio decoding device 24 may perform, in accordance with aspects of this disclosure. In the example of FIG. 10, the audio decoding device 24 may use an HOA order of four (4), thereby using a total of twenty-five (25) HOA coefficients. As illustrated in FIG. 10, the audio decoding device 24 may use an audio frame size of 1,280 samples.
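The coefficient count stated above follows from the standard relationship for a full three-dimensional HOA representation, in which an order N yields (N+1) squared coefficients. The short sketch below reproduces the figures of this example; the function names and the coefficients-by-samples frame layout are assumptions made for illustration.

```python
def hoa_coefficient_count(order):
    """Number of coefficients of a full 3D HOA representation of a given order."""
    if order < 0:
        raise ValueError("HOA order must be non-negative")
    return (order + 1) ** 2


def frame_shape(order, frame_size):
    """(coefficients, samples) shape of one HOA frame (assumed layout)."""
    return (hoa_coefficient_count(order), frame_size)


coefficients = hoa_coefficient_count(4)
shape = frame_shape(4, 1280)
```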

[0115] FIG. 11 is a diagram illustrating aspects of transmission factor calculations that the audio encoding device 20 may perform, in accordance with one or more techniques of this disclosure.

[0116] FIG. 12 is a diagram illustrating a process 1200 that may be performed by an integrated encoding/rendering device, in accordance with aspects of this disclosure. According to the process 1200, the integrated device may include both the audio encoding device 20 and the audio decoding device 24, and optionally, other components and/or devices discussed herein. As such, the process 1200 of FIG. 12 does not include compression or transmission steps, because the audio encoding device 20 may communicate the bitstream 21 and the metadata 23 to the audio decoding device 24 using internal communication channels within the integrated device, such as a communication bus architecture of the integrated device.

[0117] FIG. 13 is a flowchart illustrating a process 1300 that an audio encoding device or an integrated encoding/rendering device may perform, in accordance with aspects of this disclosure. Process 1300 may begin when one or more microphone arrays capture audio objects of a 3D soundfield (1302). In turn, processing circuitry of the audio encoding device may obtain, from the microphone array(s), the audio objects of the 3D soundfield, where each audio object is associated with a respective audio scene of the audio data captured by the microphone array(s) (1304). The processing circuitry of the audio encoding device may determine that a video object included in a first video scene is not represented by any corresponding audio object in a first audio scene that corresponds to the first video scene (1306).

[0118] The processing circuitry of the audio encoding device may determine that the video object is not associated with any pre-identified audio object (1308). In turn, responsive to the determinations that the video object is not represented by any corresponding audio object in the first audio scene and that the video object is not associated with any pre-identified audio object, the processing circuitry of the audio encoding device may identify the video object as a silent object (1310).
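The two determinations (1306) and (1308) and the resulting identification (1310) can be sketched as a simple predicate. This is a minimal illustration under stated assumptions, not the encoder's actual logic: the `VideoObject` type, its `linked_audio_ids` field, and the use of string identifiers are all hypothetical, and how a video object is matched to audio objects in the first place is outside the sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class VideoObject:
    """Hypothetical record for a video object detected in a video scene."""
    object_id: str
    # IDs of audio objects that some upstream matching step linked to this video object.
    linked_audio_ids: FrozenSet[str] = frozenset()


def is_silent_object(video_object: VideoObject,
                     scene_audio_ids: Set[str],
                     pre_identified_ids: Set[str]) -> bool:
    """Return True when both determinations (1306) and (1308) hold."""
    # (1306) No corresponding audio object in the matching audio scene.
    represented = bool(video_object.linked_audio_ids & scene_audio_ids)
    # (1308) Not associated with any pre-identified audio object.
    pre_identified = bool(video_object.linked_audio_ids & pre_identified_ids)
    # (1310) Identify as silent only when both determinations are negative.
    return not represented and not pre_identified


wall = VideoObject("wall")
singer = VideoObject("singer", frozenset({"a1"}))
print(is_silent_object(wall, {"a1"}, {"p7"}))    # True
print(is_silent_object(singer, {"a1"}, {"p7"}))  # False
```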

[0119] As such, in some examples of this disclosure, an audio encoding device of this disclosure includes a memory device configured to store audio objects obtained from one or more microphone arrays with respect to a three-dimensional (3D) soundfield, wherein each obtained audio object is associated with a respective audio scene, and to store video data obtained from one or more video capture devices, the video data comprising one or more video scenes, each respective video scene being associated with a respective audio scene of the obtained audio data. The device further includes processing circuitry coupled to the memory device, the processing circuitry being configured to determine that a video object included in a first video scene is not represented by any corresponding audio object in a first audio scene that corresponds to the first video scene, to determine that the video object is not associated with any pre-identified audio object, and to identify, responsive to the determinations that the video object is not represented by any corresponding audio object in the first audio scene and that the video object is not associated with any pre-identified audio object, the video object as a silent object.

[0120] In some examples, the processing circuitry is further configured to determine that a first audio object included in obtained audio data is associated with a pre-identified audio object, and to identify, responsive to the determination that the audio object is associated with the pre-identified audio object, the first audio object as a foreground audio object. In some examples, the processing circuitry is further configured to determine that a second audio object included in obtained audio data is not associated with any pre-identified audio object, and to identify, responsive to the determination that the second audio object is not associated with any pre-identified audio object, the second audio object as a background audio object.
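The foreground/background split described here amounts to a membership test against the set of pre-identified audio objects. A minimal sketch, assuming objects are referred to by string identifiers (an assumption made only for illustration):

```python
from typing import Dict, Iterable, List, Set


def partition_audio_objects(object_ids: Iterable[str],
                            pre_identified_ids: Set[str]) -> Dict[str, List[str]]:
    """Split audio objects into foreground (pre-identified) and background (all others)."""
    groups: Dict[str, List[str]] = {"foreground": [], "background": []}
    for object_id in object_ids:
        key = "foreground" if object_id in pre_identified_ids else "background"
        groups[key].append(object_id)
    return groups


# Hypothetical scene: two sources are pre-identified, one is not.
print(partition_audio_objects(["singer", "crowd", "guitar"], {"singer", "guitar"}))
```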

[0121] In some examples, the processing circuitry is configured to determine that the first audio object is associated with a pre-identified audio object by determining that the first audio object is associated with an audio source that is equipped with one or more sensors. In some examples, the audio encoding device further includes the one or more microphone arrays coupled to the processing circuitry, the one or more microphone arrays being configured to capture the audio objects associated with the 3D soundfield. In some examples, the audio encoding device further includes the one or more video capture devices coupled to the processing circuitry, the one or more video capture devices being configured to capture the video data. The video capture devices may include, be, or be part of, the cameras illustrated in the drawings and described above with respect to the drawings. For example, the video capture devices may represent multiple (e.g., dual) cameras positioned such that the cameras capture video data or images of a scene from different perspectives. In some examples, the foreground audio object is included in the first audio scene that corresponds to the first video scene, and the processing circuitry is further configured to determine whether positional information of the silent object with respect to the first video scene causes attenuation of the foreground audio object.

[0122] In some examples, the processing circuitry is further configured to generate, responsive to determining that the silent object causes the attenuation of the foreground audio object, one or more transmission factors with respect to the foreground audio object, wherein the generated transmission factors represent adjustments with respect to the foreground audio object. In some examples, the generated transmission factors represent adjustments with respect to an energy of the foreground audio object. In some examples, the generated transmission factors represent adjustments with respect to directional characteristics of the foreground audio object. In some examples, the processing circuitry is further configured to transmit the transmission factors out of band with respect to a bitstream that includes the foreground audio object. In some examples, the generated transmission factors represent metadata with respect to the bitstream.

[0123] FIG. 14 is a flowchart illustrating an example process 1400 that an audio decoding device or an integrated encoding/decoding/rendering device may perform, in accordance with aspects of this disclosure. Process 1400 may begin when processing circuitry of the audio decoding device receives, in a bitstream, encoded representations of audio objects of a 3D soundfield (1402). Additionally, the processing circuitry of the audio decoding device may receive metadata associated with the bitstream (1404). It will be appreciated that the sequence illustrated in FIG. 14 is a non-limiting example, and that the processing circuitry of the audio decoding device may receive the bitstream and the metadata in any order, or in parallel, or partly in parallel.

[0124] The processing circuitry of the audio decoding device may obtain, from the received metadata, one or more transmission factors associated with one or more of the audio objects (1406). In addition, the processing circuitry of the audio decoding device may apply the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield (1408). The audio decoding device may further comprise a memory device coupled to the processing circuitry. The memory device may store at least a portion of the received bitstream, the received metadata, or the parallax-adjusted audio objects of the 3D soundfield. The processing circuitry of the audio decoding device may render the parallax-adjusted audio objects of the 3D soundfield to one or more speakers (1410). For instance, the processing circuitry of the audio decoding device may render the parallax-adjusted audio objects of the 3D soundfield into one or more speaker feeds that drive the one or more speakers.
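Steps (1406) and (1408) can be sketched as a lookup followed by a per-object adjustment. The sketch below assumes, purely for illustration, that each transmission factor is a single linear gain applied to an object's samples and that the metadata is a mapping from object identifier to factor. The disclosure also contemplates factors that adjust directional characteristics, which this sketch does not model.

```python
from typing import Dict, List


def apply_transmission_factors(audio_objects: Dict[str, List[float]],
                               metadata: Dict[str, float]) -> Dict[str, List[float]]:
    """Scale each object's samples by its transmission factor (assumed scalar gain)."""
    adjusted: Dict[str, List[float]] = {}
    for object_id, samples in audio_objects.items():
        # (1406) Obtain the factor from the metadata; objects without one pass through.
        factor = metadata.get(object_id, 1.0)
        # (1408) Apply the factor to obtain the parallax-adjusted object.
        adjusted[object_id] = [factor * sample for sample in samples]
    return adjusted


# Hypothetical decoded objects and metadata.
objects = {"voice": [1.0, -0.5], "ambience": [0.2, 0.2]}
print(apply_transmission_factors(objects, {"voice": 0.5}))
# {'voice': [0.5, -0.25], 'ambience': [0.2, 0.2]}
```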

[0125] In some examples of this disclosure, an audio decoding device includes processing circuitry configured to receive, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, to receive metadata associated with the bitstream, to obtain, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and to apply the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield. The device further includes a memory device coupled to the processing circuitry, the memory device being configured to store at least a portion of the received bitstream, the received metadata, or the parallax-adjusted audio objects of the 3D soundfield. In some examples, the processing circuitry is further configured to determine listener location information, and to apply the listener location information in addition to applying the transmission factors to the one or more audio objects. In some examples, the processing circuitry is further configured to apply relative foreground location information between the listener location information and respective locations associated with foreground audio objects of the one or more audio objects. In some examples, the processing circuitry is further configured to apply background translation factors that are calculated using respective locations associated with background audio objects of the one or more audio objects.

[0126] In some examples, the processing circuitry is further configured to apply foreground attenuation factors to respective foreground audio objects of the one or more audio objects. In some examples, the processing circuitry is further configured to determine a minimum transmission value for the respective foreground audio objects, to determine whether applying the transmission factors to the respective foreground audio objects produces an adjusted transmission value that is lower than the minimum transmission value, and to render, responsive to determining that the adjusted transmission value is lower than the minimum transmission value, the respective foreground audio objects using the minimum transmission value. In some examples, the processing circuitry is further configured to adjust an energy of the respective foreground audio objects. In some examples, the processing circuitry is further configured to attenuate respective energies of the respective foreground audio objects. In some examples, the processing circuitry is further configured to adjust directional characteristics of the respective foreground audio objects. In some examples, the processing circuitry is further configured to adjust parallax information of the respective foreground audio objects. In some examples, the processing circuitry is further configured to adjust the parallax information to account for one or more silent objects represented in a video stream associated with the 3D soundfield. In some examples, the processing circuitry is further configured to receive the metadata within the bitstream.
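The minimum transmission value acts as a floor on the adjusted transmission. A minimal sketch follows, assuming that several transmission factors for one foreground object combine multiplicatively. That combination rule is an assumption made for illustration and is not stated in the text above.

```python
from typing import Iterable


def effective_transmission(transmission_factors: Iterable[float],
                           minimum_transmission: float) -> float:
    """Return the transmission value to render with, floored at the minimum."""
    adjusted = 1.0
    for factor in transmission_factors:
        adjusted *= factor  # assumed: factors combine by multiplication
    # If the adjusted value falls below the floor, render with the minimum instead.
    return max(adjusted, minimum_transmission)


print(effective_transmission([0.5, 0.25], 0.2))  # 0.2 (0.125 is below the floor)
print(effective_transmission([0.5], 0.2))        # 0.5
```

With a floor in place, a foreground object occluded by several silent objects is attenuated but never rendered below the minimum transmission value.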

[0127] In some examples, the processing circuitry is further configured to receive the metadata out of band with respect to the bitstream. In some examples, the processing circuitry is further configured to output video data associated with the 3D soundfield to one or more displays. In some examples, the device further includes the one or more displays, the one or more displays being configured to receive the video data from the processing circuitry, and to output the received video data in visual form.

[0128] FIG. 15 is a flowchart illustrating an example process 1500 that an audio decoding device or an integrated encoding/decoding/rendering device may perform, in accordance with aspects of this disclosure. Process 1500 may begin when processing circuitry of the audio decoding device determines relative foreground location information between a listener location and respective locations associated with one or more foreground audio objects of a 3D soundfield (1502). For instance, the processing circuitry of the audio decoding device may be coupled or otherwise in communication with a memory of the audio decoding device.

……
……
……
