Patent: Audio capture and rendering for extended reality experiences
Publication Number: 20210004201
Publication Date: 2021-01-07
Applicant: Qualcomm
Abstract
In some examples, a content consumer device configured to play one or more of a plurality of audio streams includes a memory configured to store the plurality of audio streams and audio location information associated with the plurality of audio streams and representative of audio stream coordinates in an acoustical space where an audio stream was captured or synthesized or both. Each of the audio streams is representative of a soundfield. The content consumer device also includes one or more processors coupled to the memory, and configured to determine device location information representative of device coordinates of the content consumer device in the acoustical space. The one or more processors are configured to select, based on the device location information and the audio location information, a subset of the plurality of audio streams, and output, based on the subset of the plurality of audio streams, one or more speaker feeds.
Claims
-
A content consumer device configured to play one or more of a plurality of audio streams, the content consumer device comprising: a memory configured to store the plurality of audio streams and audio location information associated with the plurality of audio streams and representative of audio stream coordinates in an acoustical space where an audio stream was captured or audio stream coordinates in a virtual acoustical space where an audio stream was synthesized or both, each of the audio streams representative of a soundfield; and one or more processors coupled to the memory, and configured to: determine device location information representative of device coordinates of the content consumer device in the acoustical space; select, based on the device location information and the audio location information, a subset of the plurality of audio streams, the subset of the plurality of audio streams excluding at least one of the plurality of audio streams; and output, based on the subset of the plurality of audio streams, one or more speaker feeds.
-
The content consumer device of claim 1, wherein the one or more processors are further configured to: determine a proximity distance based on the device location information and the audio location information for at least one of the plurality of audio streams; and select, based on the proximity distance, the subset of the plurality of audio streams.
-
The content consumer device of claim 2, wherein the one or more processors are configured to: compare the proximity distance to a threshold proximity distance; and select, when the proximity distance is less than or equal to the threshold proximity distance, a larger number of the plurality of audio streams compared to when the proximity distance is greater than the threshold proximity distance to obtain the subset of the plurality of audio streams.
-
The content consumer device of claim 2, wherein the one or more processors are configured to: compare the proximity distance to a threshold proximity distance; and select, when the proximity distance is greater than the threshold proximity distance, a smaller number of the plurality of audio streams compared to when the proximity distance is less than or equal to the threshold proximity distance to obtain the subset of the plurality of audio streams.
-
The content consumer device of claim 1, wherein the one or more processors are further configured to: obtain a new audio stream and corresponding new audio location information; and update the subset of the plurality of audio streams to include the new audio stream.
-
The content consumer device of claim 1, wherein the one or more processors are further configured to: determine, based on the plurality of audio streams, an energy map representative of an energy of a common soundfield represented by the plurality of audio streams; and determine, based on the energy map, the device location information and the audio location information, the subset of the plurality of audio streams.
-
The content consumer device of claim 6, wherein the one or more processors are further configured to: analyze the energy map to determine an audio source location of an audio stream in the common soundfield; and determine, based on the audio source location, the device location information and the audio location information, a subset of the plurality of audio streams.
-
The content consumer device of claim 7, wherein the one or more processors are further configured to: determine an audio source distance as a distance between the audio source location and the device coordinates; compare the audio source distance to an audio source distance threshold; and select, when the audio source distance is greater than the audio source distance threshold, a single audio stream of the plurality of audio streams as the subset of the plurality of audio streams, the single audio stream associated with the audio stream coordinates having a shortest distance to the device coordinates.
-
The content consumer device of claim 7, wherein the one or more processors are configured to: determine an audio source distance as a distance between the audio source location and the device coordinates; compare the audio source distance to an audio source distance threshold; and select, when the audio source distance is less than or equal to the audio source distance threshold, multiple audio streams of the plurality of audio streams as the subset of the plurality of audio streams, the multiple audio streams being the subset of the plurality of audio streams with the audio stream coordinates surrounding the device coordinates.
-
The content consumer device of claim 1, wherein the one or more processors are further configured to: determine a first audio source distance as a distance between first audio stream coordinates for a first audio stream and the device coordinates; compare the first audio source distance to a first audio source distance threshold; select, when the first audio source distance is less than or equal to the first audio source distance threshold, the first audio stream of the plurality of audio streams; and output, based on the first audio stream, one or more speaker feeds, wherein the first audio stream is an only audio stream selected.
-
The content consumer device of claim 10, wherein the one or more processors are further configured to: determine a second audio source distance as a distance between second audio stream coordinates for a second audio stream and the device coordinates; compare the second audio source distance to a second audio source distance threshold; select, when both the first audio source distance is greater than the first audio source distance threshold and the second audio source distance is greater than the second audio source distance threshold, the first audio stream of the plurality of audio streams and the second audio stream of the plurality of audio streams; and output, based on the first audio stream and the second audio stream, one or more speaker feeds.
-
The content consumer device of claim 11, wherein the one or more processors are configured to combine the first audio stream and the second audio stream by at least one of adaptively mixing the first audio stream and the second audio stream or interpolating a third audio stream based on the first audio stream and the second audio stream.
-
The content consumer device of claim 12, wherein the one or more processors are configured to combine the first audio stream and the second audio stream by applying a function F(x) to the first audio stream and the second audio stream.
-
The content consumer device of claim 11, wherein the one or more processors are further configured to: determine whether the device coordinates have been steady relative to the first audio source distance threshold and the second audio source distance threshold for a predetermined period of time; and based on the device coordinates being steady relative to the first audio source distance threshold and the second audio source distance threshold for the predetermined period of time, select the first audio stream, the first audio stream and the second audio stream, or the second audio stream.
-
The content consumer device of claim 11, wherein the one or more processors are further configured to: select, when the second audio source distance is less than or equal to the second audio source distance threshold, the second audio stream of the plurality of audio streams; and output, based on the second audio stream, one or more speaker feeds, wherein the second audio stream is an only audio stream selected.
-
The content consumer device of claim 11, wherein the one or more processors are further configured to select a different audio stream based on the device coordinates changing.
-
The content consumer device of claim 10, wherein the one or more processors are further configured to provide an alert to a user based on the first audio source distance equaling the first audio source distance threshold, wherein the alert is at least one of a visual alert, an auditory alert, or a haptic alert.
-
The content consumer device of claim 1, wherein the audio stream coordinates in the acoustical space or the audio stream coordinates in the virtual acoustical space are coordinates in a displayed world in relation to which the corresponding audio stream was captured or synthesized.
-
The content consumer device of claim 18, wherein the content consumer device comprises an extended reality headset, and wherein the displayed world comprises a scene represented by video data captured by a camera.
-
The content consumer device of claim 18, wherein the content consumer device comprises an extended reality headset, and wherein the displayed world comprises a virtual world.
-
The content consumer device of claim 1, wherein the content consumer device comprises a mobile handset.
-
The content consumer device of claim 1, further comprising a transceiver configured to wirelessly receive the plurality of audio streams, wherein the transceiver is configured to wirelessly receive the plurality of audio streams in accordance with at least one of a fifth generation (5G) cellular standard, a personal area network standard or a local area network standard.
-
The content consumer device of claim 1, wherein the one or more processors are further configured to only decode the subset of the plurality of audio streams, in response to the selection.
-
The content consumer device of claim 1, wherein the one or more processors are further configured to: determine an audio source distance as a distance between an audio source in the acoustical space and the device coordinates; compare the audio source distance to an audio source distance threshold; and select, when the audio source distance is greater than the audio source distance threshold, a single audio stream of the plurality of audio streams as the subset of the plurality of audio streams, the single audio stream having a shortest audio source distance.
-
A method of playing one or more of a plurality of audio streams, the method comprising: storing, by a memory of a content consumer device, the plurality of audio streams and audio location information associated with the plurality of audio streams and representative of audio stream coordinates in an acoustical space where an audio stream was captured or audio stream coordinates in a virtual acoustical space where an audio stream was synthesized or both, each of the audio streams representative of a soundfield; and determining, by one or more processors of the content consumer device, device location information representative of device coordinates of the content consumer device in the acoustical space; selecting, by the one or more processors and based on the device location information and the audio location information, a subset of the plurality of audio streams, the subset of the plurality of audio streams excluding at least one of the plurality of audio streams; and outputting, by the one or more processors and based on the subset of the plurality of audio streams, one or more speaker feeds.
-
The method of claim 25, wherein selecting the subset of the plurality of audio streams comprises: determining a proximity distance based on the device location information and the audio location information for at least one of the plurality of audio streams; and selecting, based on the proximity distance, the subset of the plurality of audio streams.
-
The method of claim 26, wherein selecting the subset of the plurality of audio streams comprises: comparing the proximity distance to a threshold proximity distance; and selecting, when the proximity distance is less than or equal to the threshold proximity distance, a larger number of the plurality of audio streams compared to when the proximity distance is greater than the threshold proximity distance to obtain the subset of the plurality of audio streams.
-
The method of claim 26, wherein selecting the subset of the plurality of audio streams comprises: comparing the proximity distance to a threshold proximity distance; and selecting, when the proximity distance is greater than the threshold proximity distance, a smaller number of the plurality of audio streams compared to when the proximity distance is less than or equal to the threshold proximity distance to obtain the subset of the plurality of audio streams.
-
The method of claim 25, further comprising: obtaining a new audio stream and corresponding new audio location information; and updating the subset of the plurality of audio streams to include the new audio stream.
-
The method of claim 25, further comprising: determining, by the one or more processors and based on the plurality of audio streams, an energy map representative of an energy of a common soundfield represented by the plurality of audio streams; and determining, by the one or more processors and based on the energy map, the device location information and the audio location information, a subset of the plurality of audio streams.
-
The method of claim 30, wherein selecting the subset of the plurality of audio streams comprises: analyzing the energy map to determine audio stream coordinates of an audio source in the common soundfield; and determining, based on the audio source coordinates, the device location information and the audio location information, the subset of the plurality of audio streams.
-
The method of claim 31, wherein selecting the subset of the plurality of audio streams comprises: determining an audio source distance as a distance between the audio stream coordinates and the device coordinates; comparing the audio source distance to an audio source distance threshold; and selecting, when the audio source distance is greater than the audio source distance threshold, a single audio stream of the plurality of audio streams as the subset of the plurality of audio streams, the single audio stream having a shortest audio source distance.
-
The method of claim 31, wherein selecting the subset of the plurality of audio streams comprises: determining an audio source distance as a distance between the audio stream coordinates and the device coordinates; comparing the audio source distance to an audio source distance threshold; and selecting, when the audio source distance is less than or equal to the audio source distance threshold, multiple audio streams of the plurality of audio streams as the subset of the plurality of audio streams, the multiple audio streams being the subset of the plurality of audio streams with audio stream coordinates surrounding the device coordinates.
-
The method of claim 25, further comprising: determining, by the one or more processors, a first audio source distance as a distance between first audio stream coordinates for a first audio stream and the device coordinates; comparing, by the one or more processors, the first audio source distance to a first audio source distance threshold; selecting, by the one or more processors and when the first audio source distance is less than or equal to the first audio source distance threshold, the first audio stream of the plurality of audio streams; and outputting, by the one or more processors, based on the first audio stream, one or more speaker feeds, wherein the first audio stream is an only audio stream selected.
-
The method of claim 34, further comprising: determining, by the one or more processors, a second audio source distance as a distance between second audio stream coordinates for a second audio stream and the device coordinates; comparing, by the one or more processors, the second audio source distance to a second audio source distance threshold; selecting, by the one or more processors and when both the first audio source distance is greater than the first audio source distance threshold and the second audio source distance is greater than the second audio source distance threshold, the first audio stream of the plurality of audio streams and the second audio stream of the plurality of audio streams; and outputting, by the one or more processors and based on the first audio stream and the second audio stream, one or more speaker feeds.
-
The method of claim 35, further comprising combining, by the one or more processors, the first audio stream and the second audio stream by at least one of adaptively mixing the first audio stream and the second audio stream or interpolating a third audio stream based on the first audio stream and the second audio stream.
-
The method of claim 35, wherein the combining comprises applying a function F(x) to the first audio stream and the second audio stream.
-
The method of claim 35, further comprising: determining, by the one or more processors, whether the device coordinates have been steady relative to the first audio source distance threshold and the second audio source distance threshold for a predetermined period of time; and based on the device coordinates being steady relative to the first audio source distance threshold and the second audio source distance threshold for the predetermined period of time, selecting, by the one or more processors, the first audio stream, the first audio stream and the second audio stream, or the second audio stream.
-
The method of claim 35, further comprising: selecting, by the one or more processors and when the second audio source distance is less than or equal to the second audio source distance threshold, the second audio stream of the plurality of audio streams; and outputting, based on the second audio stream, one or more speaker feeds, wherein the second audio stream is an only audio stream selected.
-
The method of claim 35, further comprising selecting, by the one or more processors, a different audio stream based on the device coordinates changing.
-
The method of claim 34, further comprising providing an alert to a user based on the first audio source distance equaling the first audio source distance threshold, wherein the alert is at least one of a visual alert, an auditory alert, or a haptic alert.
-
The method of claim 25, wherein the content consumer device comprises an extended reality headset, and wherein a displayed world comprises a scene represented by video data captured by a camera.
-
The method of claim 25, wherein the content consumer device comprises an extended reality headset, and wherein a displayed world comprises a virtual world.
-
The method of claim 25, wherein the content consumer device comprises a mobile handset.
-
The method of claim 25, further comprising wirelessly receiving, by a transceiver module of the content consumer device, the plurality of audio streams, wherein wirelessly receiving the plurality of audio streams comprises wirelessly receiving the plurality of audio streams in accordance with a fifth generation (5G) cellular standard, a personal area network standard, or a local area network standard.
-
The method of claim 25, further comprising only decoding, by the one or more processors, the subset of the plurality of audio streams, in response to the selection.
-
The method of claim 25, further comprising: determining, by the one or more processors, an audio source distance as a distance between an audio source in the acoustical space and the device coordinates; comparing, by the one or more processors, the audio source distance to an audio source distance threshold; and selecting, by the one or more processors and when the audio source distance is greater than the audio source distance threshold, a single audio stream of the plurality of audio streams as the subset of the plurality of audio streams, the single audio stream having a shortest audio source distance.
-
A content consumer device configured to play one or more of a plurality of audio streams, the content consumer device comprising: means for storing the plurality of audio streams and audio location information associated with the plurality of audio streams and representative of audio stream coordinates in an acoustical space where an audio stream was captured or audio stream coordinates in a virtual acoustical space where an audio stream was synthesized or both, each of the audio streams representative of a soundfield; and means for determining device location information representative of device coordinates of the content consumer device in the acoustical space; means for selecting, based on the device location information and the audio location information, a subset of the plurality of audio streams, the subset of the plurality of audio streams excluding at least one of the plurality of audio streams; and means for outputting, based on the subset of the plurality of audio streams, one or more speaker feeds.
-
A non-transitory computer-readable storage medium having stored thereon instructions that, when executed, cause one or more processors of a content consumer device to: store a plurality of audio streams and audio location information associated with the plurality of audio streams and representative of audio stream coordinates in an acoustical space where an audio stream was captured or audio stream coordinates in a virtual acoustical space where an audio stream was synthesized or both, each of the audio streams representative of a soundfield; and determine device location information representative of device coordinates of the content consumer device in the acoustical space; select, based on the device location information and the audio location information, a subset of the plurality of audio streams, the subset of the plurality of audio streams excluding at least one of the plurality of audio streams; and output, based on the subset of the plurality of audio streams, one or more speaker feeds.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Patent Application 62/870,573, filed on Jul. 3, 2019, and U.S. Provisional Patent Application 62/992,635, filed on Mar. 20, 2020, the entire contents of both of which are incorporated by reference.
TECHNICAL FIELD
[0002] This disclosure relates to processing of media data, such as audio data.
BACKGROUND
[0003] Computer-mediated reality systems are being developed to allow computing devices to augment or add to, remove or subtract from, or generally modify existing reality experienced by a user. Computer-mediated reality systems (which may also be referred to as “extended reality systems,” or “XR systems”) may include, as examples, virtual reality (VR) systems, augmented reality (AR) systems, and mixed reality (MR) systems. The perceived success of computer-mediated reality systems is generally related to the ability of such systems to provide a realistically immersive experience in terms of both the video and audio experience, where the video and audio experience align in ways expected by the user. Although the human visual system is more sensitive than the human auditory system (e.g., in terms of perceived localization of various objects within the scene), ensuring an adequate auditory experience is an increasingly important factor in ensuring a realistically immersive experience, particularly as the video experience improves to permit better localization of video objects that enable the user to better identify sources of audio content.
SUMMARY
[0004] This disclosure relates generally to auditory aspects of the user experience of computer-mediated reality systems, including virtual reality (VR), mixed reality (MR), augmented reality (AR), computer vision, and graphics systems. Various aspects of the techniques may provide for adaptive audio capture and rendering of an acoustical space for extended reality systems. As used herein, an acoustic environment is represented as either an indoor environment or an outdoor environment, or both an indoor environment and an outdoor environment. The acoustic environment may include one or more sub-acoustic spaces that may include various acoustic elements. An example of an outdoor environment could include a car, buildings, walls, a forest, etc. An acoustical space may be an example of an acoustical environment and may be an indoor space or an outdoor space. As used herein, an audio element is either a sound captured by a microphone (e.g., directly captured from near-field sources or reflections from far-field sources whether real or synthetic), or a sound field previously synthesized, or a mono sound synthesized from text to speech, or a reflection of a virtual sound from an object in the acoustic environment. An audio element may also be referred to herein as a receiver.
[0005] In one example, various aspects of the techniques are directed to a content consumer device configured to play one or more of a plurality of audio streams, the content consumer device including: a memory configured to store the plurality of audio streams and audio location information associated with the plurality of audio streams and representative of audio stream coordinates in an acoustical space where an audio stream was captured or audio stream coordinates in a virtual acoustical space where an audio stream was synthesized or both, each of the audio streams representative of a soundfield; and one or more processors coupled to the memory, and configured to: determine device location information representative of device coordinates of the content consumer device in the acoustical space; select, based on the device location information and the audio location information, a subset of the plurality of audio streams, the subset of the plurality of audio streams excluding at least one of the plurality of audio streams; and output, based on the subset of the plurality of audio streams, one or more speaker feeds.
[0006] In another example, various aspects of the techniques are directed to a method of playing one or more of a plurality of audio streams, the method including: storing, by a memory of a content consumer device, the plurality of audio streams and audio location information associated with the plurality of audio streams and representative of audio stream coordinates in an acoustical space where an audio stream was captured or audio stream coordinates in a virtual acoustical space where an audio stream was synthesized or both, each of the audio streams representative of a soundfield; and determining, by one or more processors of the content consumer device, device location information representative of device coordinates of the content consumer device in the acoustical space; selecting, by the one or more processors and based on the device location information and the audio location information, a subset of the plurality of audio streams, the subset of the plurality of audio streams excluding at least one of the plurality of audio streams; and outputting, by the one or more processors and based on the subset of the plurality of audio streams, one or more speaker feeds.
[0007] In another example, various aspects of the techniques are directed to a content consumer device configured to play one or more of a plurality of audio streams, the content consumer device including: means for storing the plurality of audio streams and audio location information associated with the plurality of audio streams and representative of audio stream coordinates in an acoustical space where an audio stream was captured or audio stream coordinates in a virtual acoustical space where an audio stream was synthesized or both, each of the audio streams representative of a soundfield; and means for determining device location information representative of device coordinates of the content consumer device in the acoustical space; means for selecting, based on the device location information and the audio location information, a subset of the plurality of audio streams, the subset of the plurality of audio streams excluding at least one of the plurality of audio streams; and means for outputting, based on the subset of the plurality of audio streams, one or more speaker feeds.
[0008] In another example, various aspects of the techniques are directed to a non-transitory computer-readable storage medium having stored thereon instructions that, when executed, cause one or more processors of a content consumer device to: store a plurality of audio streams and audio location information associated with the plurality of audio streams and representative of audio stream coordinates in an acoustical space where an audio stream was captured or audio stream coordinates in a virtual acoustical space where an audio stream was synthesized or both, each of the audio streams representative of a soundfield; and determine device location information representative of device coordinates of the content consumer device in the acoustical space; select, based on the device location information and the audio location information, a subset of the plurality of audio streams, the subset of the plurality of audio streams excluding at least one of the plurality of audio streams; and output, based on the subset of the plurality of audio streams, one or more speaker feeds.
[0009] The details of one or more examples of this disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of various aspects of the techniques will be apparent from the description and drawings, and from the claims.
BRIEF DESCRIPTION OF DRAWINGS
[0010] FIGS. 1A-1C are diagrams illustrating systems that may perform various aspects of the techniques described in this disclosure.
[0011] FIG. 2 is a diagram illustrating an example of a VR device worn by a user.
[0012] FIGS. 3A-3E are diagrams illustrating, in more detail, example operations of the stream selection unit shown in the examples of FIGS. 1A-1C.
[0013] FIGS. 4A-4E are flowcharts illustrating example operation of the stream selection unit shown in the examples of FIGS. 1A-1C in performing various aspects of the stream selection techniques.
[0014] FIGS. 5A-5D are conceptual diagrams illustrating examples of snapping in accordance with aspects of the present disclosure.
[0015] FIG. 6 is a diagram illustrating an example of a wearable device that may operate in accordance with various aspects of the techniques described in this disclosure.
[0016] FIGS. 7A and 7B are diagrams illustrating other example systems that may perform various aspects of the techniques described in this disclosure.
[0017] FIG. 8 is a block diagram illustrating example components of one or more of the source device and the content consumer device shown in the example of FIG. 1.
[0018] FIG. 9 illustrates an example of a wireless communications system in accordance with aspects of the present disclosure.
DETAILED DESCRIPTION
[0019] Rendering an XR scene, such as a six degrees of freedom (6DoF) scene, with many audio sources, which may be obtained from audio capture devices of a live scene or synthesized in a virtual or live scene, may require a balance between including more or less audio information. This balancing may be performed offline by a mixing engineer, which may be expensive and time consuming. In some cases, the balancing may be performed by a server in communication with the renderer. In these cases, the balancing may not occur when the renderer is offline, and when the renderer is online, repeated communication with the server (to report the position of the XR device and receive updated audio information) may introduce latency.
[0020] According to the techniques of this disclosure, a content consumer device (such as an XR device) may determine device location information representative of device coordinates in an acoustical space. The content consumer device may select a subset of the plurality of audio streams based on a proximity distance between the device location information and audio location information associated with the plurality of audio streams (the audio location information being representative of audio stream coordinates in the acoustical space where an audio stream was captured or synthesized): a greater number of the plurality of audio streams when the proximity distance is less than a proximity distance threshold, and a lesser number when the proximity distance is greater than the threshold. The techniques of this disclosure may eliminate the need for balancing by a mixing engineer and for repeated communication between the content consumer device and a server.
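To make the selection logic concrete, the following is a minimal Python sketch of the proximity-distance selection described above. The `AudioStream` type, the threshold, and the stream counts are illustrative assumptions of this sketch; the disclosure does not prescribe particular data structures or values.

```python
import math
from dataclasses import dataclass

# Illustrative container; the disclosure does not prescribe a data structure.
@dataclass
class AudioStream:
    stream_id: int
    coords: tuple[float, float, float]  # audio stream coordinates (captured or synthesized)
    payload: bytes                      # encoded soundfield data

def select_streams(streams: list[AudioStream],
                   device_coords: tuple[float, float, float],
                   proximity_threshold: float,
                   near_count: int = 4,
                   far_count: int = 1) -> list[AudioStream]:
    """Select a subset of the audio streams: a larger number when the
    proximity distance is at or below the threshold, a smaller number
    when it is above the threshold (the counts here are hypothetical)."""
    ranked = sorted(streams, key=lambda s: math.dist(s.coords, device_coords))
    proximity = math.dist(ranked[0].coords, device_coords)
    count = near_count if proximity <= proximity_threshold else far_count
    return ranked[:count]  # excludes at least one stream when count < len(streams)
```

A device might then pass only the selected subset to its audio decoder and renderer, which is consistent with claim 24's option of decoding only the selected subset.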
[0021] Furthermore, when a user is in an XR scene, the user may desire to experience audio from a listening position different from the one the device location information indicates. According to the techniques of this disclosure, a user may enter a snapping mode. In the snapping mode, the audio experience of the user may snap to an audio stream based on one or more audio source distances and one or more audio source distance thresholds, where an audio source distance may be a distance between the device coordinates and the audio stream coordinates for an audio stream. In this manner, a user's auditory experience may be improved.
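Continuing the sketch above (and reusing its hypothetical `AudioStream` type and the `math` import), snapping might be implemented as follows: when the device is within a snap threshold of a stream's coordinates, only that stream is rendered; otherwise nearby streams are kept as candidates for combining. The two-stream fallback is an assumption, not a requirement of the disclosure.

```python
def snap_select(streams: list[AudioStream],
                device_coords: tuple[float, float, float],
                snap_threshold: float) -> list[AudioStream]:
    """Snapping mode: play only the nearest stream when its audio source
    distance is at or below the snap threshold; otherwise return the two
    nearest streams as candidates for adaptive mixing or interpolation."""
    ranked = sorted(streams, key=lambda s: math.dist(s.coords, device_coords))
    if math.dist(ranked[0].coords, device_coords) <= snap_threshold:
        return [ranked[0]]  # snapped: single-stream playback
    return ranked[:2]       # not snapped: combine the surrounding streams
```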
[0022] There are a number of different ways to represent a soundfield. Example formats include channel-based audio formats, object-based audio formats, and scene-based audio formats. Channel-based audio formats refer to the 5.1 surround sound format, the 7.1 surround sound format, the 22.2 surround sound format, or any other channel-based format that localizes audio channels to particular locations around the listener in order to recreate a soundfield.
[0023] Object-based audio formats may refer to formats in which audio objects, often encoded using pulse-code modulation (PCM) and referred to as PCM audio objects, are specified in order to represent the soundfield. Such audio objects may include metadata identifying a location of the audio object relative to a listener or other point of reference in the soundfield, such that the audio object may be rendered to one or more speaker channels for playback in an effort to recreate the soundfield.
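As a rough illustration of the kind of metadata an object-based format carries, the following hypothetical structure pairs PCM samples with a location relative to a reference point; the field names and coordinate convention are assumptions of this sketch, not any standardized format.

```python
from dataclasses import dataclass

@dataclass
class PCMAudioObject:
    """PCM samples plus metadata locating the object in the soundfield.
    The (r, theta, phi) layout is an assumed spherical-coordinate convention."""
    samples: list[float]                  # raw PCM audio samples
    sample_rate_hz: int                   # e.g., 48000
    location: tuple[float, float, float]  # (r, theta, phi) relative to the listener
```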
[0024] Scene-based audio formats may include a hierarchical set of elements that define the soundfield in three dimensions. One example of a hierarchical set of elements is a set of spherical harmonic coefficients (SHC). The following expression demonstrates a description or representation of a soundfield using SHC:
$$p_i(t, r_r, \theta_r, \varphi_r) = \sum_{\omega=0}^{\infty} \left[ 4\pi \sum_{n=0}^{\infty} j_n(k r_r) \sum_{m=-n}^{n} A_n^m(k) \, Y_n^m(\theta_r, \varphi_r) \right] e^{j\omega t},$$
[0025] The expression shows that the pressure $p_i$ at any point $\{r_r, \theta_r, \varphi_r\}$ of the soundfield, at time $t$, can be represented uniquely by the SHC, $A_n^m(k)$. Here, $k = \omega/c$, $c$ is the speed of sound (approximately 343 m/s), $\{r_r, \theta_r, \varphi_r\}$ is a point of reference (or observation point), $j_n(\cdot)$ is the spherical Bessel function of order $n$, and $Y_n^m(\theta_r, \varphi_r)$ are the spherical harmonic basis functions (which may also be referred to as spherical basis functions) of order $n$ and suborder $m$. It can be recognized that the term in square brackets is a frequency-domain representation of the signal (e.g., $S(\omega, r_r, \theta_r, \varphi_r)$), which can be approximated by various time-frequency transformations, such as the discrete Fourier transform (DFT), the discrete cosine transform (DCT), or a wavelet transform. Other examples of hierarchical sets include sets of wavelet transform coefficients and other sets of coefficients of multiresolution basis functions.
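As a numerical companion to the expression above, the sketch below evaluates the bracketed frequency-domain term for a single frequency using SciPy's spherical Bessel and spherical harmonic routines. The coefficient indexing `A[n][m + n]` is an assumption of this sketch; note also that `scipy.special.sph_harm` takes the azimuthal angle before the polar angle, hence the argument order in the call.

```python
import numpy as np
from scipy.special import spherical_jn, sph_harm

def pressure_term(A, k, r, theta, phi, max_order):
    """Evaluate 4*pi * sum_n j_n(k r) * sum_m A_n^m(k) * Y_n^m(theta, phi),
    i.e., the frequency-domain term in square brackets above.
    A[n][m + n] holds A_n^m(k); theta is the polar angle, phi the azimuth."""
    total = 0j
    for n in range(max_order + 1):
        jn = spherical_jn(n, k * r)
        for m in range(-n, n + 1):
            # scipy convention: sph_harm(m, n, azimuth, polar)
            total += jn * A[n][m + n] * sph_harm(m, n, phi, theta)
    return 4 * np.pi * total
```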
[0026] The SHC $A_n^m(k)$ can either be physically acquired (e.g., recorded) by various microphone array configurations or, alternatively, derived from channel-based or object-based descriptions of the soundfield. The SHC (which may also be referred to as ambisonic coefficients) represent scene-based audio, where the SHC may be input to an audio encoder to obtain encoded SHC that may promote more efficient transmission or storage. For example, a fourth-order representation involving $(1+4)^2 = 25$ coefficients may be used.
[0027] As noted above, the SHC may be derived from a microphone recording using a microphone array. Various examples of how SHC may be physically acquired from microphone arrays are described in Poletti, M., “Three-Dimensional Surround Sound Systems Based on Spherical Harmonics,” J. Audio Eng. Soc., Vol. 53, No. 11, 2005 November, pp. 1004-1025.
[0028] The following equation may illustrate how the SHC may be derived from an object-based description. The coefficients $A_n^m(k)$ for the soundfield corresponding to an individual audio object may be expressed as:

$$A_n^m(k) = g(\omega)(-4\pi i k)\, h_n^{(2)}(k r_s)\, Y_n^{m*}(\theta_s, \varphi_s),$$

where $i$ is $\sqrt{-1}$, $h_n^{(2)}(\cdot)$ is the spherical Hankel function (of the second kind) of order $n$, and $\{r_s, \theta_s, \varphi_s\}$ is the location of the object. Knowing the object source energy $g(\omega)$ as a function of frequency (e.g., using time-frequency analysis techniques, such as performing a fast Fourier transform on the pulse-code modulated (PCM) stream) may enable conversion of each PCM object and the corresponding location into the SHC $A_n^m(k)$. Further, it can be shown (since the above is a linear and orthogonal decomposition) that the $A_n^m(k)$ coefficients for each object are additive. In this manner, a number of PCM objects can be represented by the $A_n^m(k)$ coefficients (e.g., as a sum of the coefficient vectors for the individual objects). The coefficients may contain information about the soundfield (the pressure as a function of 3D coordinates), and the above represents the transformation from individual objects to a representation of the overall soundfield in the vicinity of the observation point $\{r_r, \theta_r, \varphi_r\}$.
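A corresponding sketch for the point-source coefficients: the spherical Hankel function of the second kind is assembled from the spherical Bessel functions of the first and second kinds, $h_n^{(2)}(x) = j_n(x) - i\,y_n(x)$, and the conjugate of the spherical harmonic supplies the $Y_n^{m*}$ term. The function name and argument layout are assumptions of this sketch.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn, sph_harm

def object_sh_coefficient(n, m, k, g_omega, r_s, theta_s, phi_s):
    """A_n^m(k) = g(omega) * (-4*pi*i*k) * h_n^(2)(k r_s) * conj(Y_n^m(theta_s, phi_s)).
    h_n^(2)(x) = j_n(x) - i*y_n(x); theta_s is the polar angle, phi_s the azimuth."""
    h2 = spherical_jn(n, k * r_s) - 1j * spherical_yn(n, k * r_s)
    Y = sph_harm(m, n, phi_s, theta_s)  # scipy convention: sph_harm(m, n, azimuth, polar)
    return g_omega * (-4j * np.pi * k) * h2 * np.conjugate(Y)
```

Because the decomposition is linear, summing this coefficient over all objects (for each $n$, $m$) yields the coefficients of the combined soundfield, matching the additivity noted above.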
[0029] The techniques described in this disclosure may apply to any of the formats discussed herein, including scene-based audio formats, channel-based audio formats, object-based audio formats, or any combination thereof.
……