# Google Patent | Signal processing methods and systems for rendering audio on virtual loudspeaker arrays

**Patent:** Signal processing methods and systems for rendering audio on virtual loudspeaker arrays

**Publication Number:** 10142755

**Publication Date:** 2018-11-27

**Applicants:** Google

**Abstract**

Techniques of rendering audio involve applying a balanced-realization state space model to each head-related transfer function (HRTF) to reduce the order of an effective FIR or even an infinite impulse response (IIR) filter. Along these lines, each HRTF G(z) is derived from a head-related impulse response filter (HRIR) via, e.g., a z-transform. The data of the HRIR may be used to construct a first state space representation [A, B, C, D] of the HRTF via the relation $G(z) = C(zI - A)^{-1}B + D$. This first state space representation is not unique and so for an FIR filter, A and B may be set to simple, binary-valued arrays, while C and D contain the HRIR data. This representation leads to a simple form of a Gramian Q whose eigenvectors provide system states that maximize the system gain as measured by a Hankel norm.

**Background**

A virtual array of loudspeakers surrounding a listener is commonly used in the creation of a virtual spatial acoustic environment for headphone-delivered audio. The sound field created by this speaker array can be manipulated to deliver the effect of sound sources moving relative to the user, or to stabilize the source at a fixed spatial location when the user moves their head. These operations are of major importance to the delivery of audio through headphones in Virtual Reality (VR) systems.

The multi-channel audio, which is processed for delivery to the virtual loudspeakers, is combined to provide a pair of signals to the left and right headphone speakers. This process of combining multi-channel audio is known as binaural rendering. The most effective commonly accepted way of implementing this rendering is to use a multi-channel filtering system that implements Head Related Transfer Functions (HRTFs). In a system based on M virtual loudspeakers (where M is an arbitrary number), the binaural renderer will need to have 2M HRTF filters, because a pair is used per loudspeaker to model the transfer function between the loudspeaker and the user’s left and right ears.
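As a minimal sketch of the conventional rendering described above, assuming each HRTF is applied as a time-domain FIR filter (an HRIR), the M loudspeaker feeds can be convolved with their left and right HRIRs and summed per ear. The function name `render_binaural` and the array layout are illustrative assumptions, not something the patent specifies.

```python
import numpy as np

def render_binaural(speaker_signals, hrirs_left, hrirs_right):
    """Mix M virtual-loudspeaker feeds down to one left and one right signal.

    speaker_signals: shape (M, T), one row per virtual loudspeaker.
    hrirs_left, hrirs_right: shape (M, n), one HRIR per loudspeaker and ear.
    (Hypothetical layout chosen for this sketch.)
    """
    speaker_signals = np.atleast_2d(np.asarray(speaker_signals, dtype=float))
    hrirs_left = np.atleast_2d(np.asarray(hrirs_left, dtype=float))
    hrirs_right = np.atleast_2d(np.asarray(hrirs_right, dtype=float))

    num_speakers, num_samples = speaker_signals.shape
    out_len = num_samples + hrirs_left.shape[1] - 1
    left = np.zeros(out_len)
    right = np.zeros(out_len)

    # 2M filters in total: one left and one right HRIR per loudspeaker.
    for m in range(num_speakers):
        left += np.convolve(speaker_signals[m], hrirs_left[m])
        right += np.convolve(speaker_signals[m], hrirs_right[m])
    return left, right
```

Each output sample of this direct form costs on the order of n multiply-adds per filter, which is the cost the reduced-order models are meant to cut.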

**Summary**

Conventional approaches to performing binaural rendering require large amounts of computational resources. Along these lines, when an HRTF is represented as a finite impulse response (FIR) filter of order n, each binaural output requires 2 M n multiply and addition operations per channel. Such operations may tax the limited resources allotted for binaural rendering in, for example, virtual reality applications.

In contrast to the conventional approaches to performing binaural rendering which require large amounts of computational resources, improved techniques involve applying a balanced-realization state space model to each HRTF to reduce the order of an effective FIR or even an infinite impulse response (IIR) filter. Along these lines, each HRTF G(z) is derived from a head-related impulse response filter (HRIR) via, e.g., a z-transform. The data of the HRIR may be used to construct a first state space representation [A, B, C, D] of the HRTF via the relation $G(z) = C(zI - A)^{-1}B + D$. This first state space representation is not unique and so for an FIR filter, A and B may be set to simple, binary-valued arrays, while C and D contain the HRIR data. This representation leads to a simple form of a Gramian Q whose eigenvectors provide system states that maximize the system gain as measured by a Hankel norm. Further, a factorization of Q provides a transformation into a balanced state space in which the Gramian is equal to a diagonal matrix of the eigenvalues of Q. By considering only those states associated with an eigenvalue greater than some threshold, the balanced state space representation of the HRTF may be truncated to provide an approximate HRTF that approximates the original HRTF very well while reducing the amount of computation required by as much as 90%.
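One way to realize the "binary-valued A and B" construction is sketched below. It assumes the common shift-register form (A is a down-shift matrix, B is the first unit vector, C holds the HRIR samples after the first, and D holds the first sample). The patent text does not spell out this exact layout, so treat it as an assumption. The HRIR values are made up for illustration.

```python
import numpy as np

def fir_state_space(h):
    """State space [A, B, C, D] of an FIR filter with impulse response h.

    Assumed layout: A shifts the state down by one, B injects the input
    into the first state, C carries h[1:], and D carries h[0].
    """
    h = np.asarray(h, dtype=float)
    n = len(h) - 1                      # filter order = number of states
    A = np.eye(n, k=-1)                 # ones on the first subdiagonal
    B = np.zeros((n, 1))
    B[0, 0] = 1.0
    C = h[1:].reshape(1, n)
    D = h[:1].reshape(1, 1)
    return A, B, C, D

def impulse_response(A, B, C, D, length):
    """Markov parameters D, CB, CAB, ... of G(z) = C (zI - A)^-1 B + D."""
    out = [D[0, 0]]
    x = B.copy()
    for _ in range(length - 1):
        out.append((C @ x)[0, 0])
        x = A @ x
    return np.array(out)

h = [0.5, 1.0, -0.3, 0.2]               # made-up HRIR samples
A, B, C, D = fir_state_space(h)
recovered = impulse_response(A, B, C, D, 6)
```

Here `recovered` reproduces the original samples followed by zeros, which confirms that the realization encodes the same filter.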

One general aspect of the improved techniques includes a method of rendering sound fields in a left ear and a right ear of a human listener, the sound fields being produced by a plurality of virtual loudspeakers. The method can include obtaining, by processing circuitry of a sound rendering computer configured to render the sound fields in the left ear and the right ear of the head of the human listener, a plurality of head-related impulse responses (HRIRs), each of the plurality of HRIRs being associated with a virtual loudspeaker of the plurality of virtual loudspeakers and an ear of the human listener, each of the plurality of HRIRs including samples of a sound field produced at a specified sampling rate in a left or right ear produced in response to an audio impulse produced by that virtual loudspeaker. The method can also include generating a first state space representation of each of the plurality of HRIRs, the first state space representation including a matrix, a column vector, and a row vector, each of the matrix, the column vector, and the row vector of the first state space representation having a first size. The method can further include performing a state space reduction operation to produce a second state space representation of each of the plurality of HRIRs, the second state space representation including a matrix, a column vector, and a row vector, each of the matrix, the column vector, and the row vector of the second state space representation having a second size that is less than the first size.
The method can further include producing a plurality of head-related transfer functions (HRTFs) based on the second state space representation, each of the plurality of HRTFs corresponding to a respective HRIR of the plurality of HRIRs, an HRTF corresponding to a respective HRIR producing, upon multiplication by a frequency-domain sound field produced by the virtual loudspeaker with which the respective HRIR is associated, a component of a sound field rendered in an ear of the human listener.

Performing the state space reduction operation can include, for each HRIR of the plurality of HRIRs, generating a respective Gramian matrix based on the first state space representation of that HRIR, the Gramian matrix having a plurality of eigenvalues arranged in descending order of magnitude, and generating the second state space representation of that HRIR based on the Gramian matrix and the plurality of eigenvalues, wherein the second size is equal to a number of eigenvalues of the plurality of eigenvalues greater than a specified threshold.

Generating the second state space representation of each HRIR of the plurality of HRIRs can include forming a transformation matrix that, when applied to the Gramian matrix that is based on the first state space representation of that HRIR, produces a diagonal matrix, each diagonal element of the diagonal matrix being equal to a respective eigenvalue of the plurality of eigenvalues.
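The reduction step can be sketched as follows, under several assumptions: the first representation is the shift-register FIR form (a choice not fixed by the text), Q is taken to be the observability Gramian, and the transformation matrix is the orthonormal eigenvector matrix of Q, so that applying it to Q yields the diagonal matrix of eigenvalues. The test signal and the threshold value are made up.

```python
import numpy as np

def fir_state_space(h):
    """Assumed shift-register realization of an FIR filter."""
    h = np.asarray(h, dtype=float)
    n = len(h) - 1
    A = np.eye(n, k=-1)
    B = np.zeros((n, 1))
    B[0, 0] = 1.0
    return A, B, h[1:].reshape(1, n), h[:1].reshape(1, 1)

def reduce_fir(h, threshold):
    """Keep only the states whose Gramian eigenvalue exceeds `threshold`."""
    A, B, C, D = fir_state_space(h)
    n = A.shape[0]

    # Observability Gramian Q = sum_k (A^T)^k C^T C A^k. The sum is finite
    # because A is nilpotent, so Q = O^T O with rows of O equal to C A^k.
    rows, row = [], C.copy()
    for _ in range(n):
        rows.append(row[0])
        row = row @ A
    obs = np.array(rows)
    Q = obs.T @ obs

    # Eigen-decomposition, sorted in descending order of eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(Q)
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]

    r = int(np.sum(eigvals > threshold))    # second (reduced) size
    V = eigvecs[:, :r]                      # V^T Q V = diag(eigvals[:r])
    return (V.T @ A @ V, V.T @ B, C @ V, D), eigvals

def impulse_response(A, B, C, D, length):
    out, x = [D[0, 0]], B.copy()
    for _ in range(length - 1):
        out.append((C @ x)[0, 0])
        x = A @ x
    return np.array(out)

# Made-up, HRIR-like decaying oscillation with 64 samples (63 states).
k = np.arange(64)
h = 0.9 ** k * np.cos(0.3 * k)
(A_r, B_r, C_r, D_r), eigvals = reduce_fir(h, threshold=1e-2)
h_r = impulse_response(A_r, B_r, C_r, D_r, len(h))
```

For this realization the controllability Gramian is the identity, so the square roots of the eigenvalues of Q are the Hankel singular values, and the discarded ones bound the approximation error. The reduced matrix `A_r` is generally dense and no longer nilpotent, so the reduced model behaves as a low-order IIR filter.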

The method can further include, for each of the plurality of HRIRs, generating a cepstrum of that HRIR, the cepstrum having causal samples taken at positive times and non-causal samples taken at negative times, for each of the non-causal samples of the cepstrum, performing a phase minimization operation by adding that non-causal sample taken at a negative time to a causal sample of the cepstrum taken at the opposite of that negative time, and producing a minimum-phase HRIR by setting each of the non-causal samples of the cepstrum to zero after performing the phase minimization operation for each of the non-causal samples of the cepstrum.
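The cepstral folding described above can be sketched with FFTs. This assumes the real cepstrum (the inverse transform of the log magnitude spectrum) and a zero-padded FFT length `n_fft`. Both are choices made for this sketch, as is the small `eps` that guards the logarithm. Negative-time samples are stored at the top of the FFT buffer, so the sample at time -k sits at index `n_fft - k`.

```python
import numpy as np

def minimum_phase_hrir(h, n_fft=1024, eps=1e-12):
    """Minimum-phase version of h with the same magnitude response.

    n_fft (even, at least len(h)) and eps are assumptions of this sketch.
    """
    h = np.asarray(h, dtype=float)
    if n_fft % 2 or n_fft < len(h):
        raise ValueError("n_fft must be even and at least len(h)")

    # Real cepstrum: inverse FFT of the log magnitude spectrum.
    log_mag = np.log(np.abs(np.fft.fft(h, n_fft)) + eps)
    cepstrum = np.fft.ifft(log_mag).real

    half = n_fft // 2
    folded = np.zeros(n_fft)
    folded[0] = cepstrum[0]
    # Add each non-causal sample (time -k) onto the causal sample at time +k.
    folded[1:half] = cepstrum[1:half] + cepstrum[:half:-1]
    folded[half] = cepstrum[half]
    # Non-causal samples (indices above half) are left at zero.

    return np.fft.ifft(np.exp(np.fft.fft(folded))).real

# Made-up HRIR: three leading zeros followed by a minimum-phase tail.
h = [0.0, 0.0, 0.0, 1.0, 0.5, -0.25, 0.1]
h_min = minimum_phase_hrir(h)
```

In this example the leading delay is removed while the magnitude response at the FFT bins is unchanged. In a renderer, the removed delay would need to be restored separately, for instance as an interaural time delay.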

The method can further include generating a multiple input, multiple output (MIMO) state space representation, the MIMO state space representation including a composite matrix, a column vector matrix, and a row vector matrix, the composite matrix of the MIMO state space representation including the matrix of the first representation of each of the plurality of HRIRs, the column vector matrix of the MIMO state space representation including the column vector of the first representation of each of the plurality of HRIRs, the row vector matrix of the MIMO state space representation including the row vector of the first representation of each of the plurality of HRIRs. In this case, performing the state space reduction operation includes generating a reduced composite matrix, a reduced column vector matrix, and a reduced row vector matrix, each of the reduced composite matrix, reduced column vector matrix, and reduced row vector matrix having a size that is respectively less than a size of the composite matrix, the column vector matrix, and the row vector matrix.

Generating the MIMO state space representation can include forming, as the composite matrix of the MIMO state space representation, a first block matrix having a matrix of the first state space representation of an HRIR associated with a virtual loudspeaker of the plurality of virtual loudspeakers as a diagonal element of the first block matrix, matrices of the first state space representation of HRIRs associated with the same virtual loudspeaker being in adjacent diagonal elements of the first block matrix. Generating the MIMO state space representation can also include forming, as the column vector matrix of the MIMO state space representation, a second block matrix having a column vector of the first state space representation of an HRIR associated with a virtual loudspeaker of the plurality of virtual loudspeakers as a diagonal element of the second block matrix, column vectors of the first state space representation of HRIRs associated with the same virtual loudspeaker being in adjacent diagonal elements of the second block matrix. Generating the MIMO state space representation can further include forming, as the row vector matrix of the MIMO state space representation, a third block matrix having a row vector of the first state space representation of an HRIR associated with a virtual loudspeaker of the plurality of virtual loudspeakers as an element of the third block matrix, row vectors of the first state space representation of HRIRs that render sounds in the left ear being in odd-numbered elements of the first row of the third block matrix, row vectors of the first state space representation of HRIRs that render sounds in the right ear being in even-numbered elements of the second row of the third block matrix.
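A possible assembly of the MIMO matrices is sketched below. It is one interpretation of the block layout, with these assumptions: each per-HRIR system uses the shift-register FIR form, the two HRIRs of a loudspeaker share that loudspeaker's single input (so the column vector matrix has one column per loudspeaker), and the output has two rows, left ear first. The function names and sample values are hypothetical.

```python
import numpy as np

def fir_state_space(h):
    """Assumed shift-register realization of an FIR filter."""
    h = np.asarray(h, dtype=float)
    n = len(h) - 1
    A = np.eye(n, k=-1)
    B = np.zeros((n, 1))
    B[0, 0] = 1.0
    return A, B, h[1:].reshape(1, n), h[:1].reshape(1, 1)

def build_mimo(hrir_pairs):
    """hrir_pairs: list of (left_hrir, right_hrir), one per loudspeaker."""
    systems = []
    for left, right in hrir_pairs:
        # Left and right systems of one loudspeaker sit in adjacent blocks.
        systems.append(fir_state_space(left))
        systems.append(fir_state_space(right))

    sizes = [s[0].shape[0] for s in systems]
    total, num_speakers = sum(sizes), len(hrir_pairs)
    A = np.zeros((total, total))
    B = np.zeros((total, num_speakers))
    C = np.zeros((2, total))
    D = np.zeros((2, num_speakers))

    offset = 0
    for idx, (A_i, B_i, C_i, D_i) in enumerate(systems):
        speaker, ear = divmod(idx, 2)       # ear 0 = left, ear 1 = right
        block = slice(offset, offset + sizes[idx])
        A[block, block] = A_i               # block-diagonal composite matrix
        B[block, speaker] = B_i[:, 0]       # both ears share the speaker input
        C[ear, block] = C_i[0]              # left in row 0, right in row 1
        D[ear, speaker] = D_i[0, 0]
        offset += sizes[idx]
    return A, B, C, D

def mimo_impulse_response(A, B, C, D, speaker, length):
    """Left/right outputs (shape (2, length)) for an impulse on one speaker."""
    out = [D[:, speaker]]
    x = B[:, speaker].copy()
    for _ in range(length - 1):
        out.append(C @ x)
        x = A @ x
    return np.array(out).T

# Made-up HRIR pairs for two virtual loudspeakers.
pairs = [([1, 2, 3], [4, 5]), ([6, 7], [8, 9, 10, 11])]
A, B, C, D = build_mimo(pairs)
resp0 = mimo_impulse_response(A, B, C, D, speaker=0, length=4)
resp1 = mimo_impulse_response(A, B, C, D, speaker=1, length=4)
```

An impulse on one loudspeaker input returns that loudspeaker's left HRIR on the first output and its right HRIR on the second, which is a quick check that the blocks were placed consistently.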

The method can further include, prior to generating the MIMO state space representation, for each HRIR of the plurality of HRIRs, performing a single input single output (SISO) state space reduction operation to produce, as the first state space representation of that HRIR, a SISO state space representation of that HRIR.

Regarding the method, for each of the plurality of virtual loudspeakers, there are a left HRIR and a right HRIR of the plurality of HRIRs associated with that virtual loudspeaker, the left HRIR producing, upon multiplication by the frequency-domain sound field produced by that virtual loudspeaker, the component of the sound field rendered in the left ear of the human listener, the right HRIR producing, upon multiplication by the frequency-domain sound field produced by that virtual loudspeaker, the component of the sound field rendered in the right ear of the human listener. Further, for each of the plurality of virtual loudspeakers, there is an interaural time delay (ITD) between the left HRIR associated with that virtual loudspeaker and the right HRIR associated with that virtual loudspeaker, the ITD being manifested in the left HRIR and the right HRIR by a difference between a number of initial samples of the sound field of the left HRIR that have zero values and a number of initial samples of the sound field of the right HRIR that have zero values. In this case, the method can further include generating an ITD unit subsystem matrix based on the ITD between the left HRIR and right HRIR associated with each of the plurality of virtual loudspeakers, and multiplying the plurality of HRTFs by the ITD unit subsystem matrix to produce a plurality of delayed HRTFs.
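The ITD definition above (a difference in the number of leading zero-valued samples) can be illustrated with a short sketch. The tolerance `tol`, the sign convention (positive when the left HRIR starts later), the sample values, and the 48 kHz rate are all assumptions for illustration. The construction of the ITD unit subsystem matrix itself is not shown, since the text does not detail it.

```python
import numpy as np

def leading_zero_count(h, tol=0.0):
    """Number of initial samples whose magnitude does not exceed tol."""
    h = np.asarray(h, dtype=float)
    nonzero = np.flatnonzero(np.abs(h) > tol)
    return int(nonzero[0]) if nonzero.size else len(h)

def itd_in_samples(left_hrir, right_hrir, tol=0.0):
    """Leading-zero difference; positive when the left HRIR starts later."""
    return leading_zero_count(left_hrir, tol) - leading_zero_count(right_hrir, tol)

# Made-up HRIR pair for a source on the listener's right-hand side.
left = [0.0, 0.0, 0.0, 0.8, 0.3]
right = [0.0, 0.5, 0.2, 0.1, 0.0]
itd = itd_in_samples(left, right)
itd_seconds = itd / 48000.0             # assumed 48 kHz sampling rate
```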

Regarding the method, each of the plurality of HRTFs can be represented by finite impulse response filters (FIRs). In this case, the method can further include performing a conversion operation on each of the plurality of HRTFs to produce another plurality of HRTFs that are each represented by infinite impulse response filters (IIRs).

Regarding the method, for each of the plurality of virtual loudspeakers, there is an HRIR associated with that virtual loudspeaker that corresponds to the ear on the side of the head nearest the loudspeaker; this is called the ipsilateral HRIR. The other HRIR associated with that virtual loudspeaker is called the contralateral HRIR. The plurality of HRTFs can be partitioned into two groups. One group contains all the ipsilateral HRTFs and the other group contains all the contralateral HRTFs. In this case, the method can be applied independently to each group and thereby produce a degree of approximation appropriate to that group.