# Sony Patent | Sound Processing Apparatus And Method, And Program

**Patent:** Sound Processing Apparatus And Method, And Program

**Publication Number:** 10595148

**Publication Date:** 20200317

**Applicants:** Sony

**Abstract**

The present technique relates to a sound processing apparatus and method, and a program, that can reproduce sound more efficiently. A sound processing apparatus includes: a head direction acquisition unit that acquires a head direction of a user listening to sound; a rotation matrix generation unit that selects two first rotation matrices on the basis of the head direction from a plurality of first rotation matrices for rotation in a first direction held in advance, selects one second rotation matrix on the basis of the head direction from a plurality of second rotation matrices for rotation in a second direction held in advance, and generates a third rotation matrix on the basis of the selected two first rotation matrices and the selected one second rotation matrix; and a head-related transfer function composition unit that composes an input signal in a spherical harmonic domain, a head-related transfer function in the spherical harmonic domain, and the third rotation matrix to generate a headphone drive signal in a time-frequency domain. The present technique can be applied to a sound processing apparatus.

**CROSS-REFERENCE TO RELATED APPLICATIONS**

This is a U.S. National Stage Application under 35 U.S.C. § 371, based on International Application No. PCT/JP2016/088382, filed Dec. 22, 2016, which claims priority to Japanese Patent Application JP 2016-002169, filed Jan. 8, 2016, each of which is hereby incorporated by reference in its entirety.

**TECHNICAL FIELD**

The present technique relates to a sound processing apparatus and method, and a program, and particularly to a sound processing apparatus and method, and a program, that can reproduce sound more efficiently.

**BACKGROUND ART**

In recent years, systems that record, transmit, and reproduce spatial information from the entire circumference have been developed and popularized in the field of sound. For example, broadcasting with three-dimensional multi-channel sound of 22.2 channels is planned in Super Hi-Vision.

Furthermore, in the field of virtual reality, systems that reproduce sound signals surrounding the entire circumference, in addition to video surrounding the entire circumference, have also begun to be distributed.

Among these, Ambisonics, a method of expressing three-dimensional sound information that can flexibly accommodate an arbitrary recording and reproduction system, is drawing attention. In particular, Ambisonics of second or higher order is called higher order Ambisonics (HOA) (for example, see NPL 1).

In the three-dimensional multi-channel sound, the information of the sound spreads to the spatial axis in addition to the temporal axis. In Ambisonics, frequency conversion, that is, spherical harmonic function conversion, is applied to the angle direction of the three-dimensional polar coordinates, and the information is held. The spherical harmonic function conversion can be considered to be equivalent to time-frequency conversion of the temporal axis of the sound signal.

An advantage of the method is that information can be encoded and decoded from an arbitrary microphone array to an arbitrary speaker array without limiting the number of microphones or the number of speakers.

On the other hand, factors that prevent popularization of Ambisonics include that a speaker array including a large number of speakers is necessary in the reproduction environment and that the reproduction range (sweet spot) of the sound space is narrow.

For example, although a speaker array including a larger number of speakers is necessary to increase the spatial resolution of sound, it is unrealistic to create such a system at home or the like. Furthermore, the area that can reproduce the sound space is narrow in a space such as a movie theater, and it is difficult to provide a desired effect to the entire audience.

**CITATION LIST**

**Non Patent Literature**

[NPL 1] Jerome Daniel, Rozenn Nicol, Sebastien Moreau, “Further Investigations of High Order Ambisonics and Wavefield Synthesis for Holophonic Sound Imaging,” AES 114th Convention, Amsterdam, Netherlands, 2003.

**SUMMARY**

**Technical Problem**

Therefore, Ambisonics and a binaural reproduction technique can be combined. The binaural reproduction technique is generally called a virtual auditory display (VAD), and a head-related transfer function (HRTF) is used to realize the binaural reproduction technique.

Here, the head-related transfer function is a function of the frequency and the direction of arrival that expresses how sound is transmitted from every direction surrounding the head of a human to the eardrums of both ears.

When a target sound is composed with the head-related transfer function for a certain direction and presented through a headphone, the listener perceives the sound as coming from the direction of that head-related transfer function rather than from the headphone. The VAD is a system using this principle.

If the VAD is used to reproduce a plurality of virtual speakers, headphone presentation can realize the same effect as Ambisonics reproduced on a speaker array system with a large number of speakers, which is difficult to build in reality.

However, such a system cannot reproduce sound with sufficient efficiency. For example, in the case where Ambisonics and the binaural reproduction technique are combined, not only does the amount of operation, such as convolution of head-related transfer functions, become large, but the amount of memory used for the operation and the like also becomes large.

The present technique has been made in view of these circumstances, and the present technique enables sound to be reproduced more efficiently.

**Solution to Problem**

An aspect of the present technique provides a sound processing apparatus including a head direction acquisition unit, a rotation matrix generation unit, and a head-related transfer function composition unit. The head direction acquisition unit acquires a head direction of a user listening to sound. The rotation matrix generation unit selects two first rotation matrices on a basis of the head direction from a plurality of first rotation matrices for rotation in a first direction held in advance, selects one second rotation matrix on a basis of the head direction from a plurality of second rotation matrices for rotation in a second direction held in advance, and generates a third rotation matrix on a basis of the selected two first rotation matrices and the selected one second rotation matrix. The head-related transfer function composition unit composes an input signal in a spherical harmonic domain, a head-related transfer function in the spherical harmonic domain, and the third rotation matrix to generate a headphone drive signal in a time-frequency domain.

The second rotation matrix can be a rotation matrix for rotation in an elevation angle direction, and on the basis of the rotation of the head of the user in the elevation angle direction indicated by the head direction, the rotation matrix generation unit can select the second rotation matrix for rotation equivalent to the rotation in the elevation angle direction.

The rotation matrix generation unit can select the second rotation matrix by determining that the rotation in the elevation angle direction is zero degrees in a case where an absolute value of the rotation of the head of the user in the elevation angle direction is equal to or smaller than a predetermined threshold.

The rotation matrix generation unit can generate the third rotation matrix only from the two first rotation matrices in a case where an absolute value of the rotation of the head of the user in the elevation angle direction is equal to or smaller than a predetermined threshold.

The head-related transfer function composition unit can obtain a product of the third rotation matrix and the input signal and obtain a sum of products of the product and the head-related transfer function to generate the headphone drive signal.

The head-related transfer function composition unit can obtain a product of the third rotation matrix and the head-related transfer function and obtain a sum of products of the product and the input signal to generate the headphone drive signal.
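The two composition orders described in the preceding two paragraphs can be compared in a short sketch. It assumes, hypothetically, that for one ear and one time-frequency bin the input signal and the head-related transfer function are each a vector of K spherical harmonic coefficients and that the third rotation matrix is K × K. The names `h`, `R`, `d`, and both function names are illustrative, the random matrix is only a stand-in for a real rotation matrix, and any conjugation or transposition convention of the actual formulas is not taken into account.

```python
import numpy as np


def compose_rotating_input(h, R, d):
    """Rotate the input signal first, then take the sum of products with the HRTF."""
    rotated_input = R @ d
    return np.sum(h * rotated_input)


def compose_rotating_hrtf(h, R, d):
    """Fold the rotation into the HRTF first, then take the sum of products with the input."""
    rotated_hrtf = R.T @ h
    return np.sum(rotated_hrtf * d)


# Hypothetical sizes: K = (N + 1)^2 coefficients for an assumed order N = 2.
K = 9
rng = np.random.default_rng(0)
d = rng.standard_normal(K) + 1j * rng.standard_normal(K)  # input signal, one bin
h = rng.standard_normal(K) + 1j * rng.standard_normal(K)  # HRTF, one ear, one bin
R = rng.standard_normal((K, K))                            # stand-in for the rotation matrix

drive_a = compose_rotating_input(h, R, d)
drive_b = compose_rotating_hrtf(h, R, d)
```

Under these assumptions both orderings give the same value, because matrix multiplication is associative; they differ only in which intermediate product is formed.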

The sound processing apparatus can further include a head direction sensor unit that detects the rotation of the head of the user, and the head direction acquisition unit can acquire a detection result of the head direction sensor unit to acquire the head direction of the user.

The sound processing apparatus can further include a time-frequency inverse conversion unit that performs time-frequency inverse conversion of the headphone drive signal.

An aspect of the present technique provides a sound processing method or a program including the steps of acquiring a head direction of a user listening to sound, selecting two first rotation matrices on a basis of the head direction from a plurality of first rotation matrices for rotation in a first direction held in advance, selecting one second rotation matrix on a basis of the head direction from a plurality of second rotation matrices for rotation in a second direction held in advance, and generating a third rotation matrix on a basis of the selected two first rotation matrices and the selected one second rotation matrix, and composing an input signal in a spherical harmonic domain, a head-related transfer function in the spherical harmonic domain, and the third rotation matrix to generate a headphone drive signal in a time-frequency domain.

In the aspects of the present technique, the head direction of the user listening to sound is acquired, the two first rotation matrices are selected on the basis of the head direction from the plurality of first rotation matrices for rotation in the first direction held in advance, the one second rotation matrix is selected on the basis of the head direction from the plurality of second rotation matrices for rotation in the second direction held in advance, the third rotation matrix is generated on the basis of the selected two first rotation matrices and the selected one second rotation matrix, and the input signal in the spherical harmonic domain, the head-related transfer function in the spherical harmonic domain, and the third rotation matrix are composed to generate the headphone drive signal in the time-frequency domain.

**Advantageous Effect of Invention**

According to the aspects of the present technique, sound can be more efficiently reproduced.

Note that the advantageous effect described here is not necessarily limiting, and the advantageous effect can be any of the advantageous effects described in the present disclosure.

**BRIEF DESCRIPTION OF DRAWINGS**

FIG. 1 is a diagram describing simulation of stereophonic sound using head-related transfer functions.

FIG. 2 depicts a configuration of a general sound processing apparatus.

FIG. 3 is a diagram describing computation of drive signals based on a general method.

FIG. 4 depicts a configuration of the sound processing apparatus further provided with a head tracking function.

FIG. 5 is a diagram describing computation of drive signals in the case where the head tracking function is further provided.

FIG. 6 is a diagram describing computation of drive signals based on a first proposed method.

FIG. 7 is a diagram describing operations in the computation of the drive signals in the first proposed method and the general method.

FIG. 8 depicts a configuration example of a sound processing apparatus according to the present technique.

FIG. 9 is a flow chart describing a drive signal generation process.

FIG. 10 is a diagram describing computation of drive signals based on a second proposed method.

FIG. 11 is a diagram describing an amount of operation and a required amount of memory in the second proposed method.

FIG. 12 depicts a configuration example of a sound processing apparatus according to the present technique.

FIG. 13 is a flow chart describing a drive signal generation process.

FIG. 14 depicts a configuration example of a sound processing apparatus according to the present technique.

FIG. 15 is a flow chart describing a drive signal generation process.

FIG. 16 is a diagram describing computation of drive signals based on a third proposed method.

FIG. 17 depicts a configuration example of a sound processing apparatus according to the present technique.

FIG. 18 is a flow chart describing a drive signal generation process.

FIG. 19 depicts a configuration example of a sound processing apparatus according to the present technique.

FIG. 20 is a flow chart describing a drive signal generation process.

FIG. 21 is a diagram describing reduction in the amount of operation through reduction of orders.

FIG. 22 is a diagram describing reduction in the amount of operation through the reduction of orders.

FIG. 23 is a diagram describing the amount of operation and the required amount of memory in each proposed method and the general method.

FIG. 24 is a diagram describing the amount of operation and the required amount of memory in each proposed method and the general method.

FIG. 25 is a diagram describing the amount of operation and the required amount of memory in each proposed method and the general method.

FIG. 26 depicts a configuration of a general sound processing apparatus based on an MPEG 3D standard.

FIG. 27 is a diagram describing computation of drive signals by the general sound processing apparatus.

FIG. 28 depicts a configuration example of a sound processing apparatus according to the present technique.

FIG. 29 is a diagram describing computation of drive signals by the sound processing apparatus according to the present technique.

FIG. 30 is a diagram describing generation of a matrix of head-related transfer functions.

FIG. 31 depicts a configuration example of a sound processing apparatus according to the present technique.

FIG. 32 is a flow chart describing a drive signal generation process.

FIG. 33 depicts a configuration example of a sound processing apparatus according to the present technique.

FIG. 34 is a flow chart describing a drive signal generation process.

FIG. 35 depicts a configuration example of a computer.

**DESCRIPTION OF EMBODIMENTS**

Hereinafter, embodiments of the present technique will be described with reference to the drawings.

**First Embodiment**

In the present technique, a head-related transfer function is handled as a function of spherical coordinates, and spherical harmonic function conversion is similarly performed to compose an input signal that is a sound signal and the head-related transfer function in a spherical harmonic domain without decoding the input signal into a speaker array signal. In this way, the present technique realizes a reproduction system more efficient in terms of an amount of operation and a used amount of memory.

For example, the spherical harmonic function conversion for a function $f(\theta, \phi)$ on the spherical coordinates is expressed by the following Formula (1).

$$F_n^m = \int_0^{\pi} \int_0^{2\pi} f(\theta, \phi)\, \overline{Y_n^m(\theta, \phi)}\, d\theta\, d\phi \tag{1}$$

In Formula (1), $\theta$ and $\phi$ indicate an angle of elevation and a horizontal angle in the spherical coordinates, respectively, and $Y_n^m(\theta, \phi)$ indicates a spherical harmonic function. Furthermore, the bar above the spherical harmonic function $Y_n^m(\theta, \phi)$ represents a complex conjugate of the spherical harmonic function $Y_n^m(\theta, \phi)$.

Here, the spherical harmonic function $Y_n^m(\theta, \phi)$ is represented by the following Formula (2).

$$Y_n^m(\theta, \phi) = (-1)^m \sqrt{\frac{(2n+1)\,(n-m)!}{4\pi\,(n+m)!}}\; P_n^m(\cos\theta)\, e^{jm\phi} \tag{2}$$

In Formula (2), $n$ and $m$ indicate an order and a degree of the spherical harmonic function $Y_n^m(\theta, \phi)$, where $-n \le m \le n$. In addition, $j$ indicates the imaginary unit, and $P_n^m(x)$ is an associated Legendre function.

The associated Legendre function $P_n^m(x)$ is represented by the following Formula (3) or Formula (4) when $n \ge 0$ and $0 \le m \le n$. Note that Formula (3) is the case where $m = 0$.

$$P_n^0(x) = \frac{1}{2^n\, n!} \frac{d^n}{dx^n} \left(x^2 - 1\right)^n \tag{3}$$

$$P_n^m(x) = \left(1 - x^2\right)^{m/2} \frac{d^m}{dx^m} P_n^0(x) \tag{4}$$

Furthermore, the associated Legendre function $P_n^m(x)$ is represented by the following Formula (5) in a case where $-n \le m \le 0$.

$$P_n^m(x) = (-1)^{-m}\, \frac{(n+m)!}{(n-m)!}\, P_n^{-m}(x) \tag{5}$$

Furthermore, inverse conversion from the function $F_n^m$ after the spherical harmonic function conversion into the function $f(\theta, \phi)$ on the spherical coordinates is as indicated in the following Formula (6).

$$f(\theta, \phi) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} F_n^m\, Y_n^m(\theta, \phi) \tag{6}$$
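Formulas (1) to (6) can be checked numerically with a small sketch. The function names, the grid sizes, and the truncation to a finite order are assumptions made for illustration. The quadrature also includes the sin θ surface-area weight of the unit sphere (absorbed here by Gauss-Legendre nodes in cos θ), which this sketch assumes so that the inverse conversion recovers the original function, and θ is taken as the angle measured from the pole.

```python
import math
import numpy as np
from numpy.polynomial import legendre as npleg


def assoc_legendre(n, m, x):
    """P_n^m(x) as in Formulas (3) to (5)."""
    if m < 0:
        # Formula (5): negative degrees are expressed through positive ones.
        k = -m
        scale = (-1) ** k * math.factorial(n - k) / math.factorial(n + k)
        return scale * assoc_legendre(n, k, x)
    # Formula (3): P_n^0 is the ordinary Legendre polynomial of order n.
    poly = npleg.Legendre.basis(n)
    if m > 0:
        # Formula (4): differentiate m times.
        poly = poly.deriv(m)
    return (1.0 - x ** 2) ** (m / 2.0) * poly(x)


def spherical_harmonic(n, m, theta, phi):
    """Y_n^m(theta, phi) as in Formula (2)."""
    theta = np.asarray(theta, dtype=float)
    phi = np.asarray(phi, dtype=float)
    norm = math.sqrt((2 * n + 1) * math.factorial(n - m)
                     / (4.0 * math.pi * math.factorial(n + m)))
    legendre_part = assoc_legendre(n, m, np.cos(theta))
    return (-1) ** m * norm * legendre_part * np.exp(1j * m * phi)


def forward_transform(f, order, n_theta=32, n_phi=64):
    """Formula (1) by quadrature; returns {(n, m): F_n^m} up to `order`."""
    x, w = npleg.leggauss(n_theta)             # nodes and weights in x = cos(theta)
    theta = np.arccos(x)
    phi = 2.0 * np.pi * np.arange(n_phi) / n_phi
    th, ph = np.meshgrid(theta, phi, indexing="ij")
    weights = w[:, None] * (2.0 * np.pi / n_phi)
    samples = f(th, ph)
    return {
        (n, m): np.sum(weights * samples * np.conj(spherical_harmonic(n, m, th, ph)))
        for n in range(order + 1)
        for m in range(-n, n + 1)
    }


def inverse_transform(coeffs, theta, phi):
    """Formula (6), truncated to the orders present in `coeffs`."""
    return sum(c * spherical_harmonic(n, m, theta, phi)
               for (n, m), c in coeffs.items())
```

For a function that contains only orders up to 1, such as `lambda th, ph: 1.0 + np.cos(th) + np.sin(th) * np.cos(ph)`, calling `forward_transform` with `order=2` should leave the order-2 coefficients at numerical zero, and `inverse_transform` should reproduce the function at arbitrary angles.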

In this way, conversion from an input signal ${D'}_n^m(\omega)$ of sound after correction of the radial direction held in the spherical harmonic domain into a speaker drive signal $S(x_i, \omega)$ of each of $L$ speakers arranged on the sphere with a radius $R$ is as indicated in the following Formula (7).

$$S(x_i, \omega) = \sum_{n=0}^{N} \sum_{m=-n}^{n} {D'}_n^m(\omega)\, Y_n^m(\beta_i, \alpha_i) \tag{7}$$

Note that in Formula (7), $x_i$ indicates the position of the speaker, and $\omega$ indicates the time frequency of the sound signal. The input signal ${D'}_n^m(\omega)$ is a sound signal corresponding to each order $n$ and degree $m$ of the spherical harmonic function with respect to a predetermined time frequency $\omega$.

Furthermore, $x_i = (R \sin\beta_i \cos\alpha_i,\ R \sin\beta_i \sin\alpha_i,\ R \cos\beta_i)$, and $i$ indicates a speaker index specifying the speaker. Here, $i = 1, 2, \ldots, L$, and $\beta_i$ and $\alpha_i$ represent an angle of elevation and a horizontal angle indicating the position of the $i$th speaker, respectively.

The conversion indicated by Formula (7) is spherical harmonic inverse conversion corresponding to Formula (6). Furthermore, in the case of using Formula (7) to obtain the speaker drive signal $S(x_i, \omega)$, the number of reproduction speakers $L$ and an order $N$ of the spherical harmonic function, that is, a maximum value $N$ of the order $n$, need to satisfy the relationship indicated in the following Formula (8).

$$L > (N + 1)^2 \tag{8}$$
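As a minimal sketch of Formula (7) and the speaker-count condition of Formula (8), the following restricts the order to N = 1 so that the four spherical harmonics can be written in closed form. The function names, the coefficient ordering (0,0), (1,-1), (1,0), (1,1), and the ring layout used in the usage note are assumptions for illustration only.

```python
import numpy as np


def first_order_harmonics(beta, alpha):
    """Y_n^m(beta, alpha) for n <= 1, ordered (0,0), (1,-1), (1,0), (1,1)."""
    beta = np.asarray(beta, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    y00 = np.full(beta.shape, np.sqrt(1.0 / (4.0 * np.pi)), dtype=complex)
    y1m1 = np.sqrt(3.0 / (8.0 * np.pi)) * np.sin(beta) * np.exp(-1j * alpha)
    y10 = np.sqrt(3.0 / (4.0 * np.pi)) * np.cos(beta) + 0j
    y11 = -np.sqrt(3.0 / (8.0 * np.pi)) * np.sin(beta) * np.exp(1j * alpha)
    return np.stack([y00, y1m1, y10, y11], axis=-1)


def decode_to_speakers(d, beta, alpha):
    """Formula (7): S(x_i, w) = sum over (n, m) of D'_n^m(w) * Y_n^m(beta_i, alpha_i).

    d has shape (4,) for one frequency bin or (4, n_bins) for several;
    beta and alpha have shape (L,). The result has one row per speaker.
    """
    return first_order_harmonics(beta, alpha) @ d


def satisfies_speaker_count(num_speakers, order):
    """Formula (8): L > (N + 1)^2."""
    return num_speakers > (order + 1) ** 2
```

For example, with eight speakers on a horizontal ring (`beta = np.full(8, np.pi / 2)`, `alpha = 2 * np.pi * np.arange(8) / 8`), `satisfies_speaker_count(8, 1)` holds, and an input whose only nonzero coefficient is the (0,0) term drives every speaker with the same signal.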

Incidentally, one general method of simulating stereophonic sound at the ears through headphone presentation is the method using a head-related transfer function illustrated in FIG. 1.

In the example illustrated in FIG. 1, an input Ambisonics signal is decoded to generate respective speaker drive signals of virtual speakers SP11-1 to SP11-8 that are a plurality of virtual speakers. The signal decoded in this case corresponds to, for example, the input signal ${D'}_n^m(\omega)$ described above.

Here, the virtual speakers SP11-1 to SP11-8 are annularly lined up and virtually arranged, and the speaker drive signal of each virtual speaker is obtained by the calculation of Formula (7) described above. Note that the virtual speakers SP11-1 to SP11-8 may also be simply referred to as virtual speakers SP11 in a case where the distinction is not particularly necessary.

When the speaker drive signal of each virtual speaker SP11 is obtained in this way, left and right drive signals (binaural signals) of a headphone HD11 that actually reproduces the sound are generated for each virtual speaker SP11 by convolution using the head-related transfer function. A sum of the drive signals of the headphone HD11 obtained for the virtual speakers SP11 is then handled as an ultimate drive signal.
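The per-speaker convolution and summation described above can be sketched as follows. This is a time-domain illustration under assumed array shapes; the function name, the argument names, and the use of head-related impulse responses (the time-domain counterpart of the head-related transfer functions) are hypothetical choices, not taken from the source.

```python
import numpy as np


def binaural_from_virtual_speakers(speaker_signals, hrir_left, hrir_right):
    """Convolve each virtual speaker signal with its left/right impulse response and sum.

    speaker_signals: (L, T) drive signals of the L virtual speakers.
    hrir_left, hrir_right: (L, K) impulse responses from each speaker to each ear.
    Returns an array of shape (2, T + K - 1): row 0 is left, row 1 is right.
    """
    num_speakers, num_samples = speaker_signals.shape
    num_taps = hrir_left.shape[1]
    out = np.zeros((2, num_samples + num_taps - 1))
    for i in range(num_speakers):
        # Contribution of virtual speaker i to each ear.
        out[0] += np.convolve(speaker_signals[i], hrir_left[i])
        out[1] += np.convolve(speaker_signals[i], hrir_right[i])
    return out
```

The loop makes the cost visible: every virtual speaker adds two convolutions, so the amount of operation grows with the number of virtual speakers.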

Note that the method is described in detail in, for example, Gerald Enzner et al., “Advanced System Options for Binaural Rendering of Ambisonic Format,” ICASSP 2013, and the like.

A head-related transfer function $H(x, \omega)$ used to generate the left and right drive signals of the headphone HD11 is obtained by normalizing a transfer characteristic $H_1(x, \omega)$, from a sound source position $x$ to an eardrum position of the user as a listener with the head of the user existing in a free space, by a transfer characteristic $H_0(x, \omega)$, from the sound source position $x$ to a center $O$ of the head with the head not existing in the free space. Therefore, the head-related transfer function $H(x, \omega)$ for the sound source position $x$ is obtained by the following Formula (9).

$$H(x, \omega) = \frac{H_1(x, \omega)}{H_0(x, \omega)} \tag{9}$$
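The normalization of Formula (9) can be sketched per frequency bin as follows, assuming that both transfer characteristics are available as measured impulse responses for one source position. The function name, the argument names, and the guard against near-zero bins are assumptions of this sketch.

```python
import numpy as np


def hrtf_from_impulse_responses(ir_ear, ir_center, n_fft, floor=1e-12):
    """Formula (9): H(x, w) = H1(x, w) / H0(x, w), evaluated on the rfft bins.

    ir_ear:    impulse response from the source to the eardrum, head present (H1).
    ir_center: impulse response from the source to the head center, head absent (H0).
    """
    h1 = np.fft.rfft(ir_ear, n_fft)
    h0 = np.fft.rfft(ir_center, n_fft)
    if np.any(np.abs(h0) < floor):
        # Assumed guard: the ratio is undefined where H0 vanishes.
        raise ValueError("H0 is (near) zero in at least one frequency bin")
    return h1 / h0
```

If, for instance, the free-field response is a pure delay of 3 samples and the ear response is a half-amplitude delay of 5 samples, the result corresponds to a half-amplitude delay of 2 samples, so the propagation common to both measurements cancels.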

Here, the head-related transfer function $H(x, \omega)$ can be convolved with an arbitrary sound signal and presented through the headphone or the like to give the listener the illusion that the sound comes from the direction of the convolved head-related transfer function $H(x, \omega)$, that is, from the direction of the sound source position $x$.

In the example illustrated in FIG. 1, such a principle is used to generate the left and right drive signals of the headphone HD11.