Facebook Patent | Binaural Synthesis
Patent: Binaural Synthesis
Publication Number: 20170105083
Publication Date: 20170413
Applicants: Facebook
Abstract
Embodiments relate to obtaining filter coefficients for a binaural synthesis filter; and applying a compensation filter to reduce artefacts resulting from the binaural synthesis filter; wherein the filter coefficients and compensation filter are configured to be used to obtain binaural audio output from a monaural audio input. The filter coefficients and compensation filter may be applied to the monaural audio input to obtain the binaural audio output. The compensation filter may comprise a timbre compensation filter.
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims priority under 35 U.S.C. § 119(a) to United Kingdom Patent Application No. 1517844.5 filed on Oct. 8, 2015, which is incorporated by reference herein in its entirety.
BACKGROUND
[0002] The present invention relates to binaural audio synthesis.
[0003] 3D audio or binaural synthesis may refer to a technique used to process audio in such a way that a sound may be positioned anywhere in 3D space. The positioning of sounds in 3D space may give a user the effect of being able to hear a sound over a pair of headphones, or from another source, as if it came from any direction (for example, above, below or behind). 3D audio or binaural synthesis may be used in applications such as games, virtual reality or augmented reality to enhance the realism of computer-generated sound effects supplied to the user.
[0004] When a sound comes from a source far away from a listener, the sound received by each of the listener’s ears may, for example, be affected by the listener’s head, outer ears (pinnae), shoulders and/or torso before entering the listener’s ear canals. For example, the sound may experience diffraction around the head and/or reflection from the shoulders.
[0005] If the source is to one side of the listener, the sound received from the source may be received at different times by the left and right ears. The time difference between the sound received at the left and right ears may be referred to as an Interaural Time Delay (ITD). The amplitude of the sound received by the left and right ears may also differ. The difference in amplitude may be referred to as an Interaural Level Difference (ILD).
[0006] Binaural synthesis may aim to process monaural sound (a single channel of sound) into binaural sound (a channel for each ear, for example a channel for each headphone of a set of headphones) such that it appears to a listener that sounds originate from sources at different positions relative to the listener, including sounds above, below and behind the listener.
[0007] A head-related transfer function (HRTF) is a transfer function that may capture the effect of the human head (and optionally other anatomical features) on sound received at each ear. The information of the HRTF may be expressed in the time domain through the head-related impulse response (HRIR). Binaural sound may be obtained by applying an HRIR to a monaural sound input.
[0008] It is known to obtain an HRTF (and/or an HRIR) by measuring sound using two microphones placed at ear positions of an acoustic manikin. The acoustic manikin may provide a representative head shape and ear spacing and, optionally, the shape of representative pinnae, shoulders and/or torso.
[0009] Methods are known in which finite impulse response (FIR) filter coefficients are generated from HRIR measurements. The HRIR-generated FIR coefficients are convolved with an input audio signal to synthesise binaural sound. A FIR filter generated from HRIR measurements may be a high-order filter, for example a filter of between 128 and 512 taps. An operation of convolving the FIR filter with an input audio signal may be computationally intensive, particularly when the relative positions of the source and the listener change over time.
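The HRIR convolution described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the function names and the 3-tap impulse responses are hypothetical placeholders, whereas a measured HRIR would typically have far more taps.

```python
def fir_filter(signal, taps):
    """Direct-form FIR convolution, truncated to len(signal) output samples."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * signal[n - k]
        out.append(acc)
    return out


def binaural_from_mono(mono, hrir_left, hrir_right):
    """Apply one impulse response per ear to the same monaural input."""
    return fir_filter(mono, hrir_left), fir_filter(mono, hrir_right)


# Toy impulse responses (illustrative values, not measured HRIRs).
mono = [1.0, 0.0, 0.0, 0.0]
left, right = binaural_from_mono(mono, [0.5, 0.25, 0.0], [0.0, 0.5, 0.25])
```

Each output sample costs one multiply-add per tap per ear, which is why a filter of several hundred taps may be expensive when many sources are processed at once.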
[0010] It has been suggested to approximate an HRIR using a computational model, for example a structural model. A structural model may simulate the effect of a listener’s body on sound received by the listener’s ears. In one such structural model, effects of the head, pinnae and shoulders are modeled. The structural model combines an infinite impulse response (IIR) head-shadow model with an FIR pinna-echo model and an FIR shoulder-echo model.
SUMMARY
[0011] In a first aspect of the invention, there is provided a method comprising obtaining filter coefficients for a binaural synthesis filter; and applying a compensation filter to reduce artefacts resulting from the binaural synthesis filter; wherein the filter coefficients and compensation filter are configured to be used to obtain binaural audio output from a monaural audio input. The filter coefficients and compensation filter may be applied to the monaural audio input to obtain the binaural audio output. The compensation filter may comprise a timbre compensation filter.
[0012] The artefacts may be artefacts that are introduced by the binaural synthesis filter itself. By reducing artefacts resulting from the binaural synthesis filter, binaural processing may result in a better quality output than may be the case if the artefacts were not reduced. By reducing artefacts resulting from the binaural synthesis filter, the binaural audio output may be more similar to the monaural audio input and/or more similar to that of an original audio source than would otherwise be the case. A user’s perception of the binaural audio output may be more similar to the user’s perception of the monaural audio input than would otherwise be the case.
[0013] The artefacts may comprise a reduction in quality of a binaural audio output. The reduction in quality of the binaural audio output may comprise the quality of the binaural audio output being lower than the quality of the monaural audio input. The artefacts may comprise at least one of a change in amplitude of a binaural audio output, a change in delay of a binaural audio output, a change in frequency of a binaural audio output. The artefacts may comprise at least one of a change in amplitude of a binaural audio output with respect to an amplitude of the monaural audio input, a change in delay of a binaural audio output with respect to a delay of the monaural audio input, a change in frequency of a binaural audio output with respect to a frequency of the monaural audio input.
[0014] The timbre of a sound may comprise a property or properties of the sound that is experienced by the user as imparting a particular tone or colour to the sound. Thus for example, two sounds may have the same pitch and loudness but may have different timbres and thus may sound different, for example to a human listener. Timbre for example may comprise one or more of at least one spectral envelope property, at least one time envelope property, at least one modulation or shift in time envelope, fundamental frequency or time envelope, at least one variation of amplitude with time and/or frequency. By reducing artefacts resulting from the binaural synthesis filter, a timbre of the binaural audio output may be more similar to a timbre of the monaural audio input than would otherwise be the case. A user may experience the timbre of the binaural audio output to be similar to a timbre of the monaural audio input.
[0015] In some audio systems, timbre may be particularly relevant. For example, in high quality audio systems, it may be preferable that binaural processing does not make any discernible change in the timbre of the sound that is experienced by a user. A change in timbre may be experienced by the user as distortion and/or poor quality audio reproduction.
[0016] In some systems, it may be preferable for a user to experience accurate timbre reproduction even at the expense of decreased accuracy of binaural effects, for example decreased localisation.
[0017] The timbre compensation filter may be determined independently of physical properties of at least part of the audio system. The timbre compensation filter may be determined independently of physical properties of headphones. The timbre compensation filter may be determined independently of physical characteristics of a user. Thus, for example, physical properties of at least part of the audio system and/or physical characteristics of a user may be not used as inputs in determining the timbre compensation filter.
[0018] The binaural audio output may occupy a frequency range. The artefacts may be present in a sub-region of the frequency range. The sub-region may comprise audible frequencies of the human voice. The sub-region may comprise frequencies that are relevant to the perceived timbre of the human voice.
[0019] The sub-region of the frequency range may be a portion of the frequency range that is above a lower portion of the frequency range. The artefacts may be not present in the lower portion of the frequency range. The artefacts may be more severe in the sub-region than in a portion of the frequency range that is lower in frequency than the sub-region. The artefacts may be more severe in the sub-region than in a further portion of the frequency range that is higher in frequency than the sub-region. The sub-region may comprise a range of frequencies in which the artefacts are greater than are the artefacts in other parts of the frequency range.
[0020] The artefacts may comprise an increase in gain in the sub-region. Reducing the artefacts may comprise reducing the gain in the sub-region, such as to at least partially compensate for the artefacts. The gain may be substantially unchanged by the timbre compensation in at least one region of the frequency range that is outside the sub-region.
[0021] The sub-region may comprise a range of frequencies from 500 Hz to 10 kHz, optionally from 1 kHz to 6 kHz, further optionally from 1 kHz to 3 kHz. The sub-region may comprise frequencies above 500 Hz, optionally frequencies above 1 kHz, further optionally frequencies above 2 kHz, further optionally frequencies above 3 kHz. Frequencies between 1 kHz and 6 kHz may be important for speech intelligibility.
[0022] The sub-region may comprise a range of frequencies from 80 Hz to 400 Hz. A range from 80 Hz to 400 Hz may be important for good low frequency reproduction which may be useful for music.
[0023] In professional audio, a range of frequencies between 20 Hz and 20 kHz may be of importance. The timbre compensation filter may be such that the binaural system may change the frequency spectrum between 20 Hz and 20 kHz as little as possible.
[0024] Applying the compensation filter to reduce artefacts may comprise a greater reduction in artefacts in the sub-region than in other parts of the frequency range.
[0025] Applying the compensation filter may comprise applying the compensation filter to the filter coefficients to obtain adjusted coefficients for the binaural synthesis filter.
[0026] Applying the compensation filter to the filter coefficients may provide a computationally efficient method of reducing artefacts. Applying the compensation filter to the filter coefficients may be faster and/or more computationally efficient than applying a filter to the binaural audio output.
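The efficiency argument rests on the associativity of convolution: filtering the audio and then compensating the output gives the same result as compensating the coefficients once and then filtering. A minimal sketch, with hypothetical tap values chosen only for illustration:

```python
def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


synthesis_taps = [1.0, 0.5, 0.25]   # hypothetical binaural synthesis filter
compensation_taps = [1.0, -0.5]     # hypothetical compensation filter
audio = [1.0, 2.0, 3.0, 4.0]

# Option 1: synthesise, then compensate the output (two passes over the audio).
post = convolve(convolve(audio, synthesis_taps), compensation_taps)

# Option 2: compensate the coefficients once, then make a single pass over the audio.
adjusted_taps = convolve(synthesis_taps, compensation_taps)
pre = convolve(audio, adjusted_taps)
```

With the second option the compensation cost is paid once per set of coefficients rather than once per block of audio.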
[0027] The method may further comprise receiving a monaural audio input corresponding to at least one audio source, each audio source having an associated position. The method may further comprise synthesising binaural audio output from the monaural audio input using the binaural synthesis filter. The synthesising may be in dependence on the position or positions of each audio source. By performing binaural synthesis in dependence on audio source positions, a user may experience sound from each of the audio sources as coming from the position of that audio source.
[0028] The synthesising of the binaural audio output may use the adjusted filter coefficients.
[0029] The filter coefficients may be adjusted by the timbre compensation filter such that binaural audio output synthesised using the adjusted coefficients has a different timbre from binaural audio output synthesised using the filter coefficients, thereby reducing the effect of the artefacts.
[0030] The synthesising may be performed in real time. The position of each audio source may change with time. The synthesising of the binaural audio output may be updated with the changing position of the audio source or sources.
[0031] By performing synthesis in real time, the synthesis may respond to changes in the scene. For example, in a computer game, a user may experience an effect of moving through the scene. The binaural audio output may be synthesised in response to changing positions, for example changing positions, optionally relative positions, of the user and/or the audio sources.
[0032] The method may further comprise generating the timbre compensation filter from the filter coefficients. Generating the timbre compensation filter from the filter coefficients may comprise applying a filter defined by the filter coefficients to a test audio input to obtain an impulse response; obtaining a transfer function by applying a Fourier transform to the impulse response; and generating the timbre compensation filter from the transfer function.
[0033] Generating the timbre compensation filter may comprise generating coefficients for the timbre compensation filter. The timbre compensation filter may comprise a finite impulse response filter.
[0034] Generating the timbre compensation filter from the transfer function may comprise inverting the transfer function to obtain an inverse transfer function. Generating the timbre compensation filter may comprise smoothing at least one of the transfer function, the inverse transfer function, the impulse response. Generating the timbre compensation filter may comprise obtaining a new impulse response from the inverse transfer function.
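The sequence of impulse response, transfer function, inversion, smoothing and new impulse response can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the description: the test input is a unit impulse, only the magnitude of the transfer function is inverted, smoothing is a circular moving average, a small floor guards against division by near-zero magnitudes, and the result is rotated to make it causal. All function and parameter names are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (adequate for short filters)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def smooth(values, width):
    """Circular moving average over `width` bins (width=1 leaves values unchanged)."""
    n = len(values)
    half = width // 2
    return [sum(values[(k + d) % n] for d in range(-half, half + 1)) / (2 * half + 1)
            for k in range(n)]


def compensation_from_taps(taps, n_fft=64, smooth_width=1, floor=1e-3):
    # A unit impulse as test input makes the filter output equal its impulse response.
    impulse_response = list(taps) + [0.0] * (n_fft - len(taps))
    transfer = dft(impulse_response)
    magnitude = smooth([abs(v) for v in transfer], smooth_width)
    inverse = [1.0 / max(m, floor) for m in magnitude]  # magnitude-only inversion
    zero_phase = [v.real for v in idft(inverse)]        # symmetric, real response
    half = n_fft // 2
    return zero_phase[half:] + zero_phase[:half]        # rotate to a causal response


synthesis_taps = [1.0, 0.5]  # hypothetical coefficients
compensation = compensation_from_taps(synthesis_taps)
```

Without smoothing, the product of the two magnitude responses is flat at every frequency bin. Smoothing trades that exactness for a shorter, better-behaved compensation filter.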
[0035] Generating the timbre compensation filter may further comprise reducing the effect of the timbre compensation filter at low frequencies, optionally wherein the low frequencies comprise frequencies below 400 Hz. The timbre compensation filter may be altered such that the low frequencies remain substantially unchanged by the timbre compensation filter. The low frequencies may comprise frequencies below 1 kHz, optionally frequencies below 500 Hz, further optionally frequencies below 300 Hz. Reducing the effect of the timbre compensation at low frequencies may mean that the original low frequency response of the binaural synthesis filter is retained.
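One simple way to leave low frequencies unchanged is to force the compensation gain to unity below a cutoff. The sketch below assumes the gains are stored per DFT bin for a real filter and uses a hard switch at 400 Hz. The function name, the hard switch and the sampling rate are illustrative assumptions, and a gradual transition may be preferable in practice.

```python
def protect_low_frequencies(gains, sample_rate, cutoff_hz=400.0):
    """Return a copy of `gains` with unity gain in every bin below `cutoff_hz`."""
    n_fft = len(gains)
    protected = []
    for k, gain in enumerate(gains):
        # Bins above n_fft / 2 mirror the negative frequencies of a real filter.
        frequency = min(k, n_fft - k) * sample_rate / n_fft
        protected.append(1.0 if frequency < cutoff_hz else gain)
    return protected


gains = [0.5] * 64  # hypothetical compensation gains, one per DFT bin
protected = protect_low_frequencies(gains, sample_rate=48000)
```

With 64 bins at 48 kHz the bin spacing is 750 Hz, so only the DC bin falls below the cutoff in this toy case.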
[0036] The timbre compensation filter may correct frequencies below 400 Hz. The binaural synthesis filter may result in a boost in low frequencies. Such a boost in low frequencies may be corrected by the timbre compensation filter.
[0037] Generating the timbre compensation filter may comprise generating the timbre compensation filter for each of a plurality of sampling rates. By generating the timbre compensation filter for a plurality of sampling rates, the timbre compensation filter may be used in a range of different audio systems, even if the different audio systems have different sampling rates. In some circumstances, having a plurality of sampling rates may make any resampling of coefficients of the timbre compensation filter easier, since it may be more likely that a resampling will comprise resampling to an integer multiple of a sampling rate that has already been calculated.
[0038] Generating the timbre compensation filter may comprise truncating the timbre compensation filter. Generating the timbre compensation filter may comprise truncating the timbre compensation filter to an order no higher than an order of the binaural synthesis filter.
[0039] The binaural synthesis filter may comprise a first number of taps. The binaural synthesis filter may comprise 32 taps. The binaural synthesis filter may comprise between 20 and 40 taps.
[0040] The timbre compensation filter may comprise a second number of taps. The second number of taps may be fewer than or equal to the first number of taps. The second number of taps may be fewer than the first number of taps. The timbre compensation filter for a first sampling rate may have a different number of taps than the timbre compensation filter for a second sampling rate. A timbre compensation filter for a first sampling rate may have 27 taps and a timbre compensation filter for a second sampling rate may have 31 taps.
[0041] By providing a timbre compensation filter having fewer taps than the binaural synthesis filter, the application of the timbre compensation filter to the binaural synthesis filter may be performed in a way that is computationally efficient.
[0042] Adjusted coefficients obtained by applying the timbre compensation filter to the binaural synthesis filter may have a number of taps that is the same as the number of taps of the binaural synthesis filter. Computations performed using the adjusted coefficients may require no more computational resources than computations performed using the filter coefficients. Computations performed using the adjusted coefficients may be as fast as computations performed using the filter coefficients.
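The tap counts can be illustrated with a short sketch. Convolving a 32-tap filter with a 27-tap filter yields 58 taps, so some reduction is needed for the adjusted coefficients to keep 32 taps. Keeping only the first 32 taps, as done here, is an assumption made for illustration, and the function names are hypothetical.

```python
def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def adjust_and_truncate(synthesis_taps, compensation_taps):
    """Fold the compensation into the synthesis taps, keeping the original tap count."""
    full = convolve(synthesis_taps, compensation_taps)
    return full[:len(synthesis_taps)]


synthesis_taps = [1.0] + [0.0] * 31     # 32 taps (identity filter, for illustration)
compensation_taps = [1.0] + [0.0] * 26  # 27 taps (identity filter, for illustration)
full_length = len(convolve(synthesis_taps, compensation_taps))
adjusted = adjust_and_truncate(synthesis_taps, compensation_taps)
```

Because the adjusted filter has the same length as the original, the per-sample cost of the real-time convolution is unchanged.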
[0043] The test audio input may comprise an audio input having a known frequency profile. The generating of the timbre compensation filter may be in dependence on a difference between a frequency profile of the binaural audio output and the known frequency profile of the test audio input.
[0044] The test audio input may comprise white noise. The test audio input may have a frequency profile that is flat with frequency for at least a portion of the frequency range. The generating of the timbre compensation may comprise determining a difference between a frequency profile of the binaural output and a flat frequency profile for at least a portion of the frequency range.
[0045] The binaural synthesis filter may comprise a pinna model filter. Synthesising the binaural audio output may comprise applying the pinna model filter; applying an interaural time delay; and applying a head shadow filter.
[0046] The method may comprise determining values for the interaural time delay using the equation:
$$T(\theta, \phi) = \begin{cases} -\dfrac{a}{c}\cos(\theta)\cos(\phi), & 0 \le \theta < \dfrac{\pi}{2} \\ \dfrac{a}{c}\left(\theta - \dfrac{\pi}{2}\right)\cos(\phi), & \dfrac{\pi}{2} \le \theta \le \pi \end{cases}$$
[0047] wherein $T(\theta, \phi)$ is the interaural time delay, $a$ is an average head size, $c$ is the speed of sound, $\theta$ is azimuth angle in radians and $\phi$ is elevation angle in radians.
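The piecewise interaural time delay equation can be evaluated directly. In the sketch below the default head size (0.0875 m) and speed of sound (343 m/s) are assumed values chosen for illustration, since the description does not fix them, and the function name is hypothetical.

```python
import math


def interaural_time_delay(azimuth, elevation, head_size=0.0875, speed_of_sound=343.0):
    """Evaluate T(theta, phi) in seconds; angles in radians, 0 <= azimuth <= pi."""
    if not 0.0 <= azimuth <= math.pi:
        raise ValueError("azimuth must lie in [0, pi]")
    scale = head_size / speed_of_sound
    if azimuth < math.pi / 2:
        return -scale * math.cos(azimuth) * math.cos(elevation)
    return scale * (azimuth - math.pi / 2) * math.cos(elevation)


delay_front = interaural_time_delay(0.0, 0.0)         # equals -a/c
delay_side = interaural_time_delay(math.pi / 2, 0.0)  # the two branches meet at zero
delay_rear = interaural_time_delay(math.pi, 0.0)      # equals (a/c) * pi/2
```

The two branches agree at an azimuth of π/2, so the delay varies continuously as a source moves past that angle.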
[0048] The method may comprise determining values for the head shadow filter using the equation:
$$H(\omega, \theta) = \frac{1 + j(\alpha\,\omega)}{1 + \left(\dfrac{j\omega}{2\omega_0}\right)}, \quad 0 \le \alpha \le 2$$
[0049] wherein $H(\omega, \theta)$ is a head shadow filter value, $\theta$ is azimuth angle in degrees, $\omega$ is radian frequency, $a$ is an average head size, $c$ is the speed of sound, $\omega_0 = c/a$, and
$$\alpha(\theta) = 1.05 + 0.95\cos\left(\theta\,\frac{\pi}{180}\right).$$
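The angle-dependent coefficient α(θ) is simple to evaluate, and doing so confirms that it stays within the stated range of 0 to 2. The sketch below covers only α(θ), not the full head shadow filter, and the function name is hypothetical.

```python
import math


def alpha(azimuth_deg):
    """Head shadow coefficient alpha(theta), with theta in degrees."""
    return 1.05 + 0.95 * math.cos(azimuth_deg * math.pi / 180.0)


alpha_front = alpha(0.0)    # maximum, 1.05 + 0.95
alpha_side = alpha(90.0)    # midpoint, about 1.05
alpha_rear = alpha(180.0)   # minimum, 1.05 - 0.95
```

The coefficient falls from 2.0 at 0 degrees to 0.1 at 180 degrees.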
[0050] Obtaining filter coefficients may comprise obtaining filter coefficients for each of a plurality of angular positions. Each angular position may comprise an azimuth angle and an elevation angle. Applying the timbre compensation filter may comprise, for each angular position, applying the timbre compensation filter to the filter coefficients for that angular position to obtain adjusted filter coefficients for that angular position. Filter coefficients for the plurality of angular positions may be stored in a look up table. By storing the filter coefficients in a look up table, the filter coefficients may be quickly accessed in a real time process.
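A look up table keyed by angular position might be organised as in the following sketch. The 15 degree increments, the nearest-neighbour lookup and the placeholder coefficient function are assumptions made for illustration. In practice each entry would hold the adjusted filter coefficients for that position.

```python
def build_lookup_table(coefficients_for, azimuth_step=15, elevation_step=15):
    """Precompute coefficients for every (azimuth, elevation) grid point, in degrees."""
    table = {}
    for azimuth in range(0, 360, azimuth_step):
        for elevation in range(-90, 91, elevation_step):
            table[(azimuth, elevation)] = coefficients_for(azimuth, elevation)
    return table


def lookup(table, azimuth, elevation, azimuth_step=15, elevation_step=15):
    """Return the stored coefficients for the grid point nearest the requested angles."""
    az = int(round(azimuth / azimuth_step)) * azimuth_step % 360
    el = int(round(elevation / elevation_step)) * elevation_step
    el = max(-90, min(90, el))
    return table[(az, el)]


# Placeholder standing in for the real per-position coefficient calculation.
def placeholder_coefficients(azimuth, elevation):
    return [float(azimuth), float(elevation)]


table = build_lookup_table(placeholder_coefficients)
nearest = lookup(table, 44, 10)
wrapped = lookup(table, 359, -88)
```

The table is built once, for example at initialisation, so that the real-time path performs only a dictionary access per source position.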
[0051] The filter coefficients may be obtained as part of an initialisation process.
[0052] In a further aspect of the invention, which may be provided independently, there is provided a method comprising obtaining filter coefficients for a binaural synthesis filter; and generating a compensation filter from the filter coefficients, wherein the compensation filter is configured to reduce artefacts resulting from the binaural synthesis filter. The compensation filter may comprise a timbre compensation filter. The filter coefficients and compensation filter may be configured to be applied to a monaural audio input to obtain binaural audio output.
[0053] The compensation filter may be generated from filter coefficients for a single angular position. The generating of the compensation filter may be performed offline.
[0054] In a further aspect of the invention, which may be provided independently, there is provided a method comprising receiving a monaural audio signal corresponding to at least one audio source, each audio source having an associated position; and synthesising binaural audio output from the monaural audio signal using a binaural synthesis filter, wherein the synthesising is in dependence on the position or positions of each audio source. The binaural synthesis filter may use filter coefficients that have been adjusted using a compensation filter to reduce artefacts resulting from the binaural synthesis filter. The compensation filter may comprise a timbre compensation filter.
[0055] The synthesising of the binaural audio output may be performed in real time.
[0056] In a further aspect of the invention, which may be provided independently, there is provided an apparatus comprising: means for obtaining filter coefficients for a binaural synthesis filter; and means for applying a timbre compensation filter to reduce artefacts resulting from the binaural synthesis filter; wherein the filter coefficients and timbre compensation filter are configured to be applied to a monaural audio input to obtain binaural audio output.
[0057] In a further aspect of the invention, which may be provided independently, there is provided an apparatus comprising a processor configured to: obtain filter coefficients for a binaural synthesis filter; and apply a timbre compensation filter to reduce artefacts resulting from the binaural synthesis filter; wherein the filter coefficients and timbre compensation filter are configured to be applied to a monaural audio input to obtain binaural audio output.
[0058] In another aspect of the invention, which may be provided independently, there is provided a method comprising obtaining a monaural audio input representative of an audio source, selecting at least two binaural synthesis models, obtaining a respective binaural audio output for each of the binaural synthesis models by applying coefficients of each binaural synthesis model to the monaural audio input, and obtaining a combined binaural audio output by combining the respective binaural audio outputs from each of the at least two models.
[0059] In a further aspect of the invention, which may be provided independently, there is provided a method comprising: obtaining a monaural audio input representative of audio input from a plurality of audio sources; for each audio source, selecting at least one binaural synthesis model from a plurality of binaural synthesis models and applying the at least one binaural synthesis model to audio input from that audio source to obtain at least one binaural audio output; and obtaining a combined binaural audio output by combining binaural audio outputs from each of the plurality of binaural synthesis models.
[0060] The plurality of binaural synthesis models may comprise at least one of an HRIR binaural synthesis model, a structural model, and a virtual speakers model.
[0061] A first (for example, higher-quality) binaural synthesis model may be selected for a first (for example, higher-priority) audio source. A second (for example, lower-quality) binaural synthesis model may be selected for a second (for example, lower-priority) audio source. A first (for example, more computationally intensive) binaural synthesis model may be selected for a first (for example, higher-priority) audio source. A second (for example, less computationally intensive) binaural synthesis model may be selected for a second (for example, lower-priority) audio source.
[0062] By providing different binaural synthesis models, different trade-offs may be made in computation. For example, a high-quality, computationally intensive binaural synthesis method may always be selected for a very important audio source. For some other audio sources, a high-quality, computationally intensive binaural synthesis method may be used only when the audio source is close to the position with respect to which the binaural synthesis is performed. When the audio source is further away, a lower quality and less computationally intensive method of binaural synthesis may be used.
[0063] Selecting binaural synthesis methods may result in improved or more efficient use being made of the available resources. Where computational resources are not able to synthesise all audio sources at the highest possible quality, it is possible to select which audio sources use the highest-quality binaural synthesis, while performing a lower-quality binaural synthesis for the other audio sources. The user may not notice that a lower-quality binaural synthesis may be used on, for example, sounds that are fainter, farther away, or less interesting to the user.
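One way to allocate limited computational resources is to give every source the cheaper model and then upgrade sources in priority order while budget remains. The sketch below is an illustrative assumption rather than a scheme taken from the description: the cost units, the model labels and the function name are all hypothetical.

```python
def assign_models(sources, budget, high_cost=4, low_cost=1):
    """Map each source name to a model label; `sources` is a list of (name, priority)."""
    assignment = {}
    # Reserve the cheap model for every source, then spend what is left on upgrades.
    remaining = budget - low_cost * len(sources)
    upgrade = high_cost - low_cost
    for name, _priority in sorted(sources, key=lambda s: s[1], reverse=True):
        if remaining >= upgrade:
            assignment[name] = "high_quality"
            remaining -= upgrade
        else:
            assignment[name] = "low_cost"
    return assignment


sources = [("dialogue", 3), ("footsteps", 1), ("music", 2)]
assignment = assign_models(sources, budget=7)
```

With a budget of 7 units only the highest-priority source is upgraded. A distance-based rule could replace the priority key without changing the structure.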
[0064] The selecting of the binaural synthesis models may be dependent on a distance of each audio source from a position, for example the position with respect to which the binaural synthesis is performed, or on another property of each audio source.
[0065] For an audio source of the plurality of audio sources, selecting at least one binaural synthesis model for the audio source may comprise selecting a first binaural synthesis model and a second, different binaural synthesis model. The combined audio output may comprise a first proportion of an audio output for the audio source from the first binaural synthesis model and a second proportion of an audio output for the audio source from the second binaural synthesis model.
[0066] The position of the audio source may change over time, and the first proportion and second proportion may change with time in accordance with the changing position of the audio source.
[0067] In some circumstances, the position of an audio source may change such that it is desirable to change the binaural synthesis model that is used to synthesise that audio source. For example, a source may move from being nearer (in which case a higher-quality synthesis model is selected) to being further away (in which case a lower-quality synthesis method is selected). However, if a change between synthesis methods were performed very quickly (for example, between one frame and the next), the change may be perceptible to the user. By using two synthesis methods at once, the output of one may be faded down and the output of the other faded up, so that the change in synthesis method is not perceptible to the user.
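The fade between two synthesis models can be sketched as a weighted mix whose proportions follow the source distance. The linear ramp, the complementary weights summing to one, and the near and far thresholds are assumptions chosen for illustration. The function names are hypothetical.

```python
def proportion_near(distance, near=2.0, far=10.0):
    """Weight of the first model: 1.0 at or inside `near`, 0.0 at or beyond `far`."""
    if distance <= near:
        return 1.0
    if distance >= far:
        return 0.0
    return (far - distance) / (far - near)


def crossfade(output_a, output_b, proportion_a):
    """Mix two model outputs sample by sample using complementary proportions."""
    proportion_b = 1.0 - proportion_a
    return [proportion_a * a + proportion_b * b for a, b in zip(output_a, output_b)]


weight = proportion_near(6.0)                        # halfway between near and far
mixed = crossfade([1.0, 1.0], [0.0, 0.0], weight)
```

As the source moves from the near threshold to the far threshold, the weight falls gradually from 1 to 0, so the change of model is spread over many frames rather than occurring between one frame and the next.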
[0068] Each of the plurality of binaural synthesis models may comprise a respective timbre compensation filter. The timbre compensation filters may be configured to match timbre between the binaural synthesis models.
[0069] The binaural synthesis models are selected in dependence on at least one of: a CPU frequency, a computational resource limit, a computational resource parameter, a quality requirement.
[0070] The binaural synthesis models may be selected in dependence on a priority of each audio source, a distance associated with each audio source, a quality requirement of each audio source, an amplitude of each audio source.
[0071] In another aspect of the invention, which may be provided independently, there is provided an apparatus comprising a processing resource configured to perform a method as claimed or described herein.
[0072] The apparatus may further comprise an input device configured to receive audio input representing sound from at least one audio source, wherein the processing resource is configured to obtain binaural audio output by processing the audio input using the binaural synthesis filter and the timbre compensation filter, and wherein the apparatus may further comprise an output device configured to output the binaural audio output.
[0073] In another aspect of the invention, which may be provided independently, there is provided a computer program product comprising computer readable instructions that are executable by a processor to perform a method as claimed or described herein.
[0074] There may also be provided an apparatus or method substantially as described herein with reference to the accompanying drawings.
[0075] Any feature in one aspect of the invention may be applied to other aspects of the invention, in any appropriate combination. For example, apparatus features may be applied to method features and vice versa.
BRIEF DESCRIPTION OF DRAWINGS
[0076] Embodiments of the invention are now described, by way of non-limiting examples, and are illustrated in the following figures, in which:
[0077] FIG. 1 is a schematic diagram of an audio system according to an embodiment; FIG. 2 is a flow chart illustrating in overview the process of an embodiment;
[0078] FIG. 3 is a plot of an exemplary frequency response of a pinna FIR filter;
[0079] FIG. 4 is a plot of an inverted frequency response;
[0080] FIG. 5 is a flow chart illustrating in overview the process of an embodiment;
[0081] FIG. 6 is a flow chart illustrating in overview the process of an embodiment;
[0082] FIG. 7 is a flow chart illustrating in overview the process of an embodiment.
DETAILED DESCRIPTION OF EMBODIMENTS
[0083] An audio system 10 according to an embodiment is illustrated schematically in FIG. 1. The audio system comprises a computing apparatus 12 that is configured to receive monaural audio input from an input device, for example in the form of an external source or data store 14, process the audio input to obtain a binaural output comprising a left output and a right output, and to deliver the binaural output to an output device, for example headphones 16a, 16b. The left output is delivered to left headphone 16a and the right output is delivered to right headphone 16b. In other embodiments, the binaural output may be delivered to at least two loudspeakers. For example, the left output may be delivered to a left loudspeaker and the right output may be delivered to a right loudspeaker. In some embodiments, the monaural audio input may be generated by or stored in computing apparatus 12 rather than being received from an external source or data store 14.
[0084] The computing apparatus 12 comprises a processor 18 for processing audio data and a memory 20 for storing data, for example for storing filter coefficients. The computing apparatus 12 also includes a hard drive and other components including RAM, ROM, a data bus, an operating system including various device drivers, and hardware devices including a graphics card. Such components are not shown in FIG. 1 for clarity.
[0085] In the embodiment of FIG. 1, a single computing apparatus 12 is configured to calculate and store filter coefficients of a structural model, calculate and store timbre filter coefficients, perform an initialisation by applying the timbre filter to the filter coefficients to obtain adjusted filter coefficients, and synthesise binaural audio output from monaural audio input using the adjusted filter coefficients. The processes performed by the computing apparatus 12 may include some offline processes and some real time processes. For example, calculation of timbre filter coefficients may be performed offline. Initialisation may be performed on start-up of an application. The synthesising of the binaural output may be performed in real time.
[0086] In other embodiments, audio system 10 may comprise a plurality of computing apparatuses. For example, a first computing apparatus may perform the calculation of timbre filter coefficients and a second, different computing apparatus may use the timbre filter coefficients to obtain adjusted filter coefficients and synthesise binaural audio output.
[0087] The system of FIG. 1 is configured to perform the method of an embodiment as described below with reference to FIGS. 2, 5 and 6.
[0088] A structural model is used to model the effect of the head and pinnae of a listener on sound received by the listener, so as to simulate binaural effects in audio channels supplied to a user’s left and right ear. By providing different input to the left ear than to the right ear, the user is given the impression that an audio source originates from a particular position in space, or that each of a plurality of audio sources originates from a respective position in space. For example, the user may perceive that they are hearing sound from one source that is in front and to the right of them, and from another source that is directly behind them.
[0089] The structural model comprises a pinna filter, left and right interaural time delay (ITD) filters, and left and right head shadow filters. In the present embodiment, the pinna filter is applied to the audio input before the time delay filters and head shadow filters. In alternative embodiments, the pinna, ITD, and head shadow filters may be applied in any order.
[0090] The pinna filter is a FIR (finite impulse response) filter. Initial pinna FIR coefficients are obtained offline as described below with reference to stage 30 of FIG. 2 and stage 60 of FIG. 5. The initial pinna FIR coefficients are used to determine coefficients for a timbre filter as described below with reference to FIG. 2, the determining of the coefficients for a timbre filter being an offline process.
[0091] The initial pinna FIR coefficients and timbre filter are used as input to an initialisation process for a real-time binaural synthesis method. The initialisation process is described below with reference to FIG. 5. In the initialisation process, the initial pinna FIR coefficients and timbre filter are used to obtain adjusted pinna FIR coefficients at angular increments. The adjusted pinna FIR coefficients are stored in a look up table for use in a real-time binaural synthesis process.
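The look up table of paragraph [0091] can be illustrated with a minimal sketch. The sketch rests on assumptions that are not stated in the text: a 5.degree. increment, a timbre filter expressed as FIR taps, and "applying" the timbre filter taken to mean convolving it with the initial pinna FIR coefficients. The names build_lookup, lookup and toy_initial_fir are hypothetical.

```python
import numpy as np

STEP_DEG = 5  # assumed increment; the text only says "angular increments"


def build_lookup(initial_fir, timbre_fir, step_deg=STEP_DEG):
    """Tabulate adjusted coefficients for every (azimuth, elevation) grid point.

    initial_fir: callable (theta_deg, phi_deg) -> 1-D array of pinna FIR taps.
    timbre_fir:  1-D array of timbre filter taps (assumed to be FIR).
    """
    table = {}
    for theta in range(-90, 91, step_deg):
        for phi in range(-90, 91, step_deg):
            # Assumption: the adjustment is a convolution of the two tap sets.
            table[(theta, phi)] = np.convolve(initial_fir(theta, phi), timbre_fir)
    return table


def lookup(table, theta_deg, phi_deg, step_deg=STEP_DEG):
    """Return the coefficients stored for the nearest tabulated angle."""
    def snap(angle):
        return int(np.clip(round(angle / step_deg) * step_deg, -90, 90))
    return table[(snap(theta_deg), snap(phi_deg))]


# Hypothetical stand-in for the real pinna model, used only to exercise the table.
def toy_initial_fir(theta_deg, phi_deg):
    return np.array([1.0, 0.0, 0.5])


table = build_lookup(toy_initial_fir, np.array([0.5, 0.5]))
coeffs = lookup(table, 12.4, -3.0)  # snaps to the grid point (10, -5)
```

Precomputing the table at initialisation keeps the convolution out of the real-time path, so the synthesis process only performs a dictionary access per source position.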
[0092] The real-time binaural synthesis process is described below with reference to FIG. 6. Monaural audio input is processed using a pinna filter, left and right ITD filters, and left and right head shadow filters to produce binaural audio output. The binaural audio output is supplied to headphones 16a, 16b.
[0093] FIG. 2 is a flow chart showing in overview a method for determining timbre filter coefficients from initial pinna FIR coefficients. The timbre filter coefficients may be generated in such a way that a timbre filter using those coefficients may at least partially compensate for artefacts resulting from the initial pinna FIR coefficients.
[0094] At stage 30, initial pinna FIR coefficients are calculated offline by the processor 18. The initial pinna FIR coefficients are calculated from six pinna events in similar fashion to that described, for example, in Section IV-B of Brown, C. Phillip and Duda, Richard O., "A structural model for binaural sound synthesis", IEEE Transactions on Speech and Audio Processing, Vol. 6, No. 5, September 1998, which is incorporated by reference herein in its entirety. In the present embodiment, the initial pinna FIR coefficients are calculated for each ear and for each of a plurality of angular positions. In the present embodiment, the method of calculating initial pinna FIR coefficients comprises resampling values based on the system sample rate. In other embodiments, any suitable method of calculating initial pinna FIR coefficients may be used.
[0095] Angular positions are described using a (r, .theta., .phi.) coordinate system. An interaural axis connects the ears of a notional listener. The origin of the (r, .theta., .phi.) coordinate system is on the interaural axis, equidistant from the left ear and the right ear. r is the distance from the origin. The elevation coordinate, .phi., is zero at a position directly in front of the listener and increases with height. The azimuth coordinate, .theta., is zero at a position directly in front of the listener. The azimuth .theta. increases with angle to the listener’s right and becomes more negative with angle to the listener’s left. In the present embodiment, the initial pinna FIR coefficients are calculated at every 5.degree. in azimuth and in elevation at stage 30. In other embodiments, initial pinna FIR coefficients are calculated only for one angular position, for example at (.theta.=0, .phi.=0) at stage 30, and initial pinna FIR coefficients for further angular positions are calculated at stage 60 of the process of FIG. 5.
[0096] A reflection coefficient and a time delay are associated with each of the six pinna events. .rho..sub.pn is the reflection coefficient for the nth pinna event, and .tau..sub.pn is the time delay for the nth pinna event. The reflection coefficients .rho..sub.pn are assigned constant values as shown in Table 1 below. Equation 1 is used to determine the time delays .tau..sub.pn, which vary with azimuth and elevation.
$$\tau_{pn}(\theta, \phi) = A_n \cos\left(\frac{\theta}{2}\right) \sin\left[D_n\left(90^\circ - \phi\right)\right] + B_n, \quad -90^\circ \le \theta \le 90^\circ, \quad -90^\circ \le \phi \le 90^\circ \qquad \text{(Equation 1)}$$
[0097] where A.sub.n is an amplitude, B.sub.n is an offset, and D.sub.n is a scaling factor.
[0098] The coefficients for the left ear for an azimuth angle .theta. are the same as those for the right ear for an azimuth angle -.theta.. Equation 1 is given in a general form. For the left ear, the coefficients are calculated with .theta. and for the right ear with -.theta..
[0099] In the present embodiment the values of D.sub.n are constant and do not change for different users. In other embodiments, different values of D.sub.n may be used for different users.
[0100] In the present embodiment, the coefficient values used are those given in Table 1 below. Table 1 gives coefficients for five of the six pinna events (n=2 to n=6). The remaining pinna event (n=1) is an unaltered version of the input. In other embodiments, different coefficient values may be used. A different number of pinna events or a different pinna model may be used. Equation 1 above assumes a sampling rate of 44100 Hz. Other equations may be used for different sampling rates.
TABLE 1
n    .rho..sub.pn    A.sub.n    B.sub.n    D.sub.n
2    0.5             1          2          1
3    -1              5          4          0.5
4    0.5             5          7          0.5
5    -0.25           5          11         0.5
6    0.25            5          13         0.5
[0101] The calculation of the initial pinna FIR coefficients is performed at a sampling rate of 44100 Hz. The time delays calculated may not coincide exactly with sample times. The processor 18 uses linear interpolation to split the amplitudes .rho..sub.pn between adjacent sample points. The resulting pinna FIR filter is a 32 tap filter. In other embodiments, a pinna FIR filter having a different number of taps may be used.
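The calculation of stage 30 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation. It assumes that the delays given by Equation 1 are expressed in samples at 44100 Hz, that the n=1 event is a unit tap at delay zero, and that the linear interpolation divides each amplitude .rho..sub.pn between the two nearest taps in proportion to the fractional delay. The function names are hypothetical.

```python
import numpy as np

# Table 1 values: n -> (rho_pn, A_n, B_n, D_n).
PINNA_EVENTS = {
    2: (0.5, 1.0, 2.0, 1.0),
    3: (-1.0, 5.0, 4.0, 0.5),
    4: (0.5, 5.0, 7.0, 0.5),
    5: (-0.25, 5.0, 11.0, 0.5),
    6: (0.25, 5.0, 13.0, 0.5),
}
NUM_TAPS = 32


def pinna_delay(n, theta_deg, phi_deg):
    """Equation 1: delay of pinna event n (assumed to be in samples)."""
    _, a, b, d = PINNA_EVENTS[n]
    return (a * np.cos(np.radians(theta_deg) / 2.0)
            * np.sin(np.radians(d * (90.0 - phi_deg))) + b)


def pinna_fir(theta_deg, phi_deg, num_taps=NUM_TAPS):
    """Build the pinna FIR taps for one angular position."""
    h = np.zeros(num_taps)
    h[0] = 1.0  # assumption: event n=1 passes the input through unaltered
    for n, (rho, _, _, _) in PINNA_EVENTS.items():
        tau = pinna_delay(n, theta_deg, phi_deg)
        k = int(np.floor(tau))
        frac = tau - k
        # Split the amplitude between the two adjacent sample points.
        h[k] += rho * (1.0 - frac)
        h[k + 1] += rho * frac
    return h


h_front = pinna_fir(0.0, 0.0)
```

At (.theta.=0, .phi.=0) the n=3 event has a delay of about 7.54 samples under these assumptions, so its amplitude of -1 is divided into roughly -0.46 at tap 7 and -0.54 at tap 8. With the Table 1 values the largest possible delay is 18 samples, which fits within the 32 taps.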
[0102] The initial pinna FIR coefficient generation process of stage 30 produces a set of FIR coefficients to model the pinna. It has been found that pinna FIR filters derived using the method of stage 30 may change the timbre of an audio input when applied to that audio input.
[0103] The timbre of a sound may comprise a property or properties of the sound that is experienced by the user as imparting a particular tone or colour to the sound. In some circumstances, the timbre of a sound may indicate to a user which musical instrument or instruments produced that sound. For example, the timbre of a note produced by a violin may be different from the timbre of the same note produced by a trumpet. The timbre may comprise properties of the frequency spectrum of a sound, for example the harmonics within the sound. The timbre may comprise amplitude properties. The timbre may comprise a profile of the sound over time, for example properties of the attack or fading of a particular note.
[0104] It has been found in some known systems that a user listening to a monaural audio signal, and then to a binaural output signal that has been obtained from the monaural audio signal, is likely to experience the binaural audio output as having a different timbre from the monaural audio signal.
[0105] In many applications, it may be preferable for the timbre of a binaural sound to be perceived as similar to the timbre of the monaural sound from which the binaural sound was processed. For example, it may be more important that the user perceives the sound as having the expected timbre than that the user perceives the sound as issuing from its precise position. In the method described below, a timbre compensation filter is used to make the binaural sound more similar to the original monaural sound, while retaining at least part of the effects of binaural processing.
[0106] The timbre of an audio input may relate to the frequency spectrum of that audio input. It has been found that if the initial pinna FIR coefficients of stage 30 are used for binaural synthesis without being modified, the resulting binaural sound output may exhibit a change in timbre that comprises a change in frequency spectrum. The change in timbre may be described as an unnatural boost to the high frequencies. Amplitudes at certain frequency ranges may be increased such that the timbre of sound to which a pinna filter using the initial pinna FIR coefficients has been applied is different to the timbre of the monaural audio input.
[0107] The human ear may be particularly sensitive to sounds in the range of 1 kHz to 6 kHz. Sounds in the range of 1 kHz to 6 kHz may be important in the human voice. It has been found that the initial pinna FIR coefficients of stage 30 may cause an increase in amplitude within the range of 1 kHz to 6 kHz. The increase in amplitude may be at a level that is perceptible by a user. For example, a user may not be aware of a 1 or 2 dB difference in amplitude, but may be aware of a greater difference in amplitude. If the increase in amplitude were not compensated for, a user may experience the binaural sound output as being of poor quality. Artefacts associated with the initial pinna FIR coefficients may cause the user to experience the binaural sound as being distorted.
[0108] In other embodiments, the use of unmodified binaural synthesis filter coefficients may cause artefacts in a binaural audio output that may comprise changes in timbre, changes in amplitude, changes in frequency, changes in delay, changes in quality (for example, changes in noise level or signal to noise) or changes in any other relevant parameter. The binaural synthesis coefficients may be any coefficients of any binaural synthesis model.
[0109] At stages 32 to 48 of the process of FIG. 2, initial pinna FIR coefficients for (.theta.=0, .phi.=0) are used to determine coefficients for a timbre compensation filter using an offline analysis. In other embodiments, respective timbre filter coefficients may be determined for each of a plurality of angular positions. Timbre filter coefficients may be generated using any appropriate method. Although in the present embodiment initial pinna FIR coefficients are used to determine coefficients for a timbre compensation filter, in other embodiments, any coefficients of a binaural synthesis model may be used to determine coefficients for a timbre compensation filter.
[0110] In the present embodiment, the timbre compensation filter is monaural, because at (.theta.=0, .phi.=0) the initial pinna FIR coefficients are the same for the left ear as for the right ear. In other embodiments, a timbre compensation filter may be generated for each ear. The timbre compensation filter for the left ear may be different from the timbre compensation filter for the right ear.
[0111] In the present embodiment, timbre filter coefficients are calculated at two sampling rates. The first sampling rate is 44100 Hz and the second sampling rate is 48000 Hz. In other embodiments, different sampling rates may be used. Timbre filter coefficients may be calculated for any number of sampling rates.
[0112] The flow chart of FIG. 2 shows corresponding stages (32a and 32b, 34a and 34b, 36a and 36b etc.) for each of the sampling rates. Stages for the first sampling rate (32a, 34a, 36a etc.) are described below in detail. The description of stages for the first sampling rate also applies to the stages for the second sampling rate (32b, 34b, 36b etc.) if the sampling rate referred to is changed accordingly.
[0113] At stage 32a, the initial pinna FIR coefficients obtained at stage 30 for angular position (.theta.=0, .phi.=0) are resampled if required. In the present embodiment, the initial pinna FIR coefficients are calculated at a sampling rate of 44100 Hz, which is the same as the first sampling rate. Therefore at stage 32a of the present embodiment, no resampling is performed.
[0114] At stage 34a, the processor 18 determines an impulse response, h(n), for the pinna filter using the initial pinna FIR coefficients for (.theta.=0, .phi.=0). n represents sample number (which may be described as a discretized measure of time). The processor determines the impulse response by inputting white noise into the pinna filter and plotting the output of the pinna filter.
[0115] The impulse response is found in order to correct for the boost to the high frequencies caused by the pinna model. White noise is used because it has constant amplitude with frequency. Any frequency effects seen in the impulse response may be due to the pinna FIR filter and not an effect of the input, since the white noise input does not vary with frequency. In other embodiments, any suitable method of obtaining the impulse response h(n) may be used.
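The white noise measurement of stage 34a can be sketched as below. This is an illustration under assumptions: the noise is Gaussian, the magnitude response is estimated as the ratio of averaged output and input power spectra over many frames, and the four-tap filter h_example is a hypothetical stand-in for the pinna FIR coefficients. The function name and its parameters are likewise hypothetical.

```python
import numpy as np


def estimate_magnitude_response(h, n_frames=200, frame_len=1024, seed=0):
    """Estimate |H| by passing white noise through the FIR filter h."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_frames * frame_len)   # white noise input
    y = np.convolve(x, h)[: len(x)]                 # filter output
    X = np.fft.rfft(x.reshape(n_frames, frame_len), axis=1)
    Y = np.fft.rfft(y.reshape(n_frames, frame_len), axis=1)
    # Averaging the power spectra removes the frame-to-frame randomness of
    # the noise, leaving only the shaping introduced by the filter.
    return np.sqrt(np.mean(np.abs(Y) ** 2, axis=0) / np.mean(np.abs(X) ** 2, axis=0))


h_example = np.array([1.0, 0.0, 0.0, 0.5])  # hypothetical FIR taps
measured = estimate_magnitude_response(h_example)
```

Because the input spectrum is flat on average, any frequency dependence in the measured ratio is attributable to the filter rather than to the input.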
[0116] At stage 36a, a frequency domain transfer function, H(.omega.), is determined from the impulse response, h(n). .omega. is angular frequency in radians per second, .omega.=2.pi.f, where f is frequency. The frequency domain transfer function, H(.omega.), is found by application of a Fourier transform to the impulse response, h(n). In the present embodiment, a fast Fourier transform (FFT) is used.
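Stage 36a can be illustrated with a short sketch that applies an FFT to an impulse response and reports the gain in decibels. For an FIR filter the impulse response is the coefficient sequence itself, which is what this sketch transforms. The zero-padded FFT length of 1024, the function name and the four-tap example filter are assumptions made for illustration.

```python
import numpy as np


def transfer_function_db(h, sample_rate=44100, n_fft=1024):
    """Return (frequencies in Hz, gain in dB) for the impulse response h."""
    H = np.fft.rfft(h, n=n_fft)                     # zero-padded FFT of h(n)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # Floor the magnitude to avoid log10(0) at exact spectral nulls.
    gain_db = 20.0 * np.log10(np.maximum(np.abs(H), 1e-12))
    return freqs, gain_db


h_example = np.array([1.0, 0.0, 0.0, 0.5])  # hypothetical FIR taps
freqs, gain_db = transfer_function_db(h_example)
```

Changing sample_rate to 48000 gives the frequency axis for the second sampling rate, provided the coefficients passed in have been resampled accordingly.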
[0117] FIG. 3 is a plot of the frequency domain transfer function H(.omega.). The horizontal axis of FIG. 3 is frequency f in Hz. The vertical axis of FIG. 3 is gain in dBFS. The input signal level is 0 dBFS.
[0118] Line 50 of FIG. 3 is an averaged and smoothed version of the frequency domain transfer function H(.omega.). The averaged and smoothed response is calculated using a piecewise linear approximation algorithm. The piecewise linear approximation results in a continuous piecewise linear function which is defined on a set of points in the function’s domain which are not necessarily regularly spaced. The points on which the function is defined may be irregularly spaced in order to minimise the number of line segments whilst maintaining an effective approximation. In other embodiments, any method of averaging and/or smoothing may be used.
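The text does not name the piecewise linear approximation algorithm. One possible approach, offered only as a sketch, is a greedy fit that extends each line segment for as long as the straight line between its end points stays within a chosen tolerance of the data. The function name, the tolerance parameter and the test curve are assumptions.

```python
import numpy as np


def piecewise_linear_fit(x, y, tol):
    """Greedy fit returning irregularly spaced breakpoints (x_k, y_k)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    knots = [0]
    start = 0
    while start < len(x) - 1:
        end = start + 1
        # Push the segment end to the right while the chord stays within tol.
        while end + 1 < len(x):
            cand = end + 1
            chord = np.interp(x[start:cand + 1],
                              [x[start], x[cand]], [y[start], y[cand]])
            if np.max(np.abs(chord - y[start:cand + 1])) > tol:
                break
            end = cand
        knots.append(end)
        start = end
    knots = np.array(knots)
    return x[knots], y[knots]


# A V-shaped curve needs only three breakpoints: both ends and the corner.
x = np.linspace(0.0, 10.0, 101)
y = np.abs(x - 5.0)
knot_x, knot_y = piecewise_linear_fit(x, y, tol=1e-6)
approx = np.interp(x, knot_x, knot_y)
```

A larger tolerance yields fewer, longer segments, which trades accuracy for smoothness. Any averaging that the embodiment performs before the fit is not shown here.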
[0119] If the pinna FIR filter did not change the frequency response of the audio input, line 50 would be expected to be flat with frequency. It may be seen that in FIG. 3, line 50 is fairly flat with frequency in the 0 Hz to 1000 Hz range. However, in FIG. 3, the transfer function H(.omega.) displays a clear boost in the high frequencies. Line 50 increases with frequency between 1000 Hz and 6000 Hz and then decreases at higher frequencies. In this example, gain increases from around 13 dBFS at low frequencies (for example, up to about 500 Hz) to around 20 dBFS at around 4 kHz.
[0120] Frequencies between 1000 Hz and 6000 Hz may be particularly relevant to the reproduction of the human voice, for example for speech intelligibility. FIG. 3 illustrates the presence of artefacts in the output of the pinna filter as described above. The artefacts affect the timbre of the output. The artefacts comprise an increase in gain in a sub-region of the frequency range, the sub-region comprising frequencies from 1000 Hz to 6000 Hz. In other embodiments, artefacts may be present in a different frequency range. Different artefacts may occur.
……
……
……