Sony Patent | Signal processing apparatus and method, and program

Editor: 映维 | Category: Sony | November 12, 2021

Patent: Signal processing apparatus and method, and program

Publication Number: 20210352408

Publication Date: 20211111

Applicant: Sony

Assignee: Sony Corporation

Abstract

The present technology relates to a signal processing apparatus and method, and a program that make it possible to reduce an arithmetic operation amount. The signal processing apparatus performs, on the basis of audio object mute information indicative of whether or not a signal of an audio object is a mute signal, at least either one of a decoding process or a rendering process of an object signal of the audio object. The present technology can be applied to a signal processing apparatus.

Claims

1. A signal processing apparatus, wherein, on a basis of audio object mute information indicative of whether or not a signal of an audio object is a mute signal, at least either one of a decoding process or a rendering process of an object signal of the audio object is performed.
2. The signal processing apparatus according to claim 1, wherein, in at least either one of the decoding process or the rendering process, either at least part of arithmetic operation is omitted or a value determined in advance is outputted as a value corresponding to a result of predetermined arithmetic operation according to the audio object mute information.
3. The signal processing apparatus according to claim 1, further comprising: an HRTF processing section that performs an HRTF process on a basis of a virtual speaker signal obtained by the rendering process and used to reproduce sound by a virtual speaker and virtual speaker mute information indicative of whether or not the virtual speaker signal is a mute signal.
4. The signal processing apparatus according to claim 3, wherein the HRTF processing section omits, from within the HRTF process, arithmetic operation for convoluting the virtual speaker signal determined to be a mute signal by the virtual speaker mute information and a transfer function.
5. The signal processing apparatus according to claim 3, further comprising: a mute information generation section configured to generate the audio object mute information on a basis of information regarding a spectrum of the object signal.
6. The signal processing apparatus according to claim 5, further comprising: a decoding processing section configured to perform the decoding process including decoding of spectral data of the object signal encoded by a context-based arithmetic encoding method, wherein the decoding processing section does not perform calculation of a context of the spectral data determined as a mute signal by the audio object mute information but decodes the spectral data by using a value determined in advance as a result of calculation of the context.
7. The signal processing apparatus according to claim 6, wherein the decoding processing section performs the decoding process including decoding of the spectral data and an IMDCT process for the decoded spectral data and outputs zero data without performing the IMDCT process for the decoded spectral data determined as a mute signal by the audio object mute information.
8. The signal processing apparatus according to claim 5, wherein the mute information generation section generates, on a basis of a result of the decoding process, another audio object mute information different from the audio object mute information used in the decoding process, and the signal processing apparatus further includes a rendering processing section configured to perform the rendering process on a basis of the another audio object mute information.
9. The signal processing apparatus according to claim 8, wherein the rendering processing section performs a gain calculation process of obtaining a gain of the virtual speaker for each object signal obtained by the decoding process and a gain application process of generating the virtual speaker signal on a basis of the gain and the object signal as the rendering process.
10. The signal processing apparatus according to claim 9, wherein the rendering processing section omits, in the gain application process, at least either one of arithmetic operation of the virtual speaker signal determined as a mute signal by the virtual speaker mute information or arithmetic operation based on the object signal determined as a mute signal by the another audio object mute information.
11. The signal processing apparatus according to claim 9, wherein the mute information generation section generates the virtual speaker mute information on a basis of a result of the calculation of the gain and the another audio object mute information.
12. The signal processing apparatus according to claim 1, wherein at least either one of the decoding process or the rendering process is performed on a basis of a priority degree of the audio object and the audio object mute information.
13. A signal processing method, wherein a signal processing apparatus performs, on a basis of audio object mute information indicative of whether or not a signal of an audio object is a mute signal, at least either one of a decoding process or a rendering process of an object signal of the audio object.
14. A program for causing a computer to process comprising a step of: performing, on a basis of audio object mute information indicative of whether or not a signal of an audio object is a mute signal, at least either one of a decoding process or a rendering process of an object signal of the audio object.

Description

TECHNICAL FIELD

[0001] The present technology relates to a signal processing apparatus and method, and a program, and particularly to a signal processing apparatus and method, and a program that make it possible to reduce an arithmetic operation amount.

BACKGROUND ART

[0002] In the past, an object audio technology has been used in movies, games, and so forth, and encoding methods capable of handling object audio have also been developed. For example, the MPEG (Moving Picture Experts Group)-H Part 3: 3D audio standard, which is an international standard, and similar standards are known (for example, refer to NPL 1).

[0003] Together with an existing 2-channel stereo method or multichannel stereo method for 5.1 channels or the like, in such an encoding method as described above, it is possible to treat a moving sound source or the like as an independent audio object and to encode position information of an object as metadata together with signal data of the audio object.

[0004] This makes it possible to perform reproduction in various viewing environments that differ in the number or the arrangement of speakers. Further, upon reproduction, the sound of a specific sound source can easily be processed, for example, by adjusting its volume or adding an effect to it, which has been difficult with the existing encoding methods.

[0005] In such encoding methods as described above, decoding of a bit stream is performed by the decoding side such that an object signal that is an audio signal of an audio object and metadata including object position information indicative of the position of the audio object in a space are obtained.

[0006] Then, a rendering process for rendering the object signal to a plurality of virtual speakers that is virtually arranged in the space is performed on the basis of the object position information. For example, in the standard of NPL 1, a method called three-dimensional VBAP (Vector Based Amplitude Panning) (hereinafter referred to simply as VBAP) is used for the rendering process.

[0007] Further, after a virtual speaker signal corresponding to each virtual speaker is obtained by the rendering process, an HRTF (Head Related Transfer Function) process is performed on the basis of the virtual speaker signals. In the HRTF process, an output audio signal for allowing sound to be outputted from an actual headphone or speaker such that it sounds as if the sound were reproduced from the virtual speakers is generated.

CITATION LIST

Non Patent Literature

[0008] [NPL 1]

[0009] INTERNATIONAL STANDARD ISO/IEC 23008-3 First edition 2015-10-15 Information technology–High efficiency coding and media delivery in heterogeneous environments–Part 3: 3D audio

SUMMARY

Technical Problem

[0010] Incidentally, if the rendering process and the HRTF process are performed for the virtual speakers regarding the audio object described above, then audio reproduction can be implemented such that the sound sounds as if it were reproduced from the virtual speakers, and therefore, a high sense of presence can be obtained.

[0011] However, in the object audio, a great amount of arithmetic operation is required for a process for audio reproduction such as a rendering process and an HRTF process.

[0012] In particular, in the case where object audio is to be reproduced with a device such as a smartphone, an increase of the arithmetic operation amount accelerates consumption of the battery, and therefore it is demanded that the arithmetic operation amount be reduced without impairing the sense of presence.

[0013] The present technology has been made in view of such a situation as described above and makes it possible to reduce the arithmetic operation amount.

Solution to Problem

[0014] In a signal processing apparatus according to one aspect of the present technology, on the basis of audio object mute information indicative of whether or not a signal of an audio object is a mute signal, at least either one of a decoding process or a rendering process of an object signal of the audio object is performed.

[0015] A signal processing method or a program according to the one aspect of the present technology includes a step of performing, on the basis of audio object mute information indicative of whether or not a signal of an audio object is a mute signal, at least either one of a decoding process or a rendering process of an object signal of the audio object.

[0016] In the one aspect of the present technology, at least either one of a decoding process or a rendering process of an object signal of the audio object is performed on the basis of the audio object mute information indicative of whether or not the signal of the audio object is a mute signal.

BRIEF DESCRIPTION OF DRAWINGS

[0017] FIG. 1 is a view illustrating a process for an input bit stream.

[0018] FIG. 2 is a view illustrating VBAP.

[0019] FIG. 3 is a view illustrating an HRTF process.

[0020] FIG. 4 is a view depicting an example of a configuration of a signal processing apparatus.

[0021] FIG. 5 is a flow chart illustrating an output audio signal generation process.

[0022] FIG. 6 is a view depicting an example of a configuration of a decoding processing section.

[0023] FIG. 7 is a flow chart illustrating an object signal generation process.

[0024] FIG. 8 is a view depicting an example of a configuration of a rendering processing section.

[0025] FIG. 9 is a flow chart illustrating a virtual speaker signal generation process.

[0026] FIG. 10 is a flow chart illustrating a gain calculation process.

[0027] FIG. 11 is a flow chart illustrating a smoothing process.

[0028] FIG. 12 is a view depicting an example of metadata.

[0029] FIG. 13 is a view depicting an example of a configuration of a computer.

DESCRIPTION OF EMBODIMENTS

[0030] In the following, embodiments to which the present technology is applied are described with reference to the drawings.

First Embodiment

[0031] The present technology makes it possible to reduce an arithmetic operation amount without causing an error of an output audio signal by omitting at least part of processing during a mute interval or by outputting a predetermined value determined in advance as a value corresponding to an arithmetic operation result without actually performing arithmetic operation during a mute interval. This makes it possible to obtain a high sense of presence while reducing the arithmetic operation amount.

[0032] First, a general process is described which is performed when decoding is performed for a bit stream obtained by encoding using an encoding method of the MPEG-H Part 3: 3D audio standard to generate an output audio signal of an object audio.

[0033] For example, if an input bit stream obtained by encoding is inputted as depicted in FIG. 1, then a decoding process is performed for the input bit stream.

[0034] By the decoding process, an object signal that is an audio signal for reproducing sound of an audio object and metadata including object position information indicative of a position in a space of the audio object are obtained.

[0035] Then, a rendering process for rendering an object signal to virtual speakers virtually arranged in the space on the basis of the object position information included in the metadata is performed such that a virtual speaker signal for reproducing sound to be outputted from each virtual speaker is generated.

[0036] Further, an HRTF process is performed on the basis of the virtual speaker signal for each virtual speaker, and an output audio signal for causing sound to be outputted from a headphone set mounted on the user or a speaker arranged in the actual space is generated.

[0037] If sound is outputted from the actual headphone or speaker on the basis of the output audio signal obtained in such a manner as described above, then audio reproduction can be implemented such that the sound sounds as if it were reproduced from the virtual speaker. It is to be noted that, in the following description, a speaker actually arranged in an actual space is specifically referred to also as an actual speaker.

[0038] When such an object audio as described above is to be reproduced actually, in the case where a great number of actual speakers can be arranged in a space, an output of the rendering process can be reproduced as it is from the actual speakers. In contrast, in the case where a great number of actual speakers cannot be arranged in a space, the HRTF process is performed such that reproduction is performed by a small number of actual speakers such as a headphone or a sound bar. Generally, in most cases, reproduction is performed by a headphone or a small number of actual speakers.

[0039] Here, the general rendering process and HRTF process are further described.

[0040] For example, at the time of rendering, a rendering process of a predetermined method such as VBAP described above is performed. The VBAP is one of rendering methods generally called panning, and a gain is distributed, from among virtual speakers existing on a spherical surface having the origin at a position of a user, to three virtual speakers positioned nearest to an audio object existing on the same spherical surface to perform rendering.

[0041] It is assumed that, for example, as depicted in FIG. 2, a user U11 who is a listener is in a three-dimensional space and three virtual speakers SP1 to SP3 are arranged in front of the user U11.

[0042] Here, it is assumed that a position of the head of the user U11 is determined as an origin O and the virtual speakers SP1 to SP3 are positioned on the surface of a sphere centered at the origin O.

[0043] It is assumed now that an audio object exists in a region TR11 surrounded by the virtual speakers SP1 to SP3 on the spherical surface and a sound image is localized at a position VSP1 of the audio object.

[0044] In such a case as just described, according to the VBAP, a gain regarding the audio object is distributed to the virtual speakers SP1 to SP3 existing around the position VSP1.

[0045] In particular, in a three-dimensional coordinate system whose reference (origin) is the origin O, the position VSP1 is represented by a three-dimensional vector P that starts from the origin O and ends at the position VSP1.

[0046] Further, if three-dimensional vectors starting from the origin O and ending at positions of the virtual speakers SP1 to SP3 are determined as vectors $L_1$ to $L_3$, respectively, then the vector $P$ can be represented by a linear sum of the vectors $L_1$ to $L_3$ as indicated by the following expression (1).

[Math. 1]

$$P = g_1 L_1 + g_2 L_2 + g_3 L_3 \tag{1}$$

[0047] Here, if coefficients $g_1$ to $g_3$ multiplied to the vectors $L_1$ to $L_3$ in the expression (1) are calculated and such coefficients $g_1$ to $g_3$ are determined as gains of sound to be outputted from the virtual speakers SP1 to SP3, respectively, then a sound image can be localized at the position VSP1.

[0048] For example, if a vector having the coefficients $g_1$ to $g_3$ as elements thereof is given as $g_{123}=[g_1, g_2, g_3]$ and a matrix having the vectors $L_1$ to $L_3$ as elements thereof is given as $L_{123}=[L_1, L_2, L_3]$, then the following expression (2) can be obtained by transforming the expression (1) given hereinabove.

[Math. 2]

$$g_{123} = P^{T} L_{123}^{-1} \tag{2}$$

[0049] If sound based on the object signal is outputted from the virtual speakers SP1 to SP3 by using, as gains, the coefficients $g_1$ to $g_3$ obtained by calculation of such an expression (2) as given above, then a sound image can be localized at the position VSP1.

[0050] It is to be noted that, since the arrangement positions of the virtual speakers SP1 to SP3 are fixed and information indicative of the positions of the virtual speakers is already known, $L_{123}^{-1}$ that is an inverse matrix can be determined in advance.
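The gain calculation of expression (2) can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names `invert_3x3` and `vbap_gains`, the row-wise layout of the matrix (one speaker vector per row), and the speaker positions are assumptions made only for illustration.

```python
from typing import List, Sequence

Vec3 = Sequence[float]


def invert_3x3(m: Sequence[Vec3]) -> List[List[float]]:
    """Invert a 3x3 matrix with the adjugate (cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("speaker vectors do not span three dimensions")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def vbap_gains(p: Vec3, l123_inv: Sequence[Vec3]) -> List[float]:
    """Expression (2): g_123 = P^T L_123^{-1} (row vector times matrix)."""
    return [sum(p[r] * l123_inv[r][c] for r in range(3)) for c in range(3)]


# Hypothetical unit vectors L_1, L_2, L_3, stored one per row (assumption).
L123 = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.6, 0.0, 0.8]]

# The speaker layout is fixed, so the inverse is computed only once and
# reused for every audio object and every frame.
L123_INV = invert_3x3(L123)

# Object position P chosen inside the mesh spanned by the three speakers.
gains = vbap_gains([0.8, 0.24, 0.16], L123_INV)
```

Because the inverse is prepared in advance, the per-object cost in this sketch is a single 3-element row vector by 3x3 matrix product.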

[0051] A triangular region TR11 surrounded by three virtual speakers on the spherical surface depicted in FIG. 2 is called mesh. By combining a great number of virtual speakers arranged in a space to configure plural meshes, sound of an audio object can be localized at any position in the space.

[0052] In such a manner, if a gain for the virtual speaker is determined with respect to each audio object, then a virtual speaker signal for each virtual speaker can be obtained by performing arithmetic operation of the following expression (3).

[Math. 3]

$$\begin{bmatrix} SP(0,t) \\ SP(1,t) \\ \vdots \\ SP(M-1,t) \end{bmatrix} = \begin{bmatrix} G(0,0) & G(0,1) & \cdots & G(0,N-1) \\ G(1,0) & G(1,1) & \cdots & G(1,N-1) \\ \vdots & \vdots & \ddots & \vdots \\ G(M-1,0) & G(M-1,1) & \cdots & G(M-1,N-1) \end{bmatrix} \begin{bmatrix} S(0,t) \\ S(1,t) \\ \vdots \\ S(N-1,t) \end{bmatrix} \tag{3}$$

[0053] It is to be noted that, in the expression (3), SP(m,t) indicates a virtual speaker signal at time t of an mth (where, m=0, 1, … , M-1) virtual speaker from among M virtual speakers. Further, in the expression (3), S(n,t) indicates an object signal at time t of an nth (where, n=0, 1, … , N-1) audio object from among N audio objects.

[0054] Further, in the expression (3), G(m,n) indicates a gain to be multiplied to the object signal S(n,t) of the nth audio object for obtaining the virtual speaker signal SP(m,t) regarding the mth virtual speaker. In particular, the gain G(m,n) indicates a gain distributed to the mth virtual speaker regarding the nth audio object calculated in accordance with the expression (2) given hereinabove.

[0055] In the rendering process, calculation of the expression (3) is a process that requires the highest calculation cost. In other words, arithmetic operation of the expression (3) is a process in which the arithmetic operation amount is greatest.
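To make the cost of expression (3) concrete, the following Python sketch applies an M x N gain matrix to N object signals and counts the multiplications involved. The names `apply_gains` and `multiply_count` are hypothetical, and the nested lists stand in for whatever buffers a real renderer would use.

```python
from typing import List, Sequence


def apply_gains(gains: Sequence[Sequence[float]],
                objects: Sequence[Sequence[float]]) -> List[List[float]]:
    """Expression (3): SP(m, t) = sum over n of G(m, n) * S(n, t)."""
    num_samples = len(objects[0]) if objects else 0
    speakers = []
    for row in gains:                          # one row per virtual speaker m
        out = [0.0] * num_samples
        for g, signal in zip(row, objects):    # one column per audio object n
            for t, s in enumerate(signal):
                out[t] += g * s
        speakers.append(out)
    return speakers


def multiply_count(num_speakers: int, num_objects: int, num_samples: int) -> int:
    """Multiplications needed when no term of expression (3) is skipped."""
    return num_speakers * num_objects * num_samples


# Two virtual speakers, two audio objects, two samples per frame (toy sizes).
G = [[1.0, 0.0], [0.5, 0.5]]
S = [[1.0, 2.0], [3.0, 4.0]]
SP = apply_gains(G, S)
```

The multiplication count grows with the product of the number of virtual speakers, the number of audio objects, and the frame length, which is why this step dominates the rendering process.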

[0056] Now, an example of the HRTF process performed in the case where sound based on the virtual speaker signal obtained by the arithmetic operation of the expression (3) is reproduced by a headphone or a small number of actual speakers is described with reference to FIG. 3. It is to be noted that, in FIG. 3, the virtual speakers are arranged on a two-dimensional horizontal plane in order to simplify the description.

[0057] In FIG. 3, five virtual speakers SP11-1 to SP11-5 are arranged side by side on a circular line in a space. In the following description, in the case where there is no necessity to specifically distinguish the virtual speakers SP11-1 to SP11-5 from one another, each of the virtual speakers SP11-1 to SP11-5 is sometimes referred to simply as virtual speaker SP11.

[0058] Further, in FIG. 3, a user U21 who is a listener is positioned at a position surrounded by the five virtual speakers SP11, namely, at a central position of the circular line on which the virtual speakers SP11 are arranged. Accordingly, in the HRTF process, an output audio signal for implementing audio reproduction is generated such that the sound sounds as if the user U21 were listening to the sound outputted from the respective virtual speakers SP11.

[0059] Especially, it is assumed that, in the present example, a listening position is given by the position at which the user U21 is and sound based on the virtual speaker signals obtained by rendering to the five virtual speakers SP11 is reproduced by a headphone.

[0060] In such a case as just described, for example, sound outputted (emitted) from the virtual speaker SP11-1 on the basis of the virtual speaker signal follows a path indicated by an arrow mark Q11 and reaches the eardrum of the left ear of the user U21. Therefore, the characteristic of sound outputted from the virtual speaker SP11-1 should be varied by the spatial transfer characteristic from the virtual speaker SP11-1 to the left ear of the user U21, the shape of the face or the ear of the user U21, the reflection absorption characteristic and so forth.

[0061] Therefore, if a transfer function H_L_SP11 obtained by taking a spatial transfer characteristic from the virtual speaker SP11-1 to the left ear of the user U21, a shape of the face or the ear of the user U21, a reflection absorption characteristic and so forth into account is convoluted into a virtual speaker signal for the virtual speaker SP11-1, then an output audio signal for reproducing sound from the virtual speaker SP11-1 to be heard by the left ear of the user U21 can be obtained.

[0062] Similarly, sound outputted from the virtual speaker SP11-1 on the basis of a virtual speaker signal follows a path indicated by an arrow mark Q12 and reaches the eardrum of the right ear of the user U21. Accordingly, if a transfer function H_R_SP11 obtained by taking a spatial transfer characteristic from the virtual speaker SP11-1 to the right ear of the user U21, a shape of the face or the ear of the user U21, a reflection absorption characteristic and so forth into account is convoluted into a virtual speaker signal for the virtual speaker SP11-1, then an output audio signal for reproducing sound from the virtual speaker SP11-1 to be heard by the right ear of the user U21 can be obtained.

[0063] From the above, when sound based on virtual speaker signals for the five virtual speakers SP11 is finally reproduced by a headphone, it is sufficient if, for the left channel, a transfer function for the left ear for the respective virtual speakers is convoluted into the virtual speaker signals and signals obtained as a result of the convolution are added to form an output audio signal for the left channel.

[0064] Similarly, for the right channel, it is sufficient if a transfer function for the right ear for the respective virtual speakers is convoluted into the virtual speaker signals and signals obtained as a result of the convolution are added to form an output audio signal for the right channel.

[0065] It is to be noted that, also in the case where the device to be used for reproduction is not a headphone but an actual speaker, an HRTF process similar to that in the case of a headphone is performed. However, in this case, since sound from the speaker reaches the left and right ears of the user by spatial propagation, a process that takes crosstalk into consideration is performed as an HRTF process. Such an HRTF process as just described is also called transaural processing.

[0066] Generally, if a frequency-expressed output audio signal for the left ear, namely, for the left channel, is represented by $L(\omega)$ and a frequency-expressed output audio signal for the right ear, namely, for the right channel, is represented by $R(\omega)$, then $L(\omega)$ and $R(\omega)$ can be obtained by calculating the following expression (4).

[Math. 4]

$$\begin{bmatrix} L(\omega) \\ R(\omega) \end{bmatrix} = \begin{bmatrix} H\_L(0,\omega) & H\_L(1,\omega) & \cdots & H\_L(M-1,\omega) \\ H\_R(0,\omega) & H\_R(1,\omega) & \cdots & H\_R(M-1,\omega) \end{bmatrix} \begin{bmatrix} SP(0,\omega) \\ SP(1,\omega) \\ \vdots \\ SP(M-1,\omega) \end{bmatrix} \tag{4}$$

[0067] It is to be noted that, in the expression (4), $\omega$ indicates a frequency, and $SP(m,\omega)$ indicates a virtual speaker signal of the frequency $\omega$ for the mth (where m=0, 1, … , M-1) virtual speaker among M virtual speakers. The virtual speaker signal $SP(m,\omega)$ can be obtained by time frequency conversion of the virtual speaker signal SP(m,t) described hereinabove.

[0068] Further, in the expression (4), $H\_L(m,\omega)$ indicates a transfer function for the left ear that is multiplied to the virtual speaker signal $SP(m,\omega)$ for the mth virtual speaker in order to obtain an output audio signal $L(\omega)$ of the left channel. Similarly, $H\_R(m,\omega)$ indicates a transfer function for the right ear.
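Expression (4) can be read as a per-frequency weighted sum over the virtual speakers. The Python sketch below shows that reading under the assumption that each spectrum is a list of complex bins of equal length. The function name `hrtf_mix` and the toy values are hypothetical, and the time frequency conversion itself is left out.

```python
from typing import List, Sequence, Tuple

Spectrum = Sequence[complex]


def hrtf_mix(h_left: Sequence[Spectrum],
             h_right: Sequence[Spectrum],
             speakers: Sequence[Spectrum]) -> Tuple[List[complex], List[complex]]:
    """Expression (4): L(w) = sum_m H_L(m, w) SP(m, w), and likewise for R(w)."""
    num_bins = len(speakers[0]) if speakers else 0
    left = [0j] * num_bins
    right = [0j] * num_bins
    for hl, hr, sp in zip(h_left, h_right, speakers):   # virtual speaker m
        for k in range(num_bins):                       # frequency bin
            left[k] += hl[k] * sp[k]
            right[k] += hr[k] * sp[k]
    return left, right


# Two virtual speakers and two frequency bins (toy values, not measured HRTFs).
H_L = [[1, 2], [0.5, 0.5]]
H_R = [[0, 1], [1, 1]]
SP_W = [[1 + 1j, 2], [2, 4j]]
L_W, R_W = hrtf_mix(H_L, H_R, SP_W)
```

Each virtual speaker contributes one complex multiplication per bin and per ear, so the work grows linearly with the number of virtual speakers M.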

[0069] In the case where such HRTF transfer function $H\_L(m,\omega)$ and transfer function $H\_R(m,\omega)$ are expressed as impulse responses in the time domain, a length of at least approximately one second is required. Therefore, in the case where, for example, the sampling frequency of the virtual speaker signals is 48 kHz, convolution of 48000 taps must be performed, and even if a high-speed calculation method that uses FFT (Fast Fourier Transform) is used for convolution of the transfer functions, a large arithmetic operation amount is still required.
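As a rough worked figure for the 48000-tap case, assume direct time-domain convolution, a method the text does not specify and which serves here only as a baseline. Every output sample then needs one multiply-accumulate per tap:

$$48000\ \text{taps} \times 48000\ \text{samples/s} = 2.304 \times 10^{9}\ \text{multiply-accumulates per second}$$

This figure is for one virtual speaker and one ear. With M virtual speakers and two ears, the baseline becomes $2M \times 2.304 \times 10^{9}$ multiply-accumulates per second, which illustrates why skipping the convolution for a mute virtual speaker signal is attractive.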

[0070] In the case where a decoding process, a rendering process, and an HRTF process are performed to generate an output audio signal and an object audio is reproduced using a headphone or a small number of actual speakers, a large arithmetic operation amount is required as described above. Further, the arithmetic operation amount increases as the number of audio objects increases.

[0071] Incidentally, although a stereo bit stream includes very few mute intervals, it is generally very rare for an audio object bit stream to include a signal in all intervals of all audio objects.

[0072] In many audio object bit streams, approximately 30% of intervals are mute intervals, and in some cases, 60% of all intervals are mute intervals.

[0073] Therefore, in the present technology, information that a bit stream carries about each audio object is used so that mute intervals can be identified with a small arithmetic operation amount, without calculating the energy of an object signal, and the arithmetic operation amount of the decoding process, the rendering process, and the HRTF process can be reduced during those mute intervals.

[0074] Now, an example of a configuration of a signal processing apparatus to which the present technology is applied is described.

[0075] FIG. 4 is a view depicting an example of a configuration of an embodiment of the signal processing apparatus to which the present technology is applied.

[0076] A signal processing apparatus 11 depicted in FIG. 4 includes a decoding processing section 21, a mute information generation section 22, a rendering processing section 23, and an HRTF processing section 24.

[0077] The decoding processing section 21 receives and decodes an input bit stream transmitted thereto and supplies an object signal and metadata of an audio object obtained as a result of the decoding to the rendering processing section 23.

[0078] Here, the object signal is an audio signal for reproducing sound of the audio object, and the metadata includes at least object position information indicative of a position of the audio object in a space.

[0079] More particularly, at the time of a decoding process, the decoding processing section 21 supplies information regarding a spectrum in each time frame extracted from the input bit stream and the like to the mute information generation section 22 and receives supply of information indicative of a mute or non-mute state from the mute information generation section 22. Then, the decoding processing section 21 performs a decoding process while performing omission or the like of processing of a mute interval on the basis of the information indicative of a mute or non-mute state supplied from the mute information generation section 22.

[0080] The mute information generation section 22 receives supply of various kinds of information from the decoding processing section 21 and the rendering processing section 23, generates information indicative of a mute or non-mute state on the basis of the information supplied thereto, and supplies the information to the decoding processing section 21, the rendering processing section 23, and the HRTF processing section 24.

[0081] The rendering processing section 23 performs transfer of information to and from the mute information generation section 22 and performs a rendering process based on an object signal and metadata supplied from the decoding processing section 21 according to the information indicative of a mute or non-mute state supplied from the mute information generation section 22.

[0082] In the rendering process, processing for a mute interval is omitted or otherwise simplified on the basis of the information indicative of a mute or non-mute state. The rendering processing section 23 supplies a virtual speaker signal obtained by the rendering process to the HRTF processing section 24.

[0083] The HRTF processing section 24 performs an HRTF process on the basis of the virtual speaker signal supplied from the rendering processing section 23 according to the information indicative of a mute or non-mute state supplied from the mute information generation section 22 and outputs an output audio signal obtained as a result of the HRTF process to a later stage. In the HRTF process, a process for a mute interval is omitted on the basis of the information indicative of a mute or non-mute state.

[0084] It is to be noted that an example is described here in which omission or the like of arithmetic operation is performed for a mute signal portion (mute interval) in the decoding process, the rendering process, and the HRTF process. However, it is only necessary that omission or the like of arithmetic operation (process) be performed in at least one of the decoding process, the rendering process, or the HRTF process, and in such a case as well, the arithmetic operation amount can be reduced as a whole.

[0085] Now, operation of the signal processing apparatus 11 depicted in FIG. 4 is described. In particular, an output audio signal generation process by the signal processing apparatus 11 is described below with reference to a flow chart of FIG. 5.

[0086] In step S11, the decoding processing section 21 performs, while performing transmission and reception of information to and from the mute information generation section 22, a decoding process for an input bit stream supplied thereto to generate an object signal and supplies the object signal and metadata to the rendering processing section 23.

[0087] For example, in step S11, the mute information generation section 22 generates spectral mute information indicative of whether or not each time frame (hereinafter referred to sometimes merely as frame) is mute, and the decoding processing section 21 executes a decoding process in which omission or the like of part of processing is performed on the basis of the spectral mute information. Further, in step S11, the mute information generation section 22 generates audio object mute information indicative of whether or not an object signal of each frame is a mute signal and supplies it to the rendering processing section 23.
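The decoding-side omission in step S11 can be sketched in Python along the lines of claim 7, which outputs zero data instead of running the IMDCT for a frame flagged as mute. The sketch makes several assumptions: the mute flag is passed in as a ready-made boolean, because how it is derived from the bit stream is not covered here; `imdct` is a naive, unscaled, unwindowed textbook form rather than the codec's actual transform; and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def imdct(spectrum: Sequence[float]) -> List[float]:
    """Naive unscaled IMDCT: N spectral values in, 2N time samples out."""
    n = len(spectrum)
    return [
        sum(x * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
            for k, x in enumerate(spectrum))
        for t in range(2 * n)
    ]


def decode_frame(spectrum: Sequence[float], frame_is_mute: bool) -> List[float]:
    """Return the time-domain samples of one frame, skipping the IMDCT if mute."""
    if frame_is_mute:
        # Predetermined result: all-zero output, no transform executed.
        return [0.0] * (2 * len(spectrum))
    return imdct(spectrum)


silent = decode_frame([0.0, 0.0, 0.0, 0.0], frame_is_mute=True)
reference = decode_frame([0.0, 0.0, 0.0, 0.0], frame_is_mute=False)
```

For an all-zero spectrum the skipped path and the full transform return the same samples, which is the sense in which the omission introduces no error into the output.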

[0088] In step S12, while the rendering processing section 23 performs transmission and reception of information to and from the mute information generation section 22, it performs a rendering process on the basis of the object signal and the metadata supplied from the decoding processing section 21 to generate a virtual speaker signal and supplies the virtual speaker signal to the HRTF processing section 24.

[0089] For example, in step S12, virtual speaker mute information indicative of whether or not the virtual speaker signal of each frame is a mute signal is generated by the mute information generation section 22. Further, a rendering process is performed on the basis of the audio object mute information and the virtual speaker mute information supplied from the mute information generation section 22. In particular, in the rendering process, processing is omitted during a mute interval.

[0090] In step S13, the HRTF processing section 24 generates an output audio signal by performing an HRTF process by which processing is omitted during a mute interval on the basis of the virtual speaker mute information supplied from the mute information generation section 22 and outputs the output audio signal to a later stage. After the output audio signal is outputted in such a manner, the output audio signal generation process is ended.

[0091] The signal processing apparatus 11 generates spectral mute information, audio object mute information, and virtual speaker mute information as information indicative of a mute or non-mute state in such a manner as described and performs, on the basis of the information, a decoding process, a rendering process, and an HRTF process to generate an output audio signal. Especially here, the spectral mute information, the audio object mute information, and the virtual speaker mute information are generated on the basis of information that can be obtained directly or indirectly from an input bit stream.

[0092] By this, the signal processing apparatus 11 performs omission or the like of processing during a mute interval and can reduce the arithmetic operation amount without impairing the sense of presence. In other words, object audio can be reproduced with high presence while the arithmetic operation amount is reduced.
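The way mute information lets the later stages skip work can be illustrated with a minimal Python sketch. It is not the apparatus itself: the function names, the gain matrix `gains`, the single-ear impulse responses, and the rule that a virtual speaker counts as mute when no sounded object feeds it with a non-zero gain are all assumptions made for illustration.

```python
def render(object_signals, gains, object_mute):
    """Mix object signals into virtual speaker signals, skipping mute objects.

    gains[m][n] is a hypothetical gain of object n for virtual speaker m.
    Returns the speaker signals and a per-speaker mute flag (assumed rule:
    a speaker is mute if no sounded object reaches it with non-zero gain).
    """
    num_speakers = len(gains)
    frame_len = len(object_signals[0])
    speakers = [[0.0] * frame_len for _ in range(num_speakers)]
    speaker_mute = [True] * num_speakers
    for n, signal in enumerate(object_signals):
        if object_mute[n]:
            continue  # mute object: its multiply-adds are omitted
        for m in range(num_speakers):
            g = gains[m][n]
            if g == 0.0:
                continue
            for t in range(frame_len):
                speakers[m][t] += g * signal[t]
            speaker_mute[m] = False
    return speakers, speaker_mute


def convolve(x, h):
    """Plain time-domain convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def hrtf(speakers, speaker_mute, impulse_responses):
    """Sum the convolved speaker signals for one ear, skipping mute speakers."""
    out = [0.0] * (len(speakers[0]) + len(impulse_responses[0]) - 1)
    for m, signal in enumerate(speakers):
        if speaker_mute[m]:
            continue  # convolution with the transfer function is omitted
        for t, v in enumerate(convolve(signal, impulse_responses[m])):
            out[t] += v
    return out
```

In this sketch, running both stages with every flag forced to "sounded" gives the same output samples as running them with the mute flags set, which is the property the text relies on: only arithmetic on zero data is dropped.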

[0093] Here, the decoding process, the rendering process, and the HRTF process are described in more detail.

[0094] For example, the decoding processing section 21 is configured in such a manner as depicted in FIG. 6.

[0095] In the example depicted in FIG. 6, the decoding processing section 21 includes a demultiplexing section 51, a sub information decoding section 52, a spectral decoding section 53, and an IMDCT (Inverse Modified Discrete Cosine Transform) processing section 54.

[0096] The demultiplexing section 51 demultiplexes an input bit stream supplied thereto to extract (separate) audio object data and metadata from the input bit stream, and supplies the obtained audio object data to the sub information decoding section 52 and supplies the metadata to the rendering processing section 23.

[0097] Here, the audio object data is data for obtaining an object signal and includes sub information and spectral data.

[0098] In the present embodiment, on the encoding side, namely, on the generation side of an input bit stream, MDCT (Modified Discrete Cosine Transform) is performed for an object signal that is a time signal, and an MDCT coefficient obtained as a result of the MDCT is spectral data that is a frequency component of the object signal.

[0099] Further, on the encoding side, encoding of spectral data is performed by a context-based arithmetic encoding method. Then, the encoded spectral data and encoded sub information that is required for decoding of the spectral data are placed as audio object data into an input bit stream.

[0100] Further, as described hereinabove, the metadata includes at least object position information that is spatial position information indicative of a position of an audio object in a space.

[0101] It is to be noted that, generally, metadata is also encoded (compressed) frequently. However, since the present technology can be applied to metadata irrespective of whether or not the metadata is in an encoded state, namely, whether or not the metadata is in a compressed state, the description is continued here assuming that the metadata is not in an encoded state in order to simplify the description.

[0102] The sub information decoding section 52 decodes sub information included in audio object data supplied from the demultiplexing section 51 and supplies the decoded sub information and spectral data included in the audio object data supplied thereto to the spectral decoding section 53.

[0103] In other words, the audio object data including the decoded sub information and the spectral data in an encoded state is supplied to the spectral decoding section 53. Especially here, data other than spectral data from within data included in audio object data of each audio object included in a general input bit stream is the sub information.

[0104] Further, the sub information decoding section 52 supplies max_sfb that is information regarding a spectrum of each frame from within the sub information obtained by the decoding to the mute information generation section 22.

[0105] For example, the sub information includes information required for an IMDCT process or decoding of spectral data such as information indicative of a type of a transform window selected at the time of MDCT processing for an object signal and the number of scale factor bands with which encoding of spectral data has been performed.

[0106] In the MPEG-H Part 3:3D audio standard, in ics_info( ), max_sfb is encoded with 4 bits or 6 bits depending on the type of transform window selected at the time of MDCT processing, namely, depending on the window sequence. This max_sfb is information indicative of a quantity of encoded spectral data, namely, information indicative of the number of scale factor bands with which encoding of spectral data has been performed. In other words, the audio object data includes spectral data by an amount corresponding to the number of scale factor bands indicated by max_sfb.

[0107] For example, in the case where the value of max_sfb is 0, there is no encoded spectral data, and since all of spectral data in the frame are regarded as 0, the frame can be determined as a mute frame (mute interval).

[0108] The mute information generation section 22 generates spectral mute information of each audio object for each frame on the basis of max_sfb of each audio object for each frame supplied from the sub information decoding section 52 and supplies the spectral mute information to the spectral decoding section 53 and the IMDCT processing section 54.

[0109] Especially here, in the case where the value of max_sfb is 0, spectral mute information is generated which indicates that the target frame is a mute interval, namely, that the object signal is a mute signal. In contrast, in the case where the value of max_sfb is not 0, spectral mute information indicating that the target frame is a sounded interval, namely, that the object signal is a sounded signal, is generated.

[0110] For example, in the case where the value of the spectral mute information is 1, this indicates that the target frame is a mute interval, but in the case where the value of the spectral mute information is 0, this indicates that the target frame is a sounded interval, namely, that the target frame is not a mute interval.

[0111] In such a manner, the mute information generation section 22 performs detection of a mute interval (mute frame) on the basis of max_sfb that is sub information and generates spectral mute information indicative of a result of the detection. This makes it possible to specify a mute frame with a very small processing amount (arithmetic operation amount) with which it is decided whether or not max_sfb extracted from an input bit stream is 0 without the necessity for calculation for obtaining energy of the object signal.
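The decision described above amounts to a single comparison per frame. A minimal Python sketch follows; the function name and the sample `max_sfb` values are hypothetical, while the 1 = mute, 0 = sounded convention follows the text.

```python
def spectral_mute_info(max_sfb: int) -> int:
    """Return 1 (mute interval) when max_sfb is 0, otherwise 0 (sounded interval)."""
    return 1 if max_sfb == 0 else 0


# Hypothetical max_sfb values of one audio object over four frames.
max_sfb_per_frame = [12, 0, 0, 40]
flags = [spectral_mute_info(v) for v in max_sfb_per_frame]
```

No signal energy is computed here; the flag is derived only from a value that has already been extracted from the bit stream.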

[0112] It is to be noted that, for example, “U.S. Pat. No. 9,905,232 B2, Hatanaka et al.” proposes an encoding method that does not use max_sfb and separately adds, in the case where a certain channel can be deemed mute, a flag such that encoding is not performed for the channel.

[0113] According to the encoding method, the encoding efficiency can be improved by 30 to 40 bits per channel from that by encoding according to the MPEG-H Part 3:3D audio standard, and in the present technology, such an encoding method as just described may also be applied. In such a case as just described, the sub information decoding section 52 extracts a flag that is included as sub information and indicates whether or not a frame of an audio object can be deemed mute, namely, whether or not encoding of spectral data has been performed, and supplies the flag to the mute information generation section 22. Then, the mute information generation section 22 generates spectral mute information on the basis of the flag supplied from the sub information decoding section 52.

[0114] Further, in the case where increase of the arithmetic operation amount at the time of decoding processing is permissible, the mute information generation section 22 may calculate the energy of spectral data to decide whether or not the frame is a mute frame and generate spectral mute information according to a result of the decision.
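For comparison, the energy-based alternative could look like the following Python sketch. The function names and the `threshold` parameter are assumptions; the text does not state what criterion is applied to the energy.

```python
def spectral_energy(mdct_coefficients):
    """Sum of squared MDCT coefficients of one frame."""
    return sum(c * c for c in mdct_coefficients)


def spectral_mute_info_from_energy(mdct_coefficients, threshold=0.0):
    """Return 1 (mute) when the frame energy does not exceed the assumed threshold."""
    return 1 if spectral_energy(mdct_coefficients) <= threshold else 0
```

Unlike the max_sfb check, this costs one multiply-add per coefficient, which is why the text presents it only for the case where a higher decoding load is acceptable.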

[0115] The spectral decoding section 53 decodes spectral data supplied from the sub information decoding section 52 on the basis of sub information supplied from the sub information decoding section 52 and spectral mute information supplied from the mute information generation section 22. Here, the spectral decoding section 53 performs decoding of the spectral data by a decoding method corresponding to the context-based arithmetic encoding method.

[0116] For example, according to the MPEG-H Part 3:3D audio standard, context-based arithmetic encoding is performed for spectral data.

[0117] Generally, in arithmetic encoding, one piece of output encoded data does not correspond to one piece of input data; rather, the final output encoded data is obtained through transitions over a plurality of pieces of input data.

[0118] For example, in non-context-based arithmetic encoding, since the appearance frequency table to be used for encoding of input data becomes huge or plural appearance frequency tables are switchably used, it is necessary to encode an ID representative of an appearance frequency table and transmit the ID to the decoding side separately.

[0119] In contrast, in context-based arithmetic encoding, a characteristic (contents) of the frame preceding the spectral data of interest or a characteristic of spectral data of a frequency lower than the frequency of the spectral data of interest is obtained by calculation as a context. Then, an appearance frequency table to be used is automatically determined on the basis of a calculation result of the context.

[0120] Therefore, in the context-based arithmetic encoding, although also the decoding side must always perform calculation of the context, there are advantages that the appearance frequency table can be made compact and besides that the ID of the appearance frequency table need not be transmitted to the decoding side.

[0121] For example, in the case where the value of the spectral mute information supplied from the mute information generation section 22 is 0 and the frame of the processing target is a sounded interval, the spectral decoding section 53 performs calculation of a context suitably using sub information supplied from the sub information decoding section 52 and a result of decoding of other spectral data.

[0122] Then, the spectral decoding section 53 selects an appearance frequency table indicated by a value determined with respect to a result of the calculation of a context, namely, by the ID, and uses the appearance frequency table to decode the spectral data. The spectral decoding section 53 supplies the decoded spectral data and the sub information to the IMDCT processing section 54.

[0123] In contrast, in the case where the spectral mute information is 1 and the frame of the processing target is a mute interval (interval of a mute signal), namely, in the case where the value of max_sfb described hereinabove is 0, since the spectral data in this frame is 0 (zero data), the ID indicative of an appearance frequency table obtained by the context calculation indicates a same value without fail. In other words, the same appearance frequency table is selected without fail.

[0124] Therefore, in the case where the value of the spectral mute information is 1, the spectral decoding section 53 does not perform context calculation, but selects an appearance frequency table indicated by an ID of a specific value determined in advance and uses the appearance frequency table to decode spectral data. In this case, for spectral data determined as data of a mute signal, context calculation is not performed. Then, the ID of the specific value determined in advance as a value corresponding to a calculation result of a context, namely, as a value indicative of a calculation result of a context, is used as an output to select an appearance frequency table, and a subsequent process for decoding is performed.

[0125] By not performing calculation of a context according to spectral mute information in such a manner, namely, by omitting calculation of a context and outputting a value determined in advance as a value indicative of a calculation result, the arithmetic operation amount of processing at the time of decoding can be reduced. Besides, in this case, the decoding result of spectral data is exactly the same as that obtained when the calculation of a context is not omitted.
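The shortcut can be sketched in Python with a deliberately simplified context. The context function below is a toy, not the one defined in the standard: it looks only at the two nearest lower-frequency values, and `MUTE_TABLE_ID`, `toy_context_id`, and `select_table_ids` are hypothetical names. The arithmetic decoder itself is left out; `spectrum` stands for the values as they become available.

```python
MUTE_TABLE_ID = 0  # assumed ID that the toy context yields for all-zero input


def toy_context_id(lower_neighbours):
    """Toy context: summed magnitude of the neighbours, capped at 3."""
    return min(sum(abs(v) for v in lower_neighbours), 3)


def select_table_ids(spectrum, spectral_mute):
    """Return the table ID chosen per coefficient and the number of context calculations."""
    ids, context_calcs = [], 0
    for k in range(len(spectrum)):
        if spectral_mute == 1:
            ids.append(MUTE_TABLE_ID)  # predetermined value, no context calculation
        else:
            ids.append(toy_context_id(spectrum[max(0, k - 2):k]))
            context_calcs += 1
    return ids, context_calcs
```

For an all-zero frame, both paths select the same table for every coefficient, but the mute path performs no context calculation at all.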

[0126] The IMDCT processing section 54 performs IMDCT (inverse modified discrete cosine transform) on the basis of spectral data and sub information supplied from the spectral decoding section 53 according to the spectral mute information supplied from the mute information generation section 22 and supplies an object signal obtained as a result of the IMDCT to the rendering processing section 23.

[0127] For example, in the IMDCT, processing is performed in accordance with an expression described in “INTERNATIONAL STANDARD ISO/IEC 23008-3 First edition 2015-10-15 Information technology–High efficiency coding and media delivery in heterogeneous environments–Part 3: 3D audio.”

[0128] In the case where the value of max_sfb is 0 and the frame of the target is a mute interval, all of the values of samples of a time signal of an output (processing result) of the IMDCT are 0. That is, the signal obtained by the IMDCT is zero data.

[0129] Therefore, in the case where the value of the spectral mute information supplied from the mute information generation section 22 is 1 and the target frame is a mute interval (interval of a mute signal), the IMDCT processing section 54 outputs zero data without performing IMDCT processing for the spectral data.

[0130] In particular, IMDCT processing is not performed actually, and zero data is outputted as a result of the IMDCT processing. In other words, as a value indicative of a processing result of the IMDCT, “0” (zero data) that is a value determined in advance is outputted.

[0131] More particularly, the IMDCT processing section 54 overlap-synthesizes a time signal obtained as a processing result of the IMDCT of the current frame of the processing target and a time signal obtained as a processing result of the IMDCT of the frame immediately preceding the current frame to generate an object signal of the current frame and outputs the object signal.

[0132] The IMDCT processing section 54 can reduce the overall arithmetic operation amount of the IMDCT without giving rise to any error of the object signal obtained as an output by omitting the IMDCT processing during a mute interval. In other words, while the overall arithmetic operation amount of the IMDCT is reduced, an object signal quite same as that in the case where the IMDCT processing is not omitted can be obtained.
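As a rough illustration of why skipping the transform changes nothing, the following Python sketch uses a direct, unwindowed IMDCT. The 2/N scaling, the absence of a window function, and the function names are simplifying assumptions; the expression actually used is the one in the standard cited above.

```python
import math


def imdct(spectrum):
    """Direct IMDCT: N/2 coefficients in, N time samples out (no window applied)."""
    half = len(spectrum)
    n_total = 2 * half
    n0 = (half + 1) / 2
    return [
        (2.0 / n_total) * sum(
            x * math.cos(2 * math.pi / n_total * (n + n0) * (k + 0.5))
            for k, x in enumerate(spectrum)
        )
        for n in range(n_total)
    ]


def imdct_or_zero(spectrum, spectral_mute):
    """Return (time signal, mute frame information); zero data is output directly for a mute frame."""
    if spectral_mute == 1:
        return [0.0] * (2 * len(spectrum)), True  # transform is not executed
    return imdct(spectrum), False


def overlap_add(current, previous):
    """Object signal of the current frame: first half of current plus second half of previous."""
    half = len(current) // 2
    return [current[n] + previous[half + n] for n in range(half)]
```

Note that when a sounded frame precedes a mute frame, `overlap_add` still yields non-zero samples for the mute frame, since the second half of the preceding IMDCT output carries over into it.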

[0133] Generally, in the MPEG-H Part 3:3D audio standard, since decoding of spectral data and IMDCT processing in a decoding process of an audio object occupy most of the decoding process, that the IMDCT processing can be reduced leads to significant reduction of the arithmetic operation amount.

[0134] Further, the IMDCT processing section 54 supplies mute frame information indicative of whether or not a time signal of the current frame obtained as a processing result of the IMDCT is zero data, that is, whether or not the time signal is a signal of a mute interval, to the mute information generation section 22.

[0135] Consequently, the mute information generation section 22 generates audio object mute information on the basis of mute frame information of the current frame of the processing target and mute frame information of a frame immediately preceding in time to the current frame supplied from the IMDCT processing section 54 and supplies the audio object mute information to the rendering processing section 23. In other words, the mute information generation section 22 generates audio object mute information on the basis of mute frame information obtained as a result of the decoding process.

[0136] Here, in the case where both the mute frame information of the current frame and the mute frame information of the immediately preceding frame are information that they are signals during a mute interval, the mute information generation section 22 generates audio object mute information representing that the object signal of the current frame is a mute signal.

[0137] In contrast, in the case where at least either one of the mute frame information of the current frame or the mute frame information of the immediately preceding frame is information that it is not a signal during a mute interval, the mute information generation section 22 generates audio object mute information representing that the object signal of the current frame is a sounded signal.

[0138] Especially, in this example, in the case where the audio object mute information is 1, it is determined that this indicates that the object signal of the current frame is a mute signal, and in the case where the audio object mute information is 0, it is determined that this indicates that the object signal is a sounded signal, namely, is not a mute signal.

[0139] As described hereinabove, the IMDCT processing section 54 generates an object signal of a current frame by overlapping synthesis with a time signal obtained as a processing result of the IMDCT of an immediately preceding frame. Accordingly, since the object signal of the current frame is influenced by the immediately preceding frame, at the time of generation of audio object mute information, it is necessary to take a result of the overlapping synthesis, namely, a processing result of the IMDCT of the immediately preceding frame, into account.

[0140] Therefore, only in the case where the value of max_sfb is 0 in both the current frame and the immediately preceding frame, namely, only in the case where zero data is obtained as a processing result of the IMDCT for both frames, the mute information generation section 22 determines that the object signal of the current frame is a signal of a mute interval.

[0141] By generating audio object mute information indicative of whether or not the object signal is mute taking the IMDCT processing into consideration in such a manner, the rendering processing section 23 at the later stage can correctly recognize whether the object signal of the frame of the processing target is mute.
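The two-frame rule can be written as a short Python sketch. The function name is hypothetical, and treating the frame before the first one as sounded is an assumption, since the text does not say how the start of the stream is handled.

```python
def audio_object_mute_info(mute_frame_info):
    """Map per-frame mute frame information (True = zero data) to audio object mute flags."""
    flags = []
    previous_mute = False  # assumption: the frame before the first one counts as sounded
    for current_mute in mute_frame_info:
        # Mute only if both the current and the immediately preceding frame are zero data.
        flags.append(1 if (current_mute and previous_mute) else 0)
        previous_mute = current_mute
    return flags
```

For example, with mute frame information `[False, True, True, False, True]`, only the third frame is flagged as mute: the second frame still carries the overlap from the sounded first frame, and the fifth still carries the overlap from the sounded fourth.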

[0142] Now, the process in step S11 in the output audio signal generation process described with reference to FIG. 5 is described in more detail. In particular, the object signal generation process that corresponds to step S11 of FIG. 5 and is performed by the decoding processing section 21 and the mute information generation section 22 is described below with reference to a flow chart of FIG. 7.

[0143] In step S41, the demultiplexing section 51 demultiplexes the input bit stream supplied thereto and supplies audio object data and metadata obtained as a result of the demultiplexing to the sub information decoding section 52 and the rendering processing section 23, respectively.

[0144] In step S42, the sub information decoding section 52 decodes sub information included in the audio object data supplied from the demultiplexing section 51 and supplies the sub information after the decoding and spectral data included in the audio object data supplied thereto to the spectral decoding section 53. Further, the sub information decoding section 52 supplies max_sfb included in the sub information to the mute information generation section 22.

[0145] In step S43, the mute information generation section 22 generates spectral mute information on the basis of max_sfb supplied thereto from the sub information decoding section 52 and supplies the spectral mute information to the spectral decoding section 53 and the IMDCT processing section 54. For example, in the case where the value of max_sfb is 0, spectral mute information whose value is 1 is generated, but in the case where the value of max_sfb is not 0, spectral mute information whose value is 0 is generated.

……
……
……

Article link: https://patent.nweon.com/21018
