Apple Patent | Audio bandwidth reduction
Patent: Audio bandwidth reduction
Publication Number: 20210035597
Publication Date: 2021-02-04
Applicant: Apple
Abstract
A first device obtains, from a microphone array, several audio signals and processes the audio signals to produce a speech signal and one or more ambient signals. The first device processes the ambient signals to produce a sound-object sonic descriptor that has metadata describing a sound object within an acoustic environment. The first device transmits, over a communication data link, the speech signal and the descriptor to a second electronic device that is configured to spatially reproduce the sound object using the descriptor mixed with the speech signal, to produce several mixed signals to drive several speakers.
Claims
-
A method comprising: obtaining, from a microphone array of a first electronic device, a plurality of audio signals; processing the plurality of audio signals to produce a speech signal and one or more ambient signals that contain ambient sound from an acoustic environment in which the first electronic device is located; processing the ambient signals to produce a sound-object sonic descriptor that has metadata that describes a sound object within the acoustic environment; and transmitting, over a communication data link, the speech signal and the sound-object sonic descriptor to a second electronic device that is configured to spatially reproduce the sound object using the sound-object sonic descriptor mixed with the speech signal, to produce a plurality of mixed signals to drive a plurality of speakers.
-
The method of claim 1, wherein processing the ambient signals to produce the sound-object sonic descriptor comprises identifying a sound source within the acoustic environment, the sound source being associated with the sound object, and producing spatial sound-source data that spatially represents the sound source with respect to the first electronic device.
-
The method of claim 2, wherein the spatial sound-source data parametrically represents the sound source as a high order ambisonic (HOA) format of the sound source.
-
The method of claim 2, wherein the spatial sound-source data comprises an audio signal and position data that indicates the position of the sound source with respect to the first electronic device.
-
The method of claim 4, wherein the audio signal comprises a directional beam pattern that includes the sound source.
-
The method of claim 2, further comprising processing the spatial sound-source data to determine a distributed numerical representation of the sound object, wherein the metadata comprises the numerical representation of the sound object.
-
The method of claim 2, further comprising identifying the sound object by performing a table lookup into a sound library that has one or more entries, each entry being for a corresponding predefined sound object, using the spatial sound-source data to identify the sound object as a matching predefined sound object contained therein.
-
The method of claim 7, wherein at least some of the entries comprise metadata that describes sound characteristics of the corresponding predefined sound object, wherein performing the table lookup into the sound library comprises comparing sound characteristics of the spatial sound-source data with the sound characteristics of the at least some of the entries in the sound library and selecting the predefined sound object with matching sound characteristics.
-
The method of claim 7, further comprising capturing image data using a camera of the audio source device; performing an object recognition algorithm upon the image data to identify an object contained therein, wherein at least some of the entries in the sound library comprise metadata that describes physical characteristics of the corresponding predefined sound object, wherein performing the table lookup into the sound library comprises comparing physical characteristics of the identified object with the physical characteristics of the at least some of the entries in the sound library and selecting the predefined sound object with matching physical characteristics.
-
The method of claim 7, wherein each entry of the sound library includes metadata corresponding to a predefined sound object, wherein the metadata of each entry comprises at least an index identifier for a corresponding sound object of the entry, wherein producing the sound-object sonic descriptor comprises finding the matching predefined sound object; and adding the index identifier that corresponds to the matching predefined sound object to the sound-object sonic descriptor.
-
The method of claim 10, wherein producing the sound-object sonic descriptor comprises determining position data that indicates a position of the sound object within the acoustic environment and loudness data that indicates a sound level of the sound object at the microphone array from the spatial sound-source data and adding the position data and the loudness data to the sonic descriptor.
-
The method of claim 7, wherein, in response to determining that the sound library does not include the matching predefined sound object, the method further comprises creating an index identifier for uniquely identifying the sound object; and creating an entry into the sound library for the sound object that includes the created index identifier.
-
The method of claim 12, wherein the spatial sound-source data comprises an audio signal of the sound object, wherein the sound object sonic descriptor further comprises the audio signal of the sound object, wherein upon receiving the sound-object sonic descriptor the second electronic device is configured to store the audio signal and the index identifier in a new entry in a local sound library.
-
The method of claim 1, wherein the first electronic device is a head-mounted device (HMD).
-
A method comprising: obtaining, from a microphone array of an audio source device, a plurality of audio signals; processing the plurality of audio signals to produce a speech signal and one or more ambient signals; identifying, from the ambient signals, a background or diffuse ambient sound as part of a sound bed that is associated with an acoustic environment in which the audio source device is located; producing a sound-bed sonic descriptor that has metadata describing the sound bed, wherein the metadata includes 1) an index identifier that uniquely identifies the background or diffuse ambient sound and 2) loudness data that indicates a sound level of the background or diffuse ambient sound at the microphone array; and transmitting, over a communication data link, the speech signal and the sound-bed sonic descriptor to an audio receiver device that is configured to spatially reproduce the sound bed that includes the background or diffuse ambient sound using the sound-bed sonic descriptor as a plurality of audio signals that are mixed with the speech signal, to produce a plurality of mixed signals to drive a plurality of speakers.
-
The method of claim 15, wherein identifying the background or diffuse ambient sound comprises identifying a sound source within the acoustic environment; and determining that the sound source produces sound within the environment at least two times within a threshold period of time.
-
The method of claim 16, wherein the audio receiver device is configured to periodically use the plurality of audio signals to drive the plurality of speakers, subsequent to driving the plurality of speakers with the plurality of mixed signals.
-
The method of claim 17, wherein the audio receiver device periodically uses the plurality of audio signals to drive the plurality of speakers according to a predefined period of time.
-
The method of claim 15 further comprising determining bandwidth or an available throughput of the communication data link for transmitting data from the audio source device to the audio receiver device.
-
The method of claim 19, wherein in response to the bandwidth or the available throughput being below a first threshold, preventing the audio source device from transmitting future sound-bed sonic descriptors, while continuing to transmit the speech signal to the audio receiver device.
-
The method of claim 20 further comprising using the speech signal to produce a phoneme sonic descriptor that represents the speech signal as phoneme data, wherein in response to the bandwidth or available throughput being below a second threshold that is below the first threshold, transmitting the phoneme sonic descriptor in lieu of the speech signal.
-
A method comprising: obtaining, from a microphone array of a first electronic device, a plurality of audio signals that contains sound from an acoustic environment in which the first electronic device is located; processing at least some of the plurality of audio signals to produce a sound-object sonic descriptor that has metadata that describes a sound object within the acoustic environment, wherein the metadata comprises 1) an index identifier that uniquely identifies the sound object, 2) position data that indicates a position of the sound object within the acoustic environment, 3) loudness data that indicates a sound level of the sound object at the microphone array; and transmitting, over a communication data link, the sound-object sonic descriptor to a second electronic device that is configured to spatially reproduce the sound object using the sound-object sonic descriptor to produce a plurality of binaural audio signals to drive a plurality of speakers.
-
The method of claim 22, wherein processing the at least some of the plurality of audio signals comprises identifying a sound source within the acoustic environment, the sound source being associated with the sound object; and producing spatial sound-source data that spatially represents the sound source with respect to the first electronic device.
-
The method of claim 23 further comprising identifying the spatial sound-source data as the sound object by performing a table lookup into a sound library that has one or more entries, each entry being for a corresponding predefined sound object, using the spatial sound-source data to identify the sound object as a matching predefined sound object contained therein.
-
The method of claim 24, wherein at least some of the entries comprise metadata that describes sound characteristics of the corresponding predefined sound object, wherein performing the table lookup into the sound library comprises comparing sound characteristics of the spatial sound-source data with the sound characteristics of the at least some of the entries in the sound library and selecting the predefined sound object with matching sound characteristics.
-
The method of claim 25, wherein the index identifier is a first index identifier, wherein the method further comprises processing at least some of the plurality of audio signals to produce a sound-bed sonic descriptor that has metadata describing a sound bed of the acoustic environment, wherein the metadata includes 1) a second index identifier that uniquely identifies the sound bed and 2) loudness data that indicates a sound level of the sound bed at the microphone array; and transmitting, over the communication data link, the sound-bed sonic descriptor to the second electronic device that is configured to spatially reproduce the sound bed as a plurality of audio signals that are mixed with the binaural signals to produce a plurality of mixed signals to drive the plurality of speakers.
-
The method of claim 22 further comprising: processing at least some of the plurality of audio signals to produce a speech signal that contains speech of a user of the first electronic device; and transmitting, over the communication data link, the speech signal to the second electronic device that is configured to mix the speech signal with the plurality of binaural signals to produce a plurality of mixed signals to drive the plurality of speakers.
Description
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of U.S. Provisional Application No. 62/880,559 filed on Jul. 30, 2019, the disclosure of which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
[0002] An aspect of the disclosure relates to an electronic device that performs bandwidth-reduction operations to reduce an amount of data to be transmitted to another electronic device over a computer network.
BACKGROUND
[0003] Headphones are an audio device that includes a pair of speakers, each of which is placed on top of a user’s ear when the headphones are worn on or around the user’s head. Similar to headphones, earphones (or in-ear headphones) are two separate audio devices, each having a speaker that is inserted into the user’s ear. Headphones and earphones are normally wired to a separate playback device, such as a digital audio player, that drives each of the speakers of the devices with an audio signal in order to produce sound (e.g., music). Headphones and earphones provide a convenient method by which the user can individually listen to audio content without having to broadcast the audio content to others who are nearby.
SUMMARY
[0004] An aspect of the disclosure is a system that performs bandwidth-reduction operations to reduce an amount of audio data that is transmitted between two electronic devices (e.g., an audio source device and an audio receiver device) that are engaged in a communication session (e.g., a Voice Over IP (VoIP) phone call). For instance, both devices may engage in the session via a wireless communication data link (e.g., over a wireless network, such as a local area network (LAN)), whose bandwidth or available throughput may vary depending on several factors. For instance, the bandwidth may vary depending on how many other devices are wirelessly communicating over the wireless network and the distance between the source device and a wireless access point (or wireless router). The present disclosure provides a system for reducing an amount of bandwidth required to conduct a communication session by reducing an amount of audio data that is exchanged between both devices. The system includes an audio source device and an audio receiver device, both of which may be head-mounted devices (HMDs) that are communicating over a computer network (e.g., the Internet). The source device obtains several microphone audio signals that are captured by a microphone array of the device. The source device processes the audio signals to separate a speech signal (e.g., that contains speech of a user of the source device) from one or more ambient signals that contain ambient sound from an acoustic environment in which the source device is located. The source device processes the audio signals to produce a sound-object sonic descriptor that has metadata describing one or more sound objects within the acoustic environment, such as a dog bark or a helicopter flying in the air. The metadata may include an index identifier that uniquely identifies the sound object as a member or entry within a sound library that is previously known to the source device and/or the receiver device. 
The metadata may also include position data that indicates the position of the sound object (e.g., the dog bark is to the left of the source device) and loudness data that indicates a sound level of the sound object at the microphone array. The source device transmits the sonic descriptor, which has a reduced file size relative to audio data that may be associated with the sound object, and the speech signal to the audio receiver device. The receiver device uses the sonic descriptor to spatially reproduce the sound object, and mixes the reproduced sound object with the speech signal to produce several mixed signals to drive several speakers.
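The metadata fields described above (a library index identifier, position data, and loudness data) can be sketched as a small data structure. The field names and JSON encoding below are illustrative assumptions, not taken from the patent, but they show why a descriptor is far smaller than the audio it stands in for:

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical descriptor layout; field names are illustrative, not from the patent.
@dataclass
class SoundObjectDescriptor:
    index_id: int      # uniquely identifies the sound object in a shared sound library
    position: tuple    # (azimuth_deg, elevation_deg, distance_m) relative to the source device
    loudness_db: float # sound level of the sound object at the microphone array

desc = SoundObjectDescriptor(index_id=7, position=(-90.0, 0.0, 3.0), loudness_db=62.5)
payload = json.dumps(asdict(desc)).encode("utf-8")

# One second of raw 48 kHz / 16-bit mono audio, for comparison.
raw_audio_bytes = 48_000 * 2  # the descriptor payload is orders of magnitude smaller
```

Transmitting tens of bytes of metadata per sound object, rather than kilobytes of audio per second, is the source of the bandwidth reduction.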
[0005] In one aspect, the system uses the metadata of the sonic descriptor to produce a reproduction of the sound object that includes an audio signal and position data that indicates a position of a virtual sound source of the sound object. For instance, the receiver device may use the index identifier to perform a table lookup into the sound library, which has one or more entries of predefined sound objects, each entry having a corresponding unique identifier, and select the predefined sound object whose identifier matches. Upon identifying the predefined sound object, the receiver device retrieves the sound object, which includes an audio signal stored within the sound library. The receiver device spatially renders the audio signal according to the position data to produce several binaural audio signals, which are mixed with the speech signal to drive the several speakers.
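A minimal sketch of this receiver-side lookup and rendering, assuming the library is a dictionary keyed by index identifier; simple constant-power stereo panning stands in here for the full binaural rendering the patent describes, and the library contents are placeholders:

```python
import math

# Hypothetical receiver-side sound library, keyed by the descriptor's index identifier.
sound_library = {
    7: {"label": "dog_bark", "audio": [0.0, 0.4, -0.3, 0.1]},  # placeholder samples
}

def render_sound_object(library, index_id, azimuth_deg):
    """Look up the descriptor's audio and pan it into left/right channels.
    Constant-power panning is a stand-in for binaural rendering."""
    entry = library.get(index_id)
    if entry is None:
        return None  # unknown identifier: no matching predefined sound object
    # Map azimuth in [-90, +90] degrees onto a pan angle in [0, pi/2].
    pan = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    left_gain, right_gain = math.cos(pan), math.sin(pan)
    left = [s * left_gain for s in entry["audio"]]
    right = [s * right_gain for s in entry["audio"]]
    return left, right

channels = render_sound_object(sound_library, 7, azimuth_deg=-90.0)  # source hard left
```

The two rendered channels would then be mixed with the received speech signal before driving the speakers.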
[0006] In one aspect, the system may produce other sonic descriptors that describe other types of sounds. For example, the system may produce a sound-bed sonic descriptor that describes an ambient or diffuse background noise or sound that is part of a sound bed of the environment. As another example, the system may produce a phoneme sonic descriptor that includes phoneme data that may be a textual representation of the speech signal. Each of these sonic descriptors, including the sound-object sonic descriptor, may have a smaller file size than corresponding audio signals that contain similar sounds. As a result, the system may transmit any combination of the sonic descriptors in lieu of the actual audio signals based on the bandwidth or available throughput. For instance, if the bandwidth or available throughput is limited, the audio source device may transmit the phoneme sonic descriptor instead of the speech signal, which would otherwise require more bandwidth. The audio receiver device may synthesize a speech signal based on the phoneme sonic descriptor for output through at least one speaker, in lieu of the speech signal that is produced by the audio source device.
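The fallback order described here (and in claims 20-21) can be sketched as a simple threshold check on the source device. The numeric thresholds below are illustrative assumptions; the patent does not specify values:

```python
# Illustrative thresholds; the patent defines only their ordering, not values.
FIRST_THRESHOLD_KBPS = 256.0
SECOND_THRESHOLD_KBPS = 64.0  # by definition, below the first threshold

def select_uplink_payload(throughput_kbps):
    """Choose what the source device transmits as available throughput drops:
    first stop sending sound-bed descriptors, then replace the speech signal
    with a phoneme sonic descriptor."""
    if throughput_kbps >= FIRST_THRESHOLD_KBPS:
        return ["speech_signal", "sound_object_descriptors", "sound_bed_descriptor"]
    if throughput_kbps >= SECOND_THRESHOLD_KBPS:
        # Below the first threshold: suppress sound-bed descriptors, keep speech.
        return ["speech_signal", "sound_object_descriptors"]
    # Below the second threshold: send phoneme data in lieu of the speech signal.
    return ["phoneme_descriptor", "sound_object_descriptors"]
```

On the receiver side, a phoneme descriptor would be passed to a speech synthesizer rather than played back directly.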
[0007] In one aspect, the system may update or build a sound library when an existing sound library does not include an entry that corresponds to an identified sound object. For instance, upon identifying a sound object within the acoustic environment, the audio source device may perform a table lookup into the existing sound library to determine whether the library includes a matching predefined sound object. If there is no matching predefined sound object, the source device may create an entry within the sound library, assigning metadata that is associated with the identified sound object to the entry. For example, the source device may create a unique identifier for the sound object. The source device may transmit the entry, which includes the sound object (e.g., audio data and/or metadata associated with the sound object), to the audio receiver device for storage in the receiver device's local library. As a result, the next time the sound object is identified by the source device, rather than transmitting the sound object, the source device may transmit the sound-object sonic descriptor that includes the unique index identifier. In turn, the receiver device may retrieve the corresponding sound object for spatial rendering through two or more speakers, as described herein.
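The library-building step above can be sketched as follows. Matching here is by an exact characteristics key, whereas a real system would compare acoustic features; the data layout and identifier scheme are assumptions for illustration:

```python
def lookup_or_add(library, characteristics, audio):
    """Return (index_id, new_entry). new_entry is the entry to transmit once to
    the receiver for mirroring, or None when the sound object is already known."""
    for index_id, entry in library.items():
        if entry["characteristics"] == characteristics:
            return index_id, None              # known object: a descriptor suffices
    index_id = max(library, default=0) + 1     # create a new unique identifier
    entry = {"characteristics": characteristics, "audio": audio}
    library[index_id] = entry
    return index_id, entry                     # new object: transmit the full entry

library = {}
first_id, first_entry = lookup_or_add(library, "dog_bark", [0.1, -0.2])
second_id, second_entry = lookup_or_add(library, "dog_bark", [0.1, -0.2])
```

The first call returns a full entry to send to the receiver; the second call returns only the identifier, so subsequent occurrences of the same sound cost just a descriptor.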
[0008] The above summary does not include an exhaustive list of all aspects of the present disclosure. It is contemplated that the disclosure includes all systems and methods that can be practiced from all suitable combinations of the various aspects summarized above, as well as those disclosed in the Detailed Description below and particularly pointed out in the claims filed with the application. Such combinations have particular advantages not specifically recited in the above summary.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The aspects of the disclosure are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” aspect of the disclosure are not necessarily to the same aspect, and they mean at least one. Also, in the interest of conciseness and reducing the total number of figures, a given figure may be used to illustrate the features of more than one aspect of the disclosure, and not all elements in the figure may be required for a given aspect.
[0010] FIG. 1 shows a block diagram of an audio source device according to one aspect of the disclosure.
[0011] FIG. 2 shows a block diagram of operations performed by a sound object & sound bed identifier to identify a sound object according to one aspect of the disclosure.
[0012] FIG. 3 shows a sound-object sonic descriptor produced by the audio source device according to one aspect of the disclosure.
[0013] FIG. 4 shows a block diagram of an audio receiver device according to one aspect of the disclosure.
[0014] FIG. 5 is a flowchart of one aspect of a process to reduce bandwidth that is required to transmit audio data.
[0015] FIG. 6 is a signal diagram of a process for an audio source device to transmit lightweight sound representations of sound objects and for an audio receiver device to use the representations to reproduce and playback the sound objects according to one aspect of the disclosure.
[0016] FIG. 7 is a signal diagram of a process for building and updating a sound library.
DETAILED DESCRIPTION
[0017] Several aspects of the disclosure are now explained with reference to the appended drawings. Whenever the shapes, relative positions, and other aspects of the parts described in the aspects are not explicitly defined, the scope of the disclosure is not limited only to the parts shown, which are meant merely for the purpose of illustration. Also, while numerous details are set forth, it is understood that some aspects of the disclosure may be practiced without these details. In other instances, well-known circuits, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description. In one aspect, ranges disclosed herein may include any value (or quantity) between end point values and/or the end point values.
[0018] A physical environment (or setting) refers to a physical world that people can sense and/or interact with without aid of electronic systems. Physical environments, such as a physical park, include physical articles, such as physical trees, physical buildings, and physical people. People can directly sense and/or interact with the physical environment, such as through sight, touch, hearing, taste, and smell.
[0019] In contrast, a computer-generated reality (CGR) environment (setting) refers to a wholly or partially simulated environment that people sense and/or interact with via an electronic system. In CGR, a subset of a person’s physical motions, or representations thereof, are tracked, and, in response, one or more characteristics of one or more virtual objects simulated in the CGR environment are adjusted in a manner that comports with at least one law of physics. For example, a CGR system may detect a person’s head turning and, in response, adjust graphical content and an acoustic field presented to the person in a manner similar to how such views and sounds would change in a physical environment. In some situations (e.g., for accessibility reasons), adjustments to characteristic(s) of virtual object(s) in a CGR environment may be made in response to representations of physical motions (e.g., vocal commands).
[0020] A person may sense and/or interact with a CGR object using any one of their senses, including sight, sound, touch, taste, and smell. For example, a person may sense and/or interact with audio objects that create a 3D or spatial audio environment that provides the perception of point audio sources in 3D space. In another example, audio objects may enable audio transparency, which selectively incorporates ambient sounds from the physical environment with or without computer-generated audio. In some CGR environments, a person may sense and/or interact only with audio objects.
[0021] Examples of CGR include virtual reality and mixed reality. A virtual reality (VR) environment refers to a simulated environment that is designed to be based entirely on computer-generated sensory inputs for one or more senses. A VR environment comprises a plurality of virtual objects with which a person may sense and/or interact. For example, computer-generated imagery of trees, buildings, and avatars representing people are examples of virtual objects. A person may sense and/or interact with virtual objects in the VR environment through a simulation of the person’s presence within the computer-generated environment, and/or through a simulation of a subset of the person’s physical movements within the computer-generated environment.
[0022] In contrast to a VR environment, which is designed to be based entirely on computer-generated sensory inputs, a mixed reality (MR) environment refers to a simulated environment that is designed to incorporate sensory inputs from the physical environment, or a representation thereof, in addition to including computer-generated sensory inputs (e.g., virtual objects). On a virtuality continuum, a mixed reality environment is anywhere between, but not including, a wholly physical environment at one end and a virtual reality environment at the other end.
[0023] In some MR environments, computer-generated sensory inputs may respond to changes in sensory inputs from the physical environment. Also, some electronic systems for presenting an MR environment may track location and/or orientation with respect to the physical environment to enable virtual objects to interact with real objects (that is, physical articles from the physical environment or representations thereof). For example, a system may account for movements so that a virtual tree appears stationary with respect to the physical ground.
[0024] Examples of mixed realities include augmented reality and augmented virtuality. An augmented reality (AR) environment refers to a simulated environment in which one or more virtual objects are superimposed over a physical environment, or a representation thereof. For example, an electronic system for presenting an AR environment may have a transparent or translucent display through which a person may directly view the physical environment. The system may be configured to present virtual objects on the transparent or translucent display, so that a person, using the system, perceives the virtual objects superimposed over the physical environment. Alternatively, a system may have an opaque display and one or more imaging sensors that capture images or video of the physical environment, which are representations of the physical environment. The system composites the images or video with virtual objects, and presents the composition on the opaque display. A person, using the system, indirectly views the physical environment by way of the images or video of the physical environment, and perceives the virtual objects superimposed over the physical environment. As used herein, a video of the physical environment shown on an opaque display is called “pass-through video,” meaning a system uses one or more image sensor(s) to capture images of the physical environment, and uses those images in presenting the AR environment on the opaque display. Further alternatively, a system may have a projection system that projects virtual objects into the physical environment, for example, as a hologram or on a physical surface, so that a person, using the system, perceives the virtual objects superimposed over the physical environment.
[0025] An augmented reality environment also refers to a simulated environment in which a representation of a physical environment is transformed by computer-generated sensory information. For example, in providing pass-through video, a system may transform one or more sensor images to impose a select perspective (e.g., viewpoint) different than the perspective captured by the imaging sensors. As another example, a representation of a physical environment may be transformed by graphically modifying (e.g., enlarging) portions thereof, such that the modified portion may be representative but not photorealistic versions of the originally captured images. As a further example, a representation of a physical environment may be transformed by graphically eliminating or obfuscating portions thereof.
[0026] An augmented virtuality (AV) environment refers to a simulated environment in which a virtual or computer generated environment incorporates one or more sensory inputs from the physical environment. The sensory inputs may be representations of one or more characteristics of the physical environment. For example, an AV park may have virtual trees and virtual buildings, but people with faces photorealistically reproduced from images taken of physical people. As another example, a virtual object may adopt a shape or color of a physical article imaged by one or more imaging sensors. As a further example, a virtual object may adopt shadows consistent with the position of the sun in the physical environment.
[0027] There are many different types of electronic systems that enable a person to sense and/or interact with various CGR environments. Examples include head mounted systems (or head mounted devices (HMDs)), projection-based systems, heads-up displays (HUDs), vehicle windshields having integrated display capability, windows having integrated display capability, displays formed as lenses designed to be placed on a person’s eyes (e.g., similar to contact lenses), headphones/earphones, speaker arrays, input systems (e.g., wearable or handheld controllers with or without haptic feedback), smartphones, tablets, and desktop/laptop computers. A head mounted system may have one or more speaker(s) and an integrated opaque display. Alternatively, a head mounted system may be configured to accept an external opaque display (e.g., a smartphone). The head mounted system may incorporate one or more imaging sensors to capture images or video of the physical environment, and/or one or more microphones to capture audio of the physical environment. Rather than an opaque display, a head mounted system may have a transparent or translucent display. The transparent or translucent display may have a medium through which light representative of images is directed to a person’s eyes. The display may utilize digital light projection, OLEDs, LEDs, uLEDs, liquid crystal on silicon, laser scanning light source, or any combination of these technologies. The medium may be an optical waveguide, a hologram medium, an optical combiner, an optical reflector, or any combination thereof. In one embodiment, the transparent or translucent display may be configured to become opaque selectively. Projection-based systems may employ retinal projection technology that projects graphical images onto a person’s retina. Projection systems also may be configured to project virtual objects into the physical environment, for example, as a hologram or on a physical surface.
[0028] With the proliferation of electronic devices in homes and businesses that are interconnected with each other over the Internet (such as in an Internet of Things (IoT) system), the speed and rate of data transmission (or data transfer rate) over the Internet (e.g., to a remote server) via a computer network (e.g., a Local Area Network (LAN)) becomes an important issue. For instance, electronic devices that are on one LAN may each share the same internet connection via an access point, such as a cable modem that exchanges data (e.g., transmits and receives Internet Protocol (IP) packets) with other remote devices via an internet service provider (ISP). The internet connection with the ISP may have a limited Internet bandwidth based on several factors, such as the type of cable modem that is being used. For instance, different cable modems may support different connection speeds (e.g., over 150 Mbps) depending on which Data Over Cable Service Interface Specification (DOCSIS) standard is supported by the cable modem.
[0029] Bandwidth is also an issue with wireless electronic devices that communicate with each other over a wireless local area network (WLAN), such as multimedia gaming systems, security devices, and portable personal devices (e.g., smart phones, computer tablets, laptops, etc.). For instance, along with having a shared limited Internet bandwidth (when these devices communicate with other devices over the Internet), the wireless electronic devices may share a wireless bandwidth, which is the rate of data transmission between a wireless router and devices within the WLAN. This bandwidth may vary between devices based on several additional factors, such as the type of IEEE 802.11x standard supported by the wireless router that is supplying the WLAN and the distance between the wireless electronic devices and the wireless router. Since the number of wireless electronic devices in homes and businesses is increasing, each vying for a portion of the available wireless bandwidth (and/or Internet bandwidth), the bandwidth requirement for these devices may exceed the available bandwidth. In this case, each device may be allocated a smaller portion of available bandwidth, resulting in a slower data-transfer rate.
[0030] Applications executing on electronic devices that rely on close-to-real-time data transmission may be most affected by a slower data rate (or slower throughput). For instance, applications that cause the electronic device to engage in a communication session (e.g., a Voice over Internet Protocol (VoIP) phone call) may require a certain amount of bandwidth (or throughput). For example, to engage in a communication session, the electronic device (e.g., source device) may capture audio data (e.g., using a microphone integrated therein) and (e.g., wirelessly) transmit the audio data to another electronic device (e.g., receiving device) as an uplink. In order to preserve a real-time user experience on the receiving device, a certain minimum threshold of bandwidth may be necessary. As another example, both devices may engage in a video conference in which both devices transmit audio/video data in real time. When the required bandwidth exceeds what is available, the electronic device may adjust application settings (e.g., sound quality, video quality, etc.) in order to reduce the amount of bandwidth required to conduct the video conference. In some cases, however, the adjustment may be insufficient and the application may be forced to terminate data transmission entirely (e.g., by ending the phone call or video conference).
[0031] As another example, an electronic device (e.g., a wireless earphone) may experience bandwidth or throughput issues while communicatively coupled or paired with a media playback device (e.g., smart phone) that is engaged in a communication session. For instance, a user may participate in a handsfree phone call that is initiated by a media playback device, but conducted through the wireless earphone. In this case, the wireless earphone may establish a communication link, via a wireless personal area network (WPAN), using any wireless protocol, such as BLUETOOTH protocol. During the phone call, the throughput of data packets may decrease (e.g., based on the distance between the wireless earphone and the media playback device). As a result, the media playback device may drop the phone call. Therefore, there is a need for a reduction in the bandwidth (or throughput) requirement for applications that transmit audio data to other devices.
[0032] To accomplish this, the present disclosure describes an electronic device (e.g., an audio source device) that is capable of performing bandwidth-reduction operations to reduce the amount of (e.g., audio) data to be transmitted to another electronic device (e.g., an audio receiver device) via a communication data link. Specifically, the audio source device is configured to obtain several audio signals produced by an array of microphones and process the audio signals to produce a speech signal and a set of ambient signals. The device processes the set of ambient signals to produce a plurality of sound-object sonic descriptors that have metadata describing sound objects or sound assets (e.g., a sound within the ambient environment in which the device is located, such as a car honk) within the ambient signals. For instance, the metadata may include an index identifier that uniquely identifies the sound object, as well as other information (or data) regarding the sound object, such as its position with respect to the source device. In one aspect, the sound-object sonic descriptor may have a lower file size than the ambient signals. Rather than transmit the speech signal and the ambient signals, the device transmits the speech signal and the sound-object sonic descriptor, which may have a significantly lower file size than the ambient signals, to an audio receiver device. The receiver device is then configured to use the sound-object sonic descriptor to spatially reproduce the sound object with the speech signal, to produce several mixed signals to drive speakers. Thus, instead of transmitting the ambient signals or the sound object (which may include an audio signal), the audio source device may reduce the bandwidth requirement (or necessary throughput) for transmitting the audio data to the audio receiver device by transmitting the sound-object sonic descriptor in place of at least one of the ambient signals.
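To illustrate the scale of the savings the disclosure describes, the following sketch compares the data rate of transmitting raw multichannel ambient audio against a compact metadata descriptor. All of the concrete numbers (channel count, sample format, descriptor size, update rate) are illustrative assumptions, not values taken from the disclosure:

```python
# Hypothetical illustration of the bandwidth savings described above:
# raw ambient audio vs. a compact sound-object sonic descriptor.

def pcm_bitrate(channels: int, sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate (bits/s) of uncompressed PCM audio."""
    return channels * sample_rate_hz * bits_per_sample

# Assumed: four ambient microphone channels of 16-bit / 48 kHz PCM.
ambient_bps = pcm_bitrate(channels=4, sample_rate_hz=48_000, bits_per_sample=16)

# Assumed: a descriptor is a ~64-byte metadata record (index identifier,
# azimuth, elevation, distance, loudness) sent 20 times per second.
descriptor_bps = 64 * 8 * 20

print(ambient_bps)     # 3,072,000 bits/s for the raw ambient channels
print(descriptor_bps)  # 10,240 bits/s for the descriptor stream
```

Under these assumed numbers the descriptor stream needs roughly 1/300th of the rate of the raw ambient channels, which is why transmitting the descriptor in place of the ambient signals reduces the required throughput.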
[0033] In one aspect, “bandwidth” may correspond to an amount of data that can be sent from the audio source device to the audio receiver device in a certain period of time. In another aspect, as described herein, bandwidth or available throughput may correspond to a data rate (or throughput) that is necessary for a source device to transmit audio data to a receiver device in order for the receiver device to render and output the audio data at a given level of audio quality. This data rate, however, may exceed the bandwidth that is available at the source device and/or the receiver device. Thus, as described herein, in order to maintain audio quality, the source device may adjust the amount of audio data for transmission based on the bandwidth or available throughput at either side. More about this process is described herein.
[0034] As used herein, a “sound object” may refer to a sound that is captured by at least one microphone of an electronic device within an acoustic environment in which the electronic device is located. The sound object may include audio data (or an audio signal) that contains the sound and/or metadata that describes the sound. For instance, the metadata may include position data of the sound within the acoustic environment, with respect to the electronic device, and other data that describes the sound (e.g., loudness data, etc.). In one aspect, the metadata may include a physical description of the sound object (e.g., size, shape, color, etc.).
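As a sketch of the kind of structure a sound object described above might take, the following data class carries both the captured audio and the descriptive metadata. The field names and example values are illustrative assumptions for this sketch, not a format defined by the disclosure:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SoundObject:
    """Illustrative sound object: captured audio plus descriptive metadata."""
    index_id: str                 # unique identifier for the sound object
    azimuth_deg: float            # position relative to the capturing device
    elevation_deg: float
    distance_m: float
    loudness_spl_db: float        # e.g., an SPL measurement
    physical_description: dict = field(default_factory=dict)  # size, shape, color...
    audio: Optional[bytes] = None  # PCM samples; may be omitted in a descriptor

# A descriptor-style instance: metadata only, no audio payload.
bark = SoundObject("dog_bark", azimuth_deg=45.0, elevation_deg=0.0,
                   distance_m=3.0, loudness_spl_db=70.0,
                   physical_description={"color": "brown", "breed": "terrier"})
```

Leaving `audio` as `None` mirrors the bandwidth-reduction idea: the metadata record can travel without the much larger audio payload.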
[0035] FIG. 1 shows a block diagram illustrating an audio source device 1 for performing audio data bandwidth reduction operations according to one aspect of the disclosure. In one aspect, the audio source device 1 may be any electronic device that is capable of capturing, using at least one microphone, the sound of an ambient acoustic environment as audio data (or one or more audio signals), and (wirelessly) transmitting a sonic descriptor (e.g., a data structure) that includes metadata describing the audio data to another electronic device. Examples of such devices may include a headset, a head-mounted device (HMD), such as smart glasses, or a wearable device (e.g., a smart watch, headband, etc.). Other examples of such devices may include headphones, such as in-ear (e.g., wireless earphones or earbuds), on-ear, or over-the-ear headphones. Thus, “headphones” may include a pair of headphones (e.g., with two earcups) or at least one earphone (or earbud).
[0036] As described herein, the device 1 may be a wireless electronic device that is configured to establish a wireless communication data link via a network interface 6 with another electronic device, over a wireless computer network (e.g., a wireless personal area network (WPAN)) using e.g., BLUETOOTH protocol or a WLAN in order to exchange data. In one aspect, the network interface 6 is configured to establish a wireless communication link with a wireless access point in order to exchange data with a remote electronic server (e.g., over the Internet). In another aspect, the network interface 6 may be configured to establish a communication link via a mobile voice/data network that employs any type of wireless telecom protocol (e.g., a 4G Long Term Evolution (LTE) network).
[0037] In one aspect, the audio source device 1 may be a part of a computer system that includes a separate (e.g., companion) device, such as a smart phone or laptop, with which the source device 1 establishes a (e.g., wired and/or wireless) connection in order to pair both devices together. In one aspect, the (e.g., programmed processor of the) companion device may perform one or more of the operations described herein, such as bandwidth reduction operations. For instance, the companion device may obtain microphone signals from the source device 1, and perform the reduction operations, as described herein. In another aspect, at least some of the elements of the source device 1 may be a part of the companion device (or another electronic device) within the system. More about the elements of the source device 1 is described herein.
[0038] The audio source device 1 includes a microphone array 2 that has “n” number of microphones 3, one or more cameras 4, a controller 5, and the network interface 6. Each microphone 3 may be any type of microphone (e.g., a differential pressure gradient micro-electromechanical system (MEMS) microphone) that is configured to convert acoustic energy caused by sound waves propagating in the acoustic (e.g., physical) environment into an audio (or microphone) signal. The camera 4 is configured to capture image data (e.g., digital images) and/or video data (which may be represented as a series of digital images) that represents a scene of the physical environment in the field of view of the camera 4. In one aspect, the camera 4 is a Complementary Metal-Oxide-Semiconductor (CMOS) image sensor. In another aspect, the camera may be a Charged-Coupled Device (CCD) camera type. In some aspects, the camera may be any type of digital camera.
[0039] The controller 5 may be a special-purpose processor such as an Application-Specific Integrated Circuit (ASIC), a general purpose microprocessor, a Field-Programmable Gate Array (FPGA), a digital signal controller, or a set of hardware logic structures (e.g., filters, arithmetic logic units, and dedicated state machines). The controller 5 is configured to perform audio data bandwidth-reduction operations, as described herein. In one aspect, the controller 5 may perform other operations, such as audio/image processing operations, networking operations, and/or rendering operations. More about how the controller 5 may perform these operations is described herein.
[0040] In one aspect, the audio source device may include more or fewer components than described herein. For instance, the audio source device 1 may include more or fewer microphones 3 and/or cameras 4. As another example, the audio source device 1 may include other components, such as one or more speakers and/or one or more display screens. More about these other components is described herein.
[0041] The controller 5 includes a speech & ambient separator 7, a sound library 9, and a sound object & sound bed identifier 10. In one aspect, the controller may optionally include a phoneme identifier 12. More about this operational block is described herein. In one aspect, although illustrated as being separate, (a portion of) the network interface 6 may be a part of the controller 5.
[0042] The process by which the audio source device 1 may perform audio bandwidth-reduction operations, while transmitting audio data to an audio receiver device 20 for presentation, will now be described. The audio device 1 captures, using one or more of the n microphones 3 of the microphone array 2, sounds from within the acoustic environment as one or more (microphone) audio signals. Specifically, the audio signals include speech 16 that is spoken by a person (e.g., a user of the device 1) and other ambient sounds, such as a dog barking 17 and wind noise 18 (which may include leaves rustling). The speech & ambient separator 7 is configured to obtain (or receive) at least some of the audio (or microphone) signals produced by the n microphones and to process the audio signals to separate the speech 16 from the ambient sounds (e.g., 17 and 18). Specifically, the separator produces a speech signal (or audio signal) that contains mostly (or only) the speech 16 captured by the microphones of the array 2. The separator also produces one or more (or a set of) ambient signals that include mostly (or only) the ambient sound(s) from within the acoustic environment in which the source device 1 is located. In one aspect, each of the “n” ambient signals corresponds to a particular microphone 3 in the array 2. In another aspect, the set of ambient signals may be more (or fewer) than the number of audio signals produced by the microphones 3 in the array 2. In some aspects, the separator 7 separates the speech by performing a speech (or voice) detection algorithm upon the microphone signals to detect the speech 16. The separator 7 may then produce a speech signal according to the detected speech. In one aspect, the separator 7 may perform noise suppression operations on one or more of the audio signals to produce the speech signal (which may be one audio signal from one microphone or a mix of multiple audio signals).
The separator 7 may produce the ambient signals by suppressing the speech contained in at least some of the microphone signals. In one aspect, the separator 7 may perform noise suppression operations upon the microphone signals in order to improve the Signal-to-Noise Ratio (SNR). For instance, the separator 7 may spectrally shape at least some of the signals (e.g., the speech signal) to reduce noise. In one aspect, the separator 7 may perform any method to separate the speech signal from the audio signals and/or to suppress the speech in the audio signals to produce the ambient signals. In one aspect, the ambient signals may include at least some speech (e.g., from a different talker than the user of the device 1).
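The disclosure leaves the separation method open ("any method"). As one minimal sketch of the idea, a crude frame-energy voice-activity gate can route speech-dominant frames to the speech signal and the remainder to an ambient signal. This is a stand-in for the separator's speech-detection algorithm, not the actual implementation; the frame size and threshold are arbitrary assumptions:

```python
import numpy as np

def split_speech_ambient(x: np.ndarray, frame: int = 256, threshold: float = 0.02):
    """Partition a mono signal into speech-dominant and ambient-dominant
    streams using a crude per-frame RMS energy gate (a toy stand-in for a
    real voice-activity detector)."""
    n = len(x) - len(x) % frame          # drop any partial trailing frame
    speech = np.zeros(n)
    ambient = np.zeros(n)
    for start in range(0, n, frame):
        seg = x[start:start + frame]
        if np.sqrt(np.mean(seg ** 2)) > threshold:   # energetic frame -> speech
            speech[start:start + frame] = seg
        else:                                        # quiet frame -> ambient
            ambient[start:start + frame] = seg
    return speech, ambient
```

A real separator would operate in the spectral domain and preserve overlapping speech and ambience within the same frame; the gate above only conveys the routing concept.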
[0043] The sound object & sound bed identifier 10 is configured to identify a sound object contained within the acoustic environment (e.g., within the ambient signals) and/or identify an ambient or diffuse background sound as (at least part of) a sound bed of the acoustic environment. As described herein, a sound object is a particular sound that is captured by the microphone array 2, such as the dog bark 17. In one aspect, a sound object is a sound that may occur aperiodically within the environment. In another aspect, a sound object is a particular or specific sound produced by a sound source (or object) within the environment. An example of a sound object may be the dog bark 17, which may be made by a particular breed of dog as the sound source. A sound that is a part of a sound bed, however, may be an ambient or diffuse background sound or noise that may occur continuously, or may be a recurring sound that is associated with a particular environment. An example may be the sound of a refrigerator’s condenser that periodically turns on and off. In one aspect, ambient background noise that is diffuse within the environment, and thus does not have a particular sound source, may be a part of the sound bed, such as the wind noise 18. In another aspect, general ambient sounds (e.g., sounds that may sound the same at multiple locations) may be a part of the sound bed. Specifically, sounds that contain audio content that is indistinguishable from other similar sounds may be associated with the sound bed. For example, as opposed to a dog bark, which may change between breeds of dogs, the sound of wind noise 18 may be the same (e.g., the spectral content of different wind noise may be similar), regardless of location. In one aspect, sound objects may be associated with, or a part of, the sound bed.
[0044] The sound object & sound bed identifier 10 identifies sound objects and sound beds as follows. The identifier is configured to obtain and process at least one of the set of ambient signals to 1) identify a sound source (e.g., a position of the sound source within the acoustic environment) in at least one of the ambient signals and 2) produce spatial sound-source data that spatially represents the sound of the sound source (e.g., having data that indicates the position of the sound source with respect to the device 1). For instance, the spatial sound-source data may be an angular/parametric representation of the sound source with respect to the audio source device 1. Specifically, the sound-source data indicates a three-dimensional (3D) position of the sound source with respect to the device (e.g., located on a virtual sphere surrounding the device) as position data (e.g., elevation, azimuth, distance, etc.). In one aspect, any method may be performed to produce the angular/parametric representation of the sound source, such as a Higher Order Ambisonics (HOA) representation of the sound source produced by encoding the sound source into HOA B-Format by panning and/or upmixing at least one of the ambient signals. In another aspect, the spatial sound-source data may include audio data (or an audio signal) of the sound and metadata associated with the sound (e.g., position data). For example, the audio data may be digital audio data (e.g., pulse-code modulation (PCM) digital audio information, etc.) of sound that is projected from an identified sound source. Thus, in some aspects, the spatial sound-source data may include position data of the sound source (e.g., as metadata) and/or audio data associated with the sound source.
As an example, spatial sound-source data of the dog bark 17 may include an audio signal that contains the bark 17 and position data of the source (e.g., the dog’s mouth) of the bark 17, such as azimuth and elevation with respect to the device 1 and/or distance between the source and the device 1. In one aspect and as described herein, the identified sound source may be associated with a sound object, which may be identified using the spatial sound-source data.
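The Ambisonic encoding mentioned above can be sketched at first order. The panning gains below are the standard first-order B-format direction gains for a point source at a given azimuth and elevation; this is a simplified first-order sketch (normalization and channel-ordering conventions vary between Ambisonic formats, and the disclosure refers to HOA generally):

```python
import math

def encode_foa(azimuth_deg: float, elevation_deg: float):
    """First-order Ambisonic (B-format) panning gains for a point source
    at the given direction: returns (W, X, Y, Z) channel gains."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = 1.0                                # omnidirectional component
    x = math.cos(az) * math.cos(el)        # front/back
    y = math.sin(az) * math.cos(el)        # left/right
    z = math.sin(el)                       # up/down
    return w, x, y, z

# Encoding a mono ambient signal is then a multiply: each B-format channel
# is the mono signal scaled by its gain for the source's direction.
```

For example, a source directly ahead (azimuth 0°, elevation 0°) yields gains (1, 1, 0, 0), while a source at azimuth 90° contributes fully to the Y (left/right) channel.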
[0045] In one aspect, the identifier 10 may include a sound pickup microphone beamformer that is configured to process the ambient audio signals (or the microphone signals) to form at least one directional beam pattern in a particular direction, so as to be more sensitive to a sound source in the environment. In one aspect, the identifier 10 may use position data of the sound source to direct a beam pattern towards the source. In one aspect, the beamformer may use any method to produce a beam pattern, such as time delay of arrival and delay and sum beamforming to apply beamforming weights (or weight vectors) upon the audio signals to produce at least one sound pickup output beamformer signal that includes the directional beam pattern aimed towards the sound source. Thus, the spatial sound source data may include at least one sound pickup output beamformer signal that includes the produced beam pattern that includes at least one sound source. More about using beamformers is described herein.
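The delay-and-sum approach named above can be sketched in the time domain with integer-sample steering delays: each channel is delayed to compensate for the time-of-arrival differences toward the source, so that the source's wavefront adds coherently while off-axis sound does not. A minimal sketch (integer delays only; a practical beamformer would use fractional delays or frequency-domain weights):

```python
import numpy as np

def delay_and_sum(signals: np.ndarray, delays: np.ndarray) -> np.ndarray:
    """Time-domain delay-and-sum beamformer: apply an integer-sample
    steering delay to each microphone channel, then average the channels."""
    n_mics, n = signals.shape
    out = np.zeros(n)
    for m in range(n_mics):
        d = int(delays[m])
        if d > 0:
            out[d:] += signals[m, : n - d]   # shift channel m right by d samples
        else:
            out += signals[m]
    return out / n_mics

# Example: two microphones hear the same impulse two samples apart.
sig = np.zeros((2, 10))
sig[0, 5] = 1.0   # impulse reaches mic 0 at sample 5
sig[1, 3] = 1.0   # same impulse reaches mic 1 two samples earlier
beam = delay_and_sum(sig, np.array([0, 2]))  # steer toward that source
```

After steering, the two copies of the impulse align at sample 5 and add coherently, which is the directional gain the identifier 10 exploits to be more sensitive toward an identified sound source.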
[0046] The sound library 9 may be a table (e.g., in a data structure that is stored in local memory) having an entry for one or more (e.g., predefined) sound objects. Each entry may include metadata that describes the sound object of a corresponding entry. For instance, the metadata may include a unique index identifier (e.g., a text identifier) that is associated with a sound object, such as the dog bark 17. In addition, the metadata of an entry may include descriptive data that describes (or includes) physical characteristics of a sound object (or of the source of the sound object). For instance, returning to the previous example, when the sound source is a dog and the sound object is the bark 17, the descriptive data may include the type (or breed) of dog, the color of the dog, the shape/size of the dog, the position of the dog (with respect to the device 1), and any other physical characteristics of the dog. In some aspects, the metadata may include position data, such as global positioning system coordinates or position data that is relative to the audio source device 1, for example azimuth, elevation, distance, etc. In one aspect, the metadata may include sound characteristics of the sound object, such as (at least a portion of) audio data containing the sound object (e.g., PCM digital audio, etc.), samples of spectral content of the sound object, loudness data (e.g., a sound pressure level (SPL) measurement, a loudness, K-weighted, relative to full scale (LKFS) measurement, etc.), and other sound characteristics such as tone, timbre, etc. Thus, with respect to dog barks, the library 9 may include a dog bark entry for each type of dog. In some aspects, some entries may include more (or less) metadata than other entries in the library 9.
[0047] In one aspect, at least some of the entries may be predefined in a controlled setting (e.g., produced in a laboratory and stored in memory of the device 1). As described herein, at least some of the entries may be created by the audio source device 1 (or another device, such as the audio receiver device 20). For example, if it is determined that a sound object is not contained within the sound library 9, an entry for the sound object may be created by the identifier 10 and stored within the library 9. More about creating entries in the library 9 is described herein.
[0048] The sound object & sound bed identifier 10 is configured to use (or process) the spatial sound-source data to identify the source’s associated sound object. In one aspect, the identifier 10 may use a sound identification algorithm to identify the sound object. Continuing with the previous example, to identify the bark 17, the identifier 10 may analyze the audio data within the spatial sound-source data to identify one or more sound characteristics of the audio data (e.g., spectral content, etc.) that is associated with a bark, or more particularly with the specific bark 17 (e.g., from that specific breed of dog). In another aspect, the identifier 10 may perform a table lookup into the sound library 9 using the spatial sound-source data to identify the sound object as a matching sound object (or entry) contained therein. Specifically, the identifier 10 may perform the table lookup to compare the spatial sound-source data (e.g., the audio data and/or metadata) with at least some of the (e.g., metadata of the) entries contained within the library 9. For instance, the identifier 10 may compare the audio data and/or position data of the spatial sound-source data with stored audio data and/or stored position data of each sound object of the library 9. Thus, the identifier 10 identifies a matching predefined sound object within the library 9, when the audio data and/or position data of the sound-source data matches at least some of the sound characteristics of a sound object (or entry) within the library 9. In one aspect, to identify a sound object, the identifier 10 can match the spatial sound-source data to at least some of the stored metadata up to a tolerance (e.g., 5%, 10%, 15%, etc.). In other words, a matching predefined sound object in the library 9 does not necessarily need to be an exact match.
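The inexact table lookup described above can be sketched as a comparison of observed sound characteristics against each library entry's stored metadata, accepting a match within a relative tolerance. The entry layout and feature names (spectral centroid, SPL) are illustrative assumptions for this sketch, not the disclosure's library format:

```python
def match_sound_object(features: dict, library: list, tolerance: float = 0.10):
    """Return the index identifier of the first library entry whose stored
    features all match the observed features within a relative tolerance
    (i.e., an inexact match, per the lookup described above)."""
    for entry in library:
        stored = entry["features"]
        if all(abs(features[k] - stored[k]) <= tolerance * max(abs(stored[k]), 1e-9)
               for k in stored):
            return entry["index_id"]
    return None   # no match: a new entry could be created (paragraph [0047])

# A toy sound library with two predefined entries (hypothetical values).
library = [
    {"index_id": "dog_bark_terrier", "features": {"centroid_hz": 1200.0, "spl_db": 70.0}},
    {"index_id": "wind_noise",       "features": {"centroid_hz": 300.0,  "spl_db": 55.0}},
]

print(match_sound_object({"centroid_hz": 1150.0, "spl_db": 72.0}, library))
```

Here the observed features (1150 Hz, 72 dB) differ from the stored dog-bark entry by less than 10%, so the lookup returns that entry even though the match is not exact, mirroring the tolerance-based matching described above.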
[0049] In one aspect, in addition to (or in lieu of) using sound characteristics (or metadata) of the spatial sound-source data to identify the sound object, the identifier 10 may use image data captured by the camera 4 to (help) identify the sound object within the environment. The identifier 10 may perform an object recognition algorithm upon the image data to identify an object within the field of view of the camera. For instance, the algorithm may determine (or identify) descriptive data that describes physical characteristics of an object, such as shape, size, color, movement, etc. The identifier 10 may perform the table lookup into the sound library 9 using the determined descriptive data to identify the sound object with (at least partially) matching descriptive data. For instance, the identifier 10 may compare physical characteristics of an object (such as hair color of a dog) with the hair color of at least some of the entries in the sound library that relate to dogs. In another aspect, the identifier 10 may perform a separate table lookup into a data structure that associates descriptive data with predefined objects. Once matching physical characteristics are found (which may be within a tolerance threshold), the identifier 10 identifies an object within the field of view of the camera as at least one of the predefined objects.
[0050] In one aspect, the identifier 10 is configured to use (or process) the spatial sound-source data to identify the sound (or sound object) associated with the source data as (a part of) a sound bed of the acoustic environment. In one aspect, a sound object that is determined to be an ambient or diffuse background sound is treated by the identifier 10 as a part of the sound bed of the environment. In one aspect, the identifier 10 may perform operations similar to those performed to identify the source’s associated sound object. In one aspect, upon identifying a matching entry in the sound library, the metadata of the entry may indicate that the sound is a part of the sound bed. In another aspect, the identifier may determine that a sound (object) associated with the spatial sound-source data is a part of the sound bed based on a determination that the sound occurs at least two times within a threshold period of time (e.g., ten seconds), indicating that the sound is an ambient background sound. In another aspect, the identifier 10 may determine a sound to be a part of the sound bed if the sound is continuous (e.g., constant, such as being above a sound level, for a period of time, such as ten seconds). In another aspect, the identifier 10 may determine that a sound of the spatial sound-source data is a part of the sound bed based on the diffusiveness of the sound. As another example, the identifier 10 may determine whether a sound is similar to multiple (e.g., more than one) entries within the library 9, indicating that the sound is more generic and therefore may be a part of the sound bed.
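Two of the sound-bed heuristics above (recurrence within a threshold window, and continuity beyond a threshold duration) can be sketched as a simple classifier. The ten-second thresholds follow the examples in the paragraph; the function shape and names are illustrative assumptions:

```python
def is_sound_bed(event_times: list, duration_s: float,
                 repeat_window_s: float = 10.0, continuous_s: float = 10.0) -> bool:
    """Heuristic sound-bed classification: a sound belongs to the sound bed
    if a single occurrence is continuous for at least `continuous_s` seconds,
    or if it occurs at least twice within `repeat_window_s` seconds."""
    if duration_s >= continuous_s:      # continuous background sound (e.g., wind)
        return True
    times = sorted(event_times)
    return any(b - a <= repeat_window_s  # recurring sound (e.g., a condenser cycling)
               for a, b in zip(times, times[1:]))
```

For example, a one-second sound heard at t=0 s and again at t=5 s would be classified as sound bed (two occurrences within ten seconds), while the same sound heard only at t=0 s and t=30 s would not.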
[0051] In some aspects, the identifier 10 may employ other methods to identify a sound object. For instance, the source device 1 may leverage audio data (or audio signals) produced by the microphone array 2 and image data produced by the camera 4 to identify sound objects within the environment in which the device 1 is located. Specifically, the device 1 may identify a sound object (or object) within the environment through the use of object recognition algorithms and use the identification of the sound object to better steer (or produce) directional sound patterns towards the object, thereby reducing noise that may otherwise be captured using conventional pre-trained beamformers. FIG. 2 shows a block diagram of operations performed by a sound object & sound bed identifier 10 to identify and produce a sound object (and/or of a sound bed), according to one aspect of the disclosure. Specifically, this figure illustrates operations that may be performed by the identifier 10 of the (controller 5 of the) audio source device 1. As shown, the diagram includes a parameter estimator 70, a source separator 71, and a directivity estimator 72.