
Facebook Patent | Optimizations For Dynamic Object Instance Detection, Segmentation, And Structure Mapping

Patent: Optimizations For Dynamic Object Instance Detection, Segmentation, And Structure Mapping

Publication Number: 10565729

Publication Date: 20200218

Applicants: Facebook

Abstract

In one embodiment, a method includes a system accessing an image and generating a feature map using a first neural network. The system identifies a plurality of regions of interest in the feature map. A plurality of regional feature maps may be generated for the plurality of regions of interest, respectively. Using a second neural network, the system may detect at least one regional feature map in the plurality of regional feature maps that corresponds to a person depicted in the image, and generate a target region definition associated with a location of the person using the regional feature map. Based on the target region definition associated with the location of the person, a target regional feature map may be generated by sampling the feature map for the image. The system may process the target regional feature map to generate a keypoint mask and an instance segmentation mask.

TECHNICAL FIELD

This disclosure generally relates to computer vision.

BACKGROUND

Machine learning may be used to enable machines to automatically detect and process objects appearing in images. In general, machine learning typically involves processing a training data set in accordance with a machine-learning model and updating the model based on a training algorithm so that it progressively “learns” the features in the data set that are predictive of the desired outputs. One example of a machine-learning model is a neural network, which is a network of interconnected nodes. Groups of nodes may be arranged in layers. The first layer of the network that takes in input data may be referred to as the input layer, and the last layer that outputs data from the network may be referred to as the output layer. There may be any number of internal hidden layers that map the nodes in the input layer to the nodes in the output layer. In a feed-forward neural network, the outputs of the nodes in each layer–with the exception of the output layer–are configured to feed forward into the nodes in the subsequent layer.

Machine-learning models may be trained to recognize object features that have been captured in images. Such models, however, are typically large and require many operations. While large and complex models may perform adequately on high-end computers with fast processors (e.g., multiple central processing units (“CPUs”) and/or graphics processing units (“GPUs”)) and large memories (e.g., random access memory (“RAM”) and/or cache), such models may not be operable on computing devices that have much less capable hardware resources. The problem is exacerbated further by applications that require near real-time results from the model (e.g., 10, 20, or 30 frames per second), such as augmented reality applications that dynamically adjust computer-generated components based on features detected in live video.

SUMMARY OF PARTICULAR EMBODIMENTS

Embodiments described herein relate to machine-learning models and various optimization techniques that enable computing devices with limited system resources (e.g., mobile devices such as smartphones, tablets, and laptops) to recognize objects and features of objects captured in images or videos. To enable computing devices with limited hardware resources (e.g., in terms of processing power and memory size) to perform such tasks and to do so within acceptable time constraints, embodiments described herein provide a compact machine-learning model with an architecture that is optimized for efficiently performing various image-feature recognition tasks. For example, particular embodiments are directed to real-time or near real-time detection, segmentation, and structure mapping of people captured in images or videos (e.g., satisfying a video’s frame rate requirements). These real-time computer vision technologies may be used to enable a variety of mobile applications, such as dynamically replacing a video capture of a person with an avatar, detecting gestures, and performing other dynamic image processing related to particular objects (e.g., persons) appearing in the scene.

The embodiments disclosed herein are only examples, and the scope of this disclosure is not limited to them. Particular embodiments may include all, some, or none of the components, elements, features, functions, operations, or steps of the embodiments disclosed above. Embodiments according to the invention are in particular disclosed in the attached claims directed to a method, a storage medium, a system and a computer program product, wherein any feature mentioned in one claim category, e.g. method, can be claimed in another claim category, e.g. system, as well. The dependencies or references back in the attached claims are chosen for formal reasons only. However, any subject matter resulting from a deliberate reference back to any previous claims (in particular multiple dependencies) can be claimed as well, so that any combination of claims and the features thereof are disclosed and can be claimed regardless of the dependencies chosen in the attached claims. The subject-matter which can be claimed comprises not only the combinations of features as set out in the attached claims but also any other combination of features in the claims, wherein each feature mentioned in the claims can be combined with any other feature or combination of other features in the claims. Furthermore, any of the embodiments and features described or depicted herein can be claimed in a separate claim and/or in any combination with any embodiment or feature described or depicted herein or with any of the features of the attached claims.

In an embodiment according to the invention, a method may comprise, by a computing system: accessing an image; generating a feature map for the image using a first neural network; identifying a plurality of regions of interest in the feature map; generating a plurality of regional feature maps for the plurality of regions of interest, respectively, by sampling the feature map for the image; processing the plurality of regional feature maps using a second neural network to: detect at least one regional feature map in the plurality of regional feature maps that corresponds to a person depicted in the image; and generate a target region definition associated with a location of the person using the regional feature map; generating, based on the target region definition associated with the location of the person, a target regional feature map by sampling the feature map for the image; and generating: a keypoint mask associated with the person by processing the target regional feature map using a third neural network; or an instance segmentation mask associated with the person by processing the target regional feature map using a fourth neural network.

The instance segmentation mask and keypoint mask may be generated concurrently.

The first neural network may comprise four or fewer convolutional layers.

Each of the convolutional layers may use a kernel size of 3×3 or less.

The first neural network may comprise a total of one pooling layer.

The first neural network may comprise three or fewer inception modules.

Each of the inception modules may perform convolutional operations with kernel sizes of 5×5 or less.

Each of the second neural network, third neural network, and fourth neural network may be configured to process an input regional feature map using a total of one inception module.

In an embodiment according to the invention, a system may comprise: one or more processors and one or more computer-readable non-transitory storage media coupled to one or more of the processors, the one or more computer-readable non-transitory storage media comprising instructions operable when executed by one or more of the processors to cause the system to perform operations comprising: accessing an image; generating a feature map for the image using a first neural network; identifying a plurality of regions of interest in the feature map; generating a plurality of regional feature maps for the plurality of regions of interest, respectively, by sampling the feature map for the image; processing the plurality of regional feature maps using a second neural network to: detect at least one regional feature map in the plurality of regional feature maps that corresponds to a person depicted in the image; and generate a target region definition associated with a location of the person using the regional feature map; generating, based on the target region definition associated with the location of the person, a target regional feature map by sampling the feature map for the image; and generating: a keypoint mask associated with the person by processing the target regional feature map using a third neural network; or an instance segmentation mask associated with the person by processing the target regional feature map using a fourth neural network.

The instance segmentation mask and keypoint mask may be generated concurrently.

The first neural network may comprise four or fewer convolutional layers.

Each of the convolutional layers may use a kernel size of 3×3 or less.

The first neural network may comprise a total of one pooling layer.

The first neural network may comprise three or fewer inception modules.

In an embodiment according to the invention, one or more computer-readable non-transitory storage media may embody software that is operable when executed to cause one or more processors to perform operations comprising: accessing an image; generating a feature map for the image using a first neural network; identifying a plurality of regions of interest in the feature map; generating a plurality of regional feature maps for the plurality of regions of interest, respectively, by sampling the feature map for the image; processing the plurality of regional feature maps using a second neural network to: detect at least one regional feature map in the plurality of regional feature maps that corresponds to a person depicted in the image; and generate a target region definition associated with a location of the person using the regional feature map; generating, based on the target region definition associated with the location of the person, a target regional feature map by sampling the feature map for the image; and generating: a keypoint mask associated with the person by processing the target regional feature map using a third neural network; or an instance segmentation mask associated with the person by processing the target regional feature map using a fourth neural network.

The instance segmentation mask and keypoint mask may be generated concurrently.

The first neural network may comprise four or fewer convolutional layers.

Each of the convolutional layers may use a kernel size of 3×3 or less.

The first neural network may comprise a total of one pooling layer.

The first neural network may comprise three or fewer inception modules.

In an embodiment according to the invention, one or more computer-readable non-transitory storage media may embody software that is operable when executed to perform a method according to the invention or any of the above mentioned embodiments.

In an embodiment according to the invention, a system may comprise: one or more processors; and at least one memory coupled to the processors and comprising instructions executable by the processors, the processors operable when executing the instructions to perform a method according to the invention or any of the above mentioned embodiments.

In an embodiment according to the invention, a computer program product, preferably comprising a computer-readable non-transitory storage media, may be operable when executed on a data processing system to perform a method according to the invention or any of the above mentioned embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

FIGS. 1A-1B illustrate examples of images with bounding boxes, segmentation masks, and keypoints.

FIG. 2 illustrates an example architecture of a machine-learning model for predicting bounding boxes, segmentation masks, and keypoints.

FIG. 3 illustrates an example method for detecting objects of interest in an image and generating instance segmentation masks and keypoint masks.

FIG. 4 illustrates an example process for training the machine-learning model in accordance with particular embodiments.

FIG. 5A illustrates an example iterative process for performing convolutions on feature maps.

FIG. 5B illustrates an example process for performing convolutions on tiled feature maps.

FIG. 6 illustrates an example method for optimizing convolutional operations on feature maps of regions of interest.

FIGS. 7A-7C illustrate examples of how components of a low-dimensional representation of a pose may affect characteristics of the pose.

FIG. 8 illustrates an example method for generating a pose prediction.

FIGS. 9A and 9B illustrate an example of how keypoints generated by a machine-learning model may be adjusted using a pose model.

FIG. 10 illustrates an example network environment associated with a social-networking system.

FIG. 11 illustrates an example social graph.

FIG. 12 illustrates an example computer system.

DESCRIPTION OF EXAMPLE EMBODIMENTS

Embodiments described herein relate to machine-learning models and various optimization techniques that enable computing devices with limited system resources (e.g., mobile devices such as smartphones, tablets, and laptops) to recognize objects and features of objects captured in images or videos. To enable computing devices with limited hardware resources (e.g., in terms of processing power and memory size) to perform such tasks and to do so within acceptable time constraints, embodiments described herein provide a compact machine-learning model with an architecture that is optimized for performing various image-processing tasks efficiently. For example, particular embodiments are directed to real-time detection (including classification), segmentation, and structure (e.g., pose) mapping of people captured in images or videos.

FIG. 1A illustrates an example of an image 100 with bounding boxes 110 and segmentation masks 120. In particular embodiments, a machine-learning model is trained to process an image, such as image 100, and detect particular objects of interest in the image. In the example shown, the machine-learning model is trained to recognize features of people. In particular embodiments, the machine-learning model may output a bounding box 110 that surrounds a detected instance of an object type, such as a person. A rectangular bounding box may be represented as four two-dimensional coordinates that indicate the four corners of the box. In particular embodiments, the machine-learning model may additionally or alternatively output a segmentation mask 120 that identifies the particular pixels that belong to the detected instance. For example, the segmentation mask may be represented as a two-dimensional matrix, with each matrix element corresponding to a pixel of the image and the element’s value corresponding to whether the associated pixel belongs to the detected person. Although particular data representations for detected persons and segmentation information are described, this disclosure contemplates any suitable data representations of such information.

FIG. 1B illustrates an example of an image 150 with segmentation masks 160 and structural keypoints 170, which may be used to represent the pose of a detected person. The segmentation mask 160, similar to the mask 120 shown in FIG. 1A, identifies the pixels that belong to a detected person. In particular embodiments, a machine-learning model may additionally or alternatively map keypoints 170 to the detected person’s structure. The keypoints may map to the detected person’s shoulders, elbows, wrists, hands, hips, knees, ankles, feet, neck, jaw bones, or any other joints or structures of interest. In particular embodiments, the machine-learning model may be trained to map 19 keypoints to a person (e.g., neck, upper spinal joint, lower spinal joint, and left and right jaw bones, cheekbones, shoulders, elbows, wrists, hips, knees, and ankles). In particular embodiments, each keypoint may be represented as a two-dimensional coordinate, and the set of keypoints may be represented as an array or vector of coordinates. For example, 19 keypoints may be represented as a vector with 38 entries, such as [x_1, y_1, … x_19, y_19], where each pair (x_i, y_i) represents the coordinate of one keypoint i. The order of each coordinate in the vector may implicitly indicate the keypoint to which the coordinate corresponds. For example, it may be predetermined that (x_1, y_1) corresponds to the left-shoulder keypoint, (x_2, y_2) corresponds to the right-shoulder keypoint, and so on. Although particular data representations for keypoints are described, this disclosure contemplates any suitable data representations of such information.
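To make the flattened representation concrete, the following is a minimal Python sketch of a 38-entry keypoint vector and its implicit ordering. The keypoint names and their order here are illustrative assumptions, not a representation prescribed by this disclosure; the model only relies on some fixed, predetermined ordering.

```python
# Illustrative sketch of the flattened keypoint representation described above.
# The keypoint names and their ordering are assumptions for this example.
KEYPOINT_NAMES = [
    "neck", "upper_spine", "lower_spine",
    "left_jaw", "right_jaw", "left_cheek", "right_cheek",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]  # 19 keypoints

def flatten_keypoints(points):
    """points: list of 19 (x, y) tuples -> flat vector [x_1, y_1, ..., x_19, y_19]."""
    assert len(points) == len(KEYPOINT_NAMES)
    return [coord for (x, y) in points for coord in (x, y)]

def keypoint_by_name(vector, name):
    """Recover the (x, y) coordinate of a named keypoint from the flat vector."""
    i = KEYPOINT_NAMES.index(name)
    return vector[2 * i], vector[2 * i + 1]
```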

FIG. 2 illustrates an example architecture of a machine-learning model 200 according to particular embodiments. The machine-learning model 200 is configured to take as input an image 201 or a preprocessed representation of the image, such as a three-dimensional matrix with dimensions corresponding to the image’s height, width, and color channels (e.g., red, green, and blue). The machine-learning model 200 is configured to extract features of the image 201 and output an object detection indicator 279 (e.g., coordinates of a bounding box surrounding a person), keypoints 289 (e.g., representing the pose of a detected person), and/or segmentation mask 299 (e.g., identifying pixels that correspond to the detected person). The machine-learning model’s 200 architecture is designed to be compact (thereby reducing storage and memory needs) and with reduced complexities (thereby reducing processing needs) so that it may produce sufficiently accurate and fast results on devices with limited resources to meet the demands of real-time applications (e.g., 10, 15, or 30 frames per second). Compared to conventional architectures, such as those based on ResNet or Feature Pyramid Networks (FPN), the architecture of the machine-learning model 200 is much smaller in size and could generate predictions much faster (e.g., roughly 100× faster).

In particular embodiments, the machine-learning model 200 includes several high-level components, including a backbone neural network, also referred to as a trunk 210, a region proposal network (RPN) 220, detection head 230, keypoint head 240, and segmentation head 250. Each of these components may be configured as a neural network. Conceptually, in the architecture shown, the trunk 210 is configured to process an input image 201 and prepare a feature map (e.g., an inception of convolutional outputs) that represents the image 201. The RPN 220 takes the feature map generated by the trunk 210 and outputs N number of proposed regions of interest (RoIs) that may include objects of interest, such as people, cars, or any other types of objects. The detection head 230 may then detect which of the N RoIs are likely to contain the object(s) of interest and output corresponding object detection indicators 279, which may define a smaller region, such as a bounding box, of the image 201 that contains the object of interest. In particular embodiments, a bounding box may be the smallest or near smallest rectangle (or any other geometric shape(s)) that is able to fully contain the pixels of the object of interest. For the RoIs deemed to be sufficiently likely to contain the object of interest, which may be referred to as target region definitions, the keypoint head 240 may determine their respective keypoint mappings 289 and the segmentation head 250 may determine their respective segmentation masks 299. In particular embodiments, the detection head 230, keypoint head 240, and segmentation head 250 may perform their respective operations in parallel. In other embodiments, the detection head 230, keypoint head 240, and segmentation head 250 may not perform their operations in parallel but instead adopt a multi-staged processing approach, which has the advantage of reducing computation and speeding up the overall operation. For example, the keypoint head 240 and segmentation head 250 may wait for the detection head 230 to identify the target region definitions corresponding to RoIs that are likely to contain the object of interest and only process those regions. Since the N number of RoIs initially proposed by the RPN 220 is typically much larger than the number of RoIs deemed sufficiently likely to contain the object of interest (e.g., on the order of 1000-to-1, 100-to-1, etc., depending on the image given), having such an architectural configuration could drastically reduce computations performed by the keypoint head 240 and segmentation head 250, thereby enabling the operation to be performed on devices that lack sufficient hardware resources (e.g., mobile devices).
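The staged flow described above can be summarized with the following Python-style sketch. The callable arguments (trunk, rpn, roi_align, and the head functions) are placeholders standing in for the corresponding components rather than an API disclosed for the model; the sketch only illustrates how the detection head's output gates the work done by the keypoint and segmentation heads.

```python
def run_model(image, trunk, rpn, roi_align,
              detection_head, keypoint_head, segmentation_head):
    """Structural sketch of the staged inference flow; all callables are placeholders."""
    feature_map = trunk(image)                       # shared backbone features
    rois = rpn(feature_map)                          # N candidate regions of interest
    regional_maps = [roi_align(feature_map, r) for r in rois]

    # The detection head keeps only the M target region definitions (refined
    # bounding boxes) likely to contain a person; M is typically far smaller than N.
    target_defs = detection_head(regional_maps, rois)

    # The keypoint and segmentation heads only process those M regions.
    results = []
    for box in target_defs:
        target_map = roi_align(feature_map, box)
        results.append({
            "bbox": box,
            "keypoints": keypoint_head(target_map),
            "segmentation": segmentation_head(target_map),
        })
    return results
```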

FIG. 3 illustrates an example method for detecting objects of interest (e.g., persons) in an image and generating instance segmentation masks and keypoint masks, in accordance with particular embodiments. The method may begin at step 310, where a system performing operations based on a machine-learning model may access an image or a frame of a video (e.g., as captured by a camera of the system, which may be a mobile device).

At step 320, the system may generate a feature map for the image using a trunk 210. In particular embodiments, the trunk 210 may be considered as the backbone neural network that learns to represent images holistically and is used by various downstream network branches that may be independently optimized for different applications/tasks (e.g., the RPN 220, detection head 230, keypoint head 240, and segmentation head 250). Conceptually, the trunk 210 is shared with each of the downstream components (e.g., RPN 220, detection head 230, etc.), which significantly reduces computational cost and resources needed for running the overall model.

The trunk 210 contains multiple convolutional layers and generates deep feature representations of the input image. In particular embodiments, the trunk 210 may have a compact architecture that is much smaller compared to ResNet and/or other similar architectures. In particular embodiments, the trunk 210 may include four (or fewer) convolution layers 211, 212, 213, 214, three (or fewer) inception modules 215, 217, 218, and one pooling layer (e.g., max or average pooling) 216. In particular embodiments, each of the convolutional layers 211, 212, 213, 214 may use a kernel size of 3×3 or less. In particular, each input image to the trunk 210 may undergo, in order, a first convolution layer 211 (e.g., with 3×3 kernel or patch size, stride size of 2, and padding size of 1), a second convolution layer 212 (e.g., with 3×3 kernel or patch size, stride size of 2, and padding size of 2), a third convolution layer 213 (e.g., with 3×3 kernel or patch size and dimensionality reduction), another convolution layer 214 (e.g., with 3×3 kernel or patch size), a first inception module 215, a max or average pooling layer 216 (e.g., with 3×3 patch size and stride 2), a second inception module 217, and a third inception module 218.

In particular embodiments, each of the inception modules 215, 217, 218 may take the result from its previous layer, perform separate convolution operations on it, and concatenate the resulting convolutions. For example, in one inception module, which may include dimension reduction operations, the result from the previous layer may undergo: (1) a 1×1 convolution, (2) a 1×1 convolution followed by a 3×3 convolution, (3) a 1×1 convolution followed by a 5×5 convolution, and/or (4) a 3×3 max pooling operation followed by a 1×1 dimensionality reduction filter. The results of each may then undergo filter concatenation to generate the output of the inception module. In the embodiment described above, the convolutions performed in the inception module use kernel sizes of 5×5 or less; no 7×7 or larger convolution is used in the inception module, which helps reduce the size of the neural net. By limiting the convolutions in the inception modules to 5×5 or less, the resulting convolutions and feature maps would be smaller, which in turn means less computation for the subsequent networks (including the networks associated with the downstream components, such as the RPN 220, detection head 230, etc.). Although no 7×7 convolution is used in this particular embodiment, 7×7 convolutions may be used in other embodiments.
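As an illustration, a PyTorch-style sketch of an inception module limited to 5×5 kernels might look like the following. The per-branch channel counts are assumptions chosen for the example; only the branch structure (1×1, 1×1 then 3×3, 1×1 then 5×5, and 3×3 max pooling then 1×1) and the filter concatenation reflect the description above.

```python
import torch
import torch.nn as nn

class SmallInception(nn.Module):
    """Inception-style module restricted to kernels of 5x5 or less.
    Per-branch channel counts are illustrative assumptions."""
    def __init__(self, in_ch, b1=32, b3=64, b5=32, pool_proj=32):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, b1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, b3 // 2, kernel_size=1),            # dimension reduction
            nn.Conv2d(b3 // 2, b3, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, b5 // 2, kernel_size=1),            # dimension reduction
            nn.Conv2d(b5 // 2, b5, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),           # 1x1 reduction filter
        )

    def forward(self, x):
        # Filter concatenation along the channel dimension.
        return torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)],
            dim=1,
        )
```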

Referring again to FIG. 3, at step 330, the system in accordance with particular embodiments may identify a plurality of RoIs in the feature map. In particular embodiments, the output of the trunk 210 may be provided to the RPN 220, which may be trained to output proposed candidate object bounding boxes or other types of indication of potential RoIs. In particular embodiments, the candidates may have predefined scales and aspect ratios (e.g., anchor points). The N number of proposed regions of interest (RoIs) output by the RPN 220 may be large (e.g., in the thousands or hundreds), as the RoIs may not necessarily be limited to those that relate to the type(s) of object of interest. For example, the RoIs may include regions that correspond to trees, dogs, cars, houses, and people, even though the ultimate object of interest is people. In particular embodiments, the N RoIs from the RPN 220 may be processed by the detection head 230 to detect RoIs that correspond to the object of interest, such as people.

Referring again to FIG. 3, at step 340, the system according to particular embodiments may generate, based on the feature map, a plurality of regional feature maps for the RoIs, respectively. For example, particular embodiments may extract features from the output of the trunk 210 for each RoI, as represented by block 225 in FIG. 2, to generate corresponding regional feature maps (i.e., a regional feature map is a feature map that corresponds to a particular RoI). Conventionally, a technique called RoIPool may be used. RoIPool may first quantize a floating-number RoI to the discrete granularity of the feature map. This quantized RoI may then be subdivided into spatial bins which are themselves quantized. The feature values covered by each bin may then be aggregated (usually by max pooling). Quantization may be performed, e.g., on a continuous coordinate x by computing [x/16], where 16 is a feature map stride and [.] is rounding; likewise, quantization is performed when dividing into bins (e.g., 7×7). In effect, quantizing the sampling region in this manner is conceptually similar to “snapping” the region to a uniform grid segmenting the feature map based on the stride size. For example, if an edge of a RoI is between gridlines, the corresponding edge of the actual region that is sampled may be “snapped” to the closest gridline (by rounding). These quantizations introduce misalignments between the RoI and the extracted features. While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks.

To address this, particular embodiments, referred to as RoIAlign, remove the harsh quantization of RoIPool by properly aligning the extracted features with the input. This may be accomplished by avoiding any quantization of the RoI boundaries or bins (i.e., using x/16 instead of [x/16]). Particular embodiments may use bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average). Through RoIAlign, the system may generate a regional feature map of a predefined dimension for each of the RoIs. Particular embodiments may sample four regular locations in order to evaluate either max or average pooling. In fact, interpolating only a single value at each bin center (without pooling) is nearly as effective. One could also sample more than four locations per bin, but this was found to give diminishing returns.

With RoIAlign, the bilinear interpolation used in the feature pooling 225 process is more accurate but requires more computation. In particular embodiments, the bilinear interpolation process may be optimized by precomputing the bilinear-interpolation weights at each position in the grid across batches.
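A minimal NumPy sketch of this idea is shown below. It samples a single bilinear-interpolated value at each bin center (which, as noted above, is nearly as effective as pooling four samples) and precomputes the neighbor indices and interpolation weights so they can be reused across channels and batches. The stride of 16, the output size, and the helper names are assumptions for the example.

```python
import numpy as np

def precompute_bilinear(roi, out_size, stride=16):
    """Precompute sample positions and bilinear weights for one RoI.

    roi: (x0, y0, x1, y1) in image coordinates (not quantized).
    Returns, for each output bin, the integer neighbor indices and the four
    bilinear weights; these can be reused across channels and batches."""
    x0, y0, x1, y1 = [c / stride for c in roi]        # continuous feature-map coords
    bin_w = (x1 - x0) / out_size
    bin_h = (y1 - y0) / out_size
    samples = []
    for by in range(out_size):
        for bx in range(out_size):
            # Single sample at each bin center (no pooling), as discussed above.
            sx = x0 + (bx + 0.5) * bin_w
            sy = y0 + (by + 0.5) * bin_h
            ix, iy = int(np.floor(sx)), int(np.floor(sy))
            fx, fy = sx - ix, sy - iy
            weights = [(1 - fx) * (1 - fy), fx * (1 - fy), (1 - fx) * fy, fx * fy]
            samples.append(((ix, iy), weights))
    return samples

def roi_align(feature_map, samples, out_size):
    """feature_map: (C, H, W) array. Applies the precomputed samples."""
    C, H, W = feature_map.shape
    out = np.zeros((C, out_size, out_size), dtype=feature_map.dtype)
    for k, ((ix, iy), w) in enumerate(samples):
        by, bx = divmod(k, out_size)
        ix0, iy0 = np.clip(ix, 0, W - 1), np.clip(iy, 0, H - 1)
        ix1, iy1 = np.clip(ix + 1, 0, W - 1), np.clip(iy + 1, 0, H - 1)
        out[:, by, bx] = (w[0] * feature_map[:, iy0, ix0] + w[1] * feature_map[:, iy0, ix1]
                          + w[2] * feature_map[:, iy1, ix0] + w[3] * feature_map[:, iy1, ix1])
    return out
```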

Referring again to FIG. 3, at step 350, the system may process the plurality of regional feature maps (e.g., generated using RoIAlign) using the detection head to detect ones that correspond to objects of interest depicted in the input image and generate corresponding target region definitions (e.g., bounding boxes) associated with locations of the detected objects. For example, after pooling features from the output of the trunk 210 for each RoI, the feature pooling process 225 (e.g., RoIAlign) may pass the results (i.e., regional feature maps of the RoIs) to the detection head 230 so that it may detect which RoIs correspond to the object of interest, such as people. The detection head 230 may be a neural network with a set of convolution, pooling, and fully-connected layers. In particular embodiments, the detection head 230 may take as input the pooled features of each RoI, or its regional feature map, and perform a single inception operation for each regional feature map. For example, each regional feature map may undergo a single inception module transformation, similar to those described above (e.g., concatenating 1×1 convolution, 3×3 convolution, and 5×5 convolution results), to produce a single inception block. In particular embodiments, the inception module may perform convolutional operations using kernel sizes of 5×5 or smaller, which is different from conventional modules where 7×7 convolutional operations are performed. Compared to other ResNet-based models that use multiple inception blocks, configuring the detection head 230 to use a single inception block significantly reduces the machine-learning model’s size and runtime.

In particular embodiments, the detection head 230 may be configured to process the inception block associated with a given RoI and output a bounding box and a probability that represents a likelihood of the RoI corresponding to the object of interest (e.g., corresponding to a person). In particular embodiments, the inception block may first be processed by average pooling, the output of which may be used to generate (1) a bounding-box prediction (e.g., using a fully connected layer) that represents a region definition for the detected object (these bounding-box coordinates may more precisely define the region in which the object appears), (2) a classification (e.g., using a fully connected layer), and/or (3) a probability or confidence score (e.g., using a Softmax function). Based on the classification and/or probability, the detection head 230 may determine which of the RoIs likely correspond to the object of interest. In particular embodiments, all N RoI candidates may be sorted based on the detection classification/probability. The top M RoIs, or their respective region definitions (e.g., which may be refined bounding boxes with updated coordinates that better surround the objects of interest), may be selected based on their respective score/probability of containing the objects of interest (e.g., people). The selected M region definitions may be referred to as target region definitions. In other embodiments, the RoI selection process may use non-maximal suppression (NMS) to help the selection process terminate early. Using NMS, candidate RoIs may be selected while they are being sorted, and once the desired M number of RoIs (or their corresponding region definitions) have been selected, the selection process terminates. This process, therefore, may further reduce runtime.
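The selection step can be sketched as follows: candidates are walked in descending score order, boxes that overlap an already-selected box too much are suppressed, and the loop terminates as soon as M boxes have been kept. The IoU threshold and helper names are assumptions for illustration, not values disclosed by this description.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def select_top_m(boxes, scores, m, iou_thresh=0.5):
    """Walk candidates in descending score order, keep boxes that survive NMS,
    and stop as soon as m boxes have been selected (early termination)."""
    order = np.argsort(scores)[::-1]
    selected = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in selected):
            selected.append(i)
            if len(selected) == m:
                break
    return [boxes[i] for i in selected]
```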

In particular embodiments, once the detection head 230 selects M target region definitions that are likely to correspond to instances of the object of interest (e.g., people), it may pass the corresponding target region definitions (e.g., the refined bounding boxes) to the keypoint head 240 and segmentation head 250 for them to generate keypoint maps 289 and segmentation masks 299, respectively. As previously mentioned, since the M number of region definitions that correspond to people is typically a lot fewer than the N number of initially-proposed RoIs (i.e., M<<N), filtering in this manner prior to having them processed by the keypoint head 240 and segmentation head 250 significantly reduces computation.

In particular embodiments, before processing the M target region definitions using the keypoint head 240 and segmentation head 250, corresponding regional feature maps may be generated (e.g., using RoIAlign) since the M target region definitions may have refined bounding box definitions that differ from the corresponding RoIs. Referring to FIG. 3, at step 360, the system may generate, based on the target region definitions, corresponding target regional feature maps by sampling the feature map for the image. For example, at the feature pooling process 235 shown in FIG. 2, the system may pool features from the feature map output by the trunk 210 for each of the M target region definitions selected by the detection head 230. The feature pooling block 235 may perform operations similar to those of block 225 (e.g., using RoIAlign), generating regional feature maps for the M target region definitions, respectively. In particular embodiments, the bilinear interpolation process may also be optimized by precomputing the bilinear-interpolation weights at each position in the grid across batches.

Referring to FIG. 3, at step 370, the system may then generate a keypoint mask associated with each detected person (or other object of interest) by processing the target regional feature map using a third neural network. For example, in FIG. 2, the feature pooling process 235 may pass the pooled features (the target regional feature maps) to the keypoint head 240 so that it may, for each of the M target region definitions, detect keypoints 289 that map to the structure of the detected instance of the object of interest (e.g., 19 points that map to a person’s joints, head, etc., which may represent the person’s pose). In particular embodiments, the keypoint head 240 may process each input target region definition using a single inception module transformation similar to those described above (e.g., concatenating 1×1 convolution, 3×3 convolution, and 5×5 convolution results) to produce a single inception block. Compared to other ResNet-based models that use multiple inception blocks, configuring the keypoint head 240 to use a single inception block significantly reduces the machine-learning model’s size and runtime. The inception block may then be further processed through the neural network of the keypoint head 240 to generate the keypoint masks.

Particular embodiments may model a keypoint’s location as a one-hot mask, and the keypoint head 240 may be tasked with predicting K masks, one for each of K keypoint types (e.g., left shoulder, right elbow, etc.). For each of the K keypoints of an instance, the training target may be a one-hot m×m binary mask in which a single pixel is labeled as foreground and the rest are labeled as background (in which case the foreground would correspond to the pixel location of the body part, such as the neck joint, corresponding to the keypoint). During training, for each visible ground-truth keypoint, particular embodiments minimize the cross-entropy loss over an m^2-way softmax output (which encourages a single point to be detected). In particular embodiments, the K keypoints may still be treated independently. In particular embodiments, the inception block may be input into a deconvolution layer and 2× bilinear upscaling, producing an output resolution of 56×56. In particular embodiments, a relatively high-resolution output (compared to masks) may be required for keypoint-level localization accuracy. In particular embodiments, the keypoint head 240 may output the coordinates of predicted body parts (e.g., shoulders, knees, ankles, head, etc.) along with a confidence score of the prediction. In particular embodiments, the keypoint head 240 may output respective keypoint masks and/or heat maps for the predetermined body parts (e.g., one keypoint mask and/or heat map for the left knee joint, another keypoint mask and/or heat map for the right knee, and so forth). Each heat map may include a matrix of values corresponding to pixels, with each value in the heat map representing a probability or confidence score that the associated pixel is where the associated body part is located.
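A hedged PyTorch sketch of this keypoint training objective follows: each predicted m×m mask is treated as an m^2-way classification over pixel positions, and keypoints that are not visible are skipped. The argument shapes and the m=56 default are assumptions consistent with the description above.

```python
import torch
import torch.nn.functional as F

def keypoint_loss(logits, gt_xy, visible, m=56):
    """Cross-entropy over an m^2-way softmax, one per keypoint type.

    logits:  (K, m, m) predicted keypoint masks for one instance.
    gt_xy:   (K, 2) ground-truth keypoint locations in mask coordinates (x, y).
    visible: (K,) boolean tensor; only visible keypoints contribute to the loss.
    """
    K = logits.shape[0]
    flat = logits.view(K, m * m)                           # each mask as m^2 classes
    target = gt_xy[:, 1].long() * m + gt_xy[:, 0].long()   # one-hot index = y * m + x
    losses = F.cross_entropy(flat, target, reduction="none")
    return (losses * visible.float()).sum() / visible.float().sum().clamp(min=1)
```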

Referring to FIG. 3, at step 380, the system may additionally or alternatively generate an instance segmentation mask associated with each detected person (or other object of interest) by processing the target regional feature map using a fourth neural network. For example, the feature pooling process 235 shown in FIG. 2 may additionally or alternatively pass the pooled features (i.e., the target regional feature maps) to the segmentation head 250 so that it may, for each of the M RoIs, generate a segmentation mask 299 that identifies which pixels correspond to the detected instance of the object of interest (e.g., a person). In particular embodiments, depending on the needs of an application using the model 200, only the keypoint head 240 or the segmentation head 250 may be invoked. In particular embodiments, the keypoint head 240 and the segmentation head 250 may perform operations concurrently to generate their respective masks. In particular embodiments, the segmentation head may be configured to process each input regional feature map using a single inception module. For example, the pooled features (or regional feature map for RoIAlign) of each of the M region definitions may undergo a single inception module transformation similar to those described above (e.g., concatenating 1×1 convolution, 3×3 convolution, and 5×5 convolution results) to produce a single inception block. Compared to other ResNet-based models that use multiple inception blocks, configuring the segmentation head 250 to use a single inception block significantly reduces the machine-learning model’s size and runtime. The inception block may then be further processed through the neural network of the segmentation head 250 to generate the segmentation mask.

In particular embodiments, a segmentation mask encodes a detected object’s spatial layout. Thus, unlike class labels or box offsets that are inevitably collapsed into short output vectors by fully connected (fc) layers, extracting the spatial structure of masks can be addressed naturally by the pixel-to-pixel correspondence provided by convolutions. Particular embodiments may predict an m×m mask from each RoI using a fully convolutional neural network (FCN). This may allow each layer in the segmentation head 250 to maintain the explicit m×m object spatial layout without collapsing it into a vector representation that lacks spatial dimensions. Unlike previous methods that resort to fc layers for mask prediction, particular embodiments may require fewer parameters and may be more accurate. This pixel-to-pixel behavior may require RoI features, which themselves are small feature maps, to be well aligned to faithfully preserve the explicit per-pixel spatial correspondence. The aforementioned feature pooling process termed RoIAlign (e.g., used in the feature pooling layers 225 and 235) may address this need.

Particular embodiments may repeat one or more steps of the process of FIG. 3, where appropriate. Although this disclosure describes and illustrates particular steps of the process of FIG. 3 as occurring in a particular order, this disclosure contemplates any suitable steps of the process of FIG. 3 occurring in any suitable order. Moreover, although this disclosure describes and illustrates an example method for processing an image for objects of interest, including the particular steps of the process shown in FIG. 3, this disclosure contemplates any suitable process for doing so, including any suitable steps, which may include all, some, or none of the steps of the process shown in FIG. 3, where appropriate. Furthermore, although this disclosure describes and illustrates particular components, devices, or systems carrying out particular steps of the process of FIG. 3, this disclosure contemplates any suitable combination of any suitable components, devices, or systems carrying out any suitable stages of the process of FIG. 3.

FIG. 4 illustrates an example process for training the machine-learning model in accordance with particular embodiments. In particular embodiments, a multi-stage training process may be used to train the machine-learning model, with each stage focusing on training different components of the model. The training process may begin at stage 410, where the trunk model (referenced as Trunk_1 in FIG. 4) is pre-trained to perform a classification task. For example, the trunk may be trained to classify images into any number of categories (e.g., 100, 200 categories). The training dataset may include image samples with labeled/known categories. In particular embodiments, the training process, including the training dataset, may be similar to those used for training ResNet or other similar networks for generating feature map representations of images. This pre-training process helps the trunk model obtain initialization parameters.

At stage 420, a temporary trunk (referenced as Trunk_temp in FIG. 4) and a temporary RPN (referenced as RPN_temp in FIG. 4) may be trained together to generate a temporary functional model for generating RoI candidates, in accordance with particular embodiments. Once trained, Trunk_temp and RPN_temp in particular embodiments are used to assist with the subsequent training process and are not themselves included in the machine-learning model 200. In particular embodiments, the temporary Trunk_temp may be initialized to have the same parameters as those of Trunk_1 from stage 410. Rather than initializing Trunk_1 in stage 410 and using the result to initialize Trunk_temp, one skilled in the art would recognize that the order may be switched (i.e., Trunk_temp may be initialized in stage 410 and the initialized Trunk_temp may be used to initialize Trunk_1). The training dataset at stage 420 may include image samples. Each image sample may have a corresponding ground truth or label, which may include bounding boxes (e.g., represented by anchors) or any other suitable indicators for RoIs that contain foreground/background objects in the image sample. In particular embodiments, the RPN may be trained in the same manner as in Faster R-CNN. For example, the RPN may be trained to generate k anchors (e.g., associated with boxes of predetermined aspect ratios and sizes) for each sampling region and predict a likelihood of each anchor being background or foreground. Once trained, Trunk_temp and RPN_temp would be configured to process a given image and generate candidate RoIs.

In particular embodiments, at stage 430, Trunk_1 and the various downstream heads (e.g., the detection head, keypoint head, and segmentation head), referred to as Heads_1 in FIG. 4, may be trained together. The training dataset for this stage may include image samples, each having ground truths or labels that indicate (1) known bounding boxes (or other indicator types) for object instances of interest (e.g., people) in the image for training the detection head, (2) known keypoints (e.g., represented as one-hot masks) for object instances of interest in the image for training the keypoint head, and (3) known segmentation masks for object instances of interest in the image for training the segmentation head.

In particular embodiments, each training image sample, during training, may be processed using the temporary Trunk_temp and RPN_temp trained in stage 420 to obtain the aforementioned N candidate RoIs. These N RoIs may then be used for training Trunk_1 and the various Heads_1. For example, based on the N RoI candidates, the detection head may be trained to select RoI candidates that are likely to contain the object of interest. For each RoI candidate, the machine-learning algorithm may use a bounding-box regressor to process the feature map associated with the RoI and its corresponding ground truth to learn to generate a refined bounding-box that frames the object of interest (e.g., person). The algorithm may also use a classifier (e.g., foreground/background classifier or object-detection classifier for persons or other objects of interest) to process the feature map associated with the RoI and its corresponding ground truth to learn to predict the object’s class. In particular embodiments, for the segmentation head, a separate neural network may process the feature map associated with each RoI, generate a segmentation mask (e.g., which may be represented as a matrix or grid with binary values that indicate whether a corresponding pixel belongs to a detected instance of the object or not), compare the generated mask with a ground-truth mask (e.g., indicating the true pixels belonging to the object), and use the computed errors to update the network via backpropagation. In particular embodiments, for the keypoint head, another neural network may process the feature map associated with each RoI, generate a one-hot mask for each keypoint of interest (e.g., for the head, feet, hands, etc.), compare the generated masks with corresponding ground-truth masks (e.g., indicating the true locations of the keypoints of interest), and use the computed errors to update the network via backpropagation. In particular embodiments, the different heads may be trained in parallel.

In particular embodiments, at stage 440, after Trunk_1 and the various Heads_1 of the machine-learning model have been trained in stage 430, the RPN_1 of the model may be trained with Trunk_1 being fixed (i.e., the parameters of Trunk_1 would remain as they were after stage 430 and unchanged during this training stage). The training dataset at this stage may again include image samples, each having a corresponding ground truth or label, which may include bounding boxes or any other suitable indicators for RoIs appearing in the image sample. Conceptually, this training stage may refine or tailor the RPN_1 to propose regions that are particularly suitable for human detection.
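"Fixing" a sub-network during a stage is commonly implemented by excluding its parameters from gradient updates. The following is a minimal PyTorch-style sketch of that assumption, not code disclosed by this description.

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Exclude a sub-network's parameters from subsequent gradient updates,
    so later training stages leave it unchanged (e.g., Trunk_1 during stage 440)."""
    for p in module.parameters():
        p.requires_grad_(False)

# Only the still-trainable parameters would then be handed to the optimizer, e.g.:
# optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.01)
```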

At stage 450, once RPN_1 has been trained, the various Heads_1 (e.g., detection head, keypoint head, and segmentation head) may be retrained with both Trunk_1 and RPN_1 fixed, in accordance with particular embodiments (i.e., the parameters of Trunk_1 and RPN_1 would remain as they were after stage 440 and unchanged during this training stage). The training dataset may be similar to the one used in stage 430 (e.g., each training image sample has known ground-truth bounding boxes, keypoints, and segmentation masks). The training process may also be similar to the process described with reference to stage 430, but now Trunk_1 would be fixed and the N candidate RoIs would be generated by the trained (and fixed) Trunk_1 and RPN_1, rather than the temporary Trunk_temp and RPN_temp.

Referring back to FIG. 2, once the machine-learning model 200 has been trained, at inference time it may be given an input image 201 and output corresponding bounding boxes 279, keypoints 289, and segmentation masks 299 for instances of objects of interest appearing in the image 201. In particular embodiments, the trained model 200 may be included in applications and distributed to different devices with different system resources, including those with limited resources, such as mobile devices. Due to the compact model architecture and the various optimization techniques described herein, the model 200 would be capable of performing sufficiently despite the limited system resources to meet the real-time needs of the application, if applicable.

Particular embodiments may repeat one or more stages of the training process of FIG. 4, where appropriate. Although this disclosure describes and illustrates particular stages of the training process of FIG. 4 as occurring in a particular order, this disclosure contemplates any suitable steps of the training process of FIG. 4 occurring in any suitable order. Moreover, although this disclosure describes and illustrates an example method for training the machine-learning model including the particular stages of the process shown in FIG. 4, this disclosure contemplates any suitable process for training the machine-learning model including any suitable stages, which may include all, some, or none of the stages of the process shown in FIG. 4, where appropriate. Furthermore, although this disclosure describes and illustrates particular components, devices, or systems carrying out particular stages of the process of FIG. 4, this disclosure contemplates any suitable combination of any suitable components, devices, or systems carrying out any suitable stages of the process of FIG. 4.

In particular embodiments, at inference time, the trained machine-learning model may be running on a mobile device with limited hardware resources. Compared with a mobile CPU, running operations on a mobile GPU may provide a significant speed-up. Certain mobile platforms may provide third-party GPU processing engines (e.g., Qualcomm® Snapdragon™ Neural Processing Engine (SNPE)) that allow the trained machine-learning model to utilize the various computing capabilities available (e.g., CPU, GPU, DSP). Such third-party processing engines, however, may have certain limitations that would result in suboptimal runtime performance of the machine-learning model.

One issue with third-party processing engines such as SNPE is that they may only support optimized processing for three-dimensional data (hereinafter referred to as 3D tensors). As previously discussed, the RPN is trained to process a given input image and generate N candidate RoIs, and the detection head is trained to select M RoIs. Each of the RoIs may have three-dimensional feature maps (i.e., channel C, height H, and width W). As such, the model needs to process four-dimensional data. Since processing engines such as SNPE only support three-dimensional convolution processing, one way to process the feature maps is to process the feature maps of the N RoIs (or M, depending on the stage of the inference process) iteratively, such as using a FOR-loop. The FOR-loop for performing sequential convolutions, for instance, may be as follows: for i in range(0, N): conv(B[i, C, W, H], K); where B[i, C, W, H] represents the feature maps of the i-th RoI in the N RoIs and K represents the kernel or patch used in the convolution. FIG. 5A illustrates this process by showing three RoI feature maps 501, 502, 503 undergoing convolution operations with three identical kernel instances 511, 512, 513, respectively. In this case, the FOR-loop and the tensor splitting may be performed using the device’s CPU, and the convolution is performed using the device’s GPU. So in each iteration, data has to be copied between CPU memory and GPU memory, which creates significant overhead. Further, this iterative process (which also means launching the SNPE environment repeatedly), coupled with the relatively small feature maps of the RoIs, is not an efficient use of SNPE’s parallel processing features (SNPE or other third-party processing engines may have optimization features for larger feature maps).
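The iterative baseline can be sketched in NumPy/SciPy as follows, with each loop iteration standing in for a separate engine launch (and the associated CPU-GPU copies) on one small per-RoI feature map. The per-channel 2D convolution here is a simplification for illustration, not the engine's actual kernel.

```python
import numpy as np
from scipy.ndimage import convolve

def convolve_rois_iteratively(roi_maps, kernel):
    """Baseline approach sketched above: convolve each RoI feature map separately.

    roi_maps: (N, C, H, W) array of regional feature maps.
    kernel:   (kh, kw) spatial kernel, applied per channel for simplicity.
    Each pass through the outer loop corresponds to one separate dispatch of the
    3D-only processing engine when driven from a host-side FOR-loop."""
    N, C, H, W = roi_maps.shape
    out = np.empty_like(roi_maps)
    for i in range(N):                       # one engine launch per RoI
        for c in range(C):
            out[i, c] = convolve(roi_maps[i, c], kernel, mode="constant")
    return out
```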

To avoid the aforementioned issues and improve performance, particular embodiments utilize a technique that transforms the 4D tensor (i.e., the feature maps of the N or M RoIs) into a single 3D tensor so that SNPE may be used to perform a single optimized convolution operation on all the feature maps. FIG. 5B illustrates an example where the N=3 three-dimensional feature maps of the RoIs 501, 502, 503 are tiled together to form one large 3D tensor 550. In particular embodiments, padding data may be inserted between adjacent feature maps to prevent incorrect sampling (e.g., preventing, during convolution, the kernel being applied to two different but neighboring RoI feature maps). As illustrated in FIG. 5B, padding 571 may be inserted between feature map 501 and feature map 502, and padding 572 may be inserted between feature map 502 and feature map 503. In particular embodiments, the padding size may depend on the kernel 560 size. For example, the padding size may be equal to or larger than a dimension of the kernel 560 (e.g., if the kernel is 3×3, the padding may be 3 or more). In this particular tiling scenario, the resulting 3D tensor 550 may have the dimension C×H×(W*N+m*(N-1)), where m represents the width of the padding data. Thus, in this case, the C and H dimensions of the tensor 550 may remain the same relative to those of each of the regional feature maps 501-503, and the W dimension of tensor 550 may be greater than the combined W dimensions of the regional feature maps 501-503.
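The tiling itself can be sketched as follows: N regional feature maps of shape (C, H, W) are laid side by side along the width with zero padding of width pad between neighbors, yielding a single 3D tensor of shape C×H×(W*N + pad*(N-1)) that a single convolution call can process, and the per-RoI results can then be sliced back out. The helper names are assumptions for illustration.

```python
import numpy as np

def tile_rois_into_3d(roi_maps, pad):
    """Tile N regional feature maps (N, C, H, W) into one 3D tensor of shape
    (C, H, W*N + pad*(N-1)), inserting zero padding between neighbors so a
    kernel never straddles two different RoIs. `pad` should be at least the
    kernel's spatial extent, as discussed above."""
    N, C, H, W = roi_maps.shape
    total_w = W * N + pad * (N - 1)
    tiled = np.zeros((C, H, total_w), dtype=roi_maps.dtype)
    for i in range(N):
        start = i * (W + pad)
        tiled[:, :, start:start + W] = roi_maps[i]
    return tiled

def split_tiled_result(result, N, W, pad):
    """Recover the per-RoI slices after a single convolution over the tiled tensor
    (assumes a 'same'-size convolution so spatial dimensions are preserved)."""
    return [result[:, :, i * (W + pad): i * (W + pad) + W] for i in range(N)]
```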

Particular manners of tiling may be more efficient than others, depending on how the combined tensor 550 is to be processed. For example, if the operation to be performed is convolution (e.g., in the initial inception module of each of the heads 230, 240, or 250), the feature maps may be tiled in a certain dimension to improve subsequent convolution efficiency (e.g., by improving cache-access efficiency and reducing cache misses). The regional feature map of an RoI can usually be thought of as being three-dimensional, with a height size (H), a width size (W), and a channel size (C). Since the RPN 220 outputs N RoIs, the dimensionality of the N RoIs would be four-dimensional (i.e., H, W, C, and N). The corresponding data representation may be referred to as 4D tensors. In particular embodiments, the 4D tensors may be stored in a data structure that is organized as NCHW (i.e., the data is stored in cache-first order, or in the order of batch, channel, height, and width). This manner of data storage may provide the detection head with efficient cache access when performing convolution. Similarly, when the segmentation head and/or keypoint head performs convolutions on the regional feature maps of the M region definitions from the detection head, the data may be stored in MCHW order. However, when it comes to the aforementioned feature pooling 225/235 process (e.g., RoIAlign), cache access is more efficient in NHWC or MHWC order, because it can reduce cache misses and utilize SIMD (single instruction, multiple data). Thus, in particular embodiments, the feature pooling 225 or 235 process may include a step that organizes or transforms a 4D tensor into NHWC format. This order switching could speed up the feature pooling 225 process significantly.
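For example, the layout change can be expressed as a simple transpose plus a copy into contiguous memory, as in the NumPy sketch below; this illustrates the idea rather than any engine-specific implementation.

```python
import numpy as np

def nchw_to_nhwc(x):
    """Reorder a 4D tensor from (N, C, H, W) to (N, H, W, C). With channels
    contiguous in memory, the RoI pooling step can read all channels of one
    spatial sample as a single contiguous run (fewer cache misses, SIMD-friendly)."""
    return np.ascontiguousarray(np.transpose(x, (0, 2, 3, 1)))

def nhwc_to_nchw(x):
    """Reorder back to (N, C, H, W) before the convolution-heavy heads."""
    return np.ascontiguousarray(np.transpose(x, (0, 3, 1, 2)))
```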

FIG. 5B illustrates an example where the feature maps of the RoIs are tiled together in a row to form a single long 3D tensor 550. In other words, only a single dimension is being expanded. However, the feature maps may also be tiled in other configurations, so that two or three dimensions are expanded. To illustrate, in one example scenario, there may be N=12 feature maps. If no padding is inserted, arranging the feature maps in a row may yield a 3D tensor with the dimensions C*1×H*1×W*12. If the feature maps are arranged in a manner that expands two dimensions, the resulting 3D tensor may have the dimensions C*1×H*4×W*3. If the feature maps are arranged in a manner that expands three dimensions, the resulting 3D tensor may have the dimensions C*2×H*3×W*2.

In particular embodiments, it may be more desirable to generate a large 3D tensor with a larger aspect ratio, such as expanding in only one dimension (i.e., in a row), in order to minimize padding (which in turn minimizes the size of the resulting 3D tensor). Since padding is added between adjacent feature maps, minimizing the surface area of feature maps that are adjacent to other feature maps would result in a reduced need for padding. To illustrate, if N=4, tiling the four feature map tiles in a row may need 3*m padding (i.e., one between the first and second tiles, one between the second and third tiles, and one between the third and fourth tiles). However, if the four feature map tiles are tiled in a 2×2 configuration, the number of paddings needed would be 4*m (i.e., one between the top-left tile and the top-right tile, one between the top-right tile and the bottom-right tile, one between the bottom-right tile and the bottom-left tile, and one between the bottom-left tile and the top-left tile). Thus, in particular embodiments, additional optimization may be gained by arranging the feature maps in a row (i.e., expanding in one dimension only).

FIG. 6 illustrates an example method for optimizing convolutional operations on feature maps of RoIs. The method may begin at step 610, where a computing system (e.g., a mobile device, laptop, or any other device used at inference time) may access an image of interest. The image, for example, may be a still image posted on a social network or a frame in a live video (e.g., captured in an augmented reality or virtual reality application). In particular embodiments, the system may need to obtain inference results in real-time or near real-time. For example, an augmented reality application or autonomous vehicle may need to determine the instance segmentation masks of people or vehicles captured in an image/video. The optimizations of the machine-learning model described herein enable computing devices, even those with relatively limited hardware resources (e.g., mobile phones), to generate results quickly to meet application requirements.

At step 620, the system may generate a feature map that represents the image. In particular embodiments, the system may use the backbone neural network, such as the trunk 210 described herein, to generate the feature map. While other types of backbones (e.g., ResNet, Feature Pyramid Network, etc.) may alternatively be used, embodiments of the trunk 210 provide the advantage of, e.g., not requiring significant hardware resources (e.g., CPU, GPU, cache, memory, etc.) to generate feature maps within stringent timing constraints. Embodiments of the trunk enable applications running on mobile platforms, for example, to take advantage of real-time or near real-time instance detection, classification, segmentation, and/or keypoint generation.

At step 630, the system may identify regions of interest (RoIs) in the feature map. In particular embodiments, the RoIs may be identified by a region proposal network (RPN), as described herein.

At step 640, the system may generate regional feature maps for the RoIs, respectively. For example, the system may use sampling methods such as RoIPool or RoIAlign to sample an RoI and generate a representative regional feature map. Each of the M regional feature maps generated may have three dimensions (e.g., corresponding to the regional feature map’s height, width, and channels). In particular embodiments, the regional feature maps may have equal dimensions (e.g., same height, same width, and same channels).
