Facebook Patent | Detecting Augmented-Reality Targets

Patent: Detecting Augmented-Reality Targets

Publication Number: 20200143238

Publication Date: 2020-05-07

Applicants: Facebook

Abstract

In one embodiment, a method includes receiving, from a client computing device, deep-learning (DL)-feature representations and local-feature descriptors, wherein the DL-feature representations and the local-feature descriptors are extracted from an image that includes a first depiction of a real-world object; identifying a set of potentially matching DL-feature representations based on a comparison of the received DL-feature representations with stored DL-feature representations associated with a plurality of augmented-reality (AR) targets; determining, from a set of potentially matching AR targets associated with the set of potentially matching DL-feature representations, a matching AR target based on a comparison of the received local-feature descriptors with stored local-feature descriptors associated with the set of potentially matching AR targets, wherein the stored local-feature descriptors are extracted from the set of potentially matching AR targets; and sending, to the client computing device, information configured to render an AR effect associated with the determined matching AR target.

TECHNICAL FIELD

[0001] This disclosure generally relates to augmented reality environments.

BACKGROUND

[0002] Artificial reality is a form of reality that has been adjusted in some manner before presentation to a user, which may include, e.g., a virtual reality (VR), an augmented reality (AR), a mixed reality (MR), a hybrid reality, or some combination and/or derivatives thereof. Artificial reality content may include completely generated content or generated content combined with captured content (e.g., real-world photographs). The artificial reality content may include video, audio, haptic feedback, or some combination thereof, any of which may be presented in a single channel or in multiple channels (such as stereo video that produces a three-dimensional effect to the viewer). Additionally, in some embodiments, artificial reality may be associated with applications, products, accessories, services, or some combination thereof, that are, e.g., used to create content in an artificial reality and/or used in (e.g., perform activities in) an artificial reality. The artificial reality system that provides the artificial reality content may be implemented on various platforms, including a head-mounted display (HMD) connected to a host computer system, a standalone HMD, a mobile device or computing system, or any other hardware platform capable of providing artificial reality content to one or more viewers.

[0003] AR effects are computer-generated visual effects (e.g., images and animation) that are superimposed or integrated into a user’s view of a real-world scene. Certain AR effects may be configured to track objects in the real world. For example, a computer-generated unicorn may be placed on a real-world table as captured in a video. As the table moves in the captured video (e.g., due to the camera moving or the table being carried away), the generated unicorn may follow the table so that it continues to appear on top of the table. To achieve this effect, an AR application may use tracking algorithms to track the positions and/or orientations of objects appearing in the real-world scene and use the resulting tracking data to generate the appropriate AR effect. Since AR effects may augment the real-world scene in real-time or near real-time while the scene is being observed, tracking data may need to be generated in real-time or near real-time so that the AR effect appears as desired.

SUMMARY OF PARTICULAR EMBODIMENTS

[0004] In particular embodiments, a computing device may access an image of a real-world environment that may contain one or more depictions of real-world objects that are intended targets for AR effects. For example, this computing device may be a client computing device such as a smartphone that is viewing a real-world environment that includes a Cowboy RoboNinja movie poster for which an AR effect exists (e.g., on a server database). An initial feature map may be generated for the image using a first machine learning model. A separate machine learning model may then be used to identify likely regions of interest within the feature map and the corresponding image. Building on the previous example, if posters are targets of interest for an AR effect, a machine learning model residing on the smartphone may identify a region within the feature map corresponding to the portion of the image that includes the Cowboy RoboNinja movie poster as being a likely region of interest. The likely regions of interest in the feature map may then be sent to a second machine learning model that may be used to extract deep-learning (DL)-feature representations of the likely regions of interest. Local-feature descriptors may also be extracted from each of the likely regions of interest in the image. These local-feature descriptors may correspond to points of interest within a likely region of interest, and may be generated based on spatially bounded patches within the likely region of interest. For each likely region of interest, the extracted DL-feature representation may be compared against stored DL-feature representations (e.g., on a DL-feature database) to identify a set of potentially matching DL-feature representations. Each of these potentially matching DL-feature representations may be associated with an AR target of an AR effect.
Building on the previous example, a DL-feature representation of the image portion including the Cowboy RoboNinja movie poster may be sent from the smartphone to a server that includes the stored DL-feature representations (e.g., a DL-feature database), and a comparison at the server may yield a set of 20 potentially matching DL-feature representations, each of which is associated with a potentially matching AR target corresponding to a movie poster. For each likely region of interest, the extracted local-feature descriptors may be compared against local-feature descriptors, which may be stored within a local-feature database, associated with the set of potentially matching DL-feature representations. The comparison of local-feature descriptors may further narrow the potentially matching DL-feature representations, for example, to a single matching DL-feature representation that may correspond to a single matching AR target. Building on the previous example, the comparison of local-feature descriptors may narrow the 20 potentially matching AR targets down to a single matching AR target: the Cowboy RoboNinja movie poster. An AR effect associated with the matching AR target may then be caused to be rendered. As an example and not by way of limitation, a user viewing the Cowboy RoboNinja movie poster on a display of a smartphone may see on the display, near the poster, an avatar of Cowboy RoboNinja.
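The coarse DL-feature comparison described above can be sketched as a nearest-neighbor search over stored embeddings. This is a minimal illustration only: the embedding dimensionality, the use of cosine similarity, and the value of k (20, matching the example) are assumptions, since the disclosure does not specify a particular metric or index structure.

```python
import numpy as np

def top_k_dl_matches(query_embedding, stored_embeddings, k=20):
    """Return indices of the k stored DL-feature representations most
    similar to the query embedding, ranked by cosine similarity."""
    q = query_embedding / np.linalg.norm(query_embedding)
    s = stored_embeddings / np.linalg.norm(stored_embeddings, axis=1, keepdims=True)
    sims = s @ q                      # one dot product per stored AR target
    return np.argsort(-sims)[:k]      # indices of the k best candidates

# Toy usage: 1,000 stored targets with 128-dimensional embeddings; the
# query is a slightly perturbed copy of target 42, standing in for a
# DL-feature representation extracted from a region of interest.
rng = np.random.default_rng(0)
stored = rng.normal(size=(1000, 128))
query = stored[42] + 0.01 * rng.normal(size=128)
candidates = top_k_dl_matches(query, stored, k=20)
```

In a deployment like the one described, the `stored` matrix would correspond to the server-side DL-feature database, and only the 20 candidate targets would proceed to the more expensive local-feature comparison.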

[0005] By (1) comparing the extracted DL-feature representations against stored DL-feature representations (e.g., in a DL-feature database) and by (2) further comparing the extracted local-feature descriptors against stored local-feature descriptors (e.g., in a local-feature database), the accuracy and performance of the matching process may be vastly improved. The two-stage approach also reduces the expenditure of computational resources, which may be particularly important for image comparison, a computationally challenging task. The DL-feature comparison may be efficient at quickly narrowing down a large number of AR targets to a small subset of potentially matching AR targets. As an example and not by way of limitation, out of 12 million AR targets on the DL-feature database, 20 AR targets may be identified as potential candidates because of matching DL-feature representations. However, the DL-feature comparison process may, in some cases, lead to ambiguities. For example, an extracted DL-feature may have slight variations that may not be adequately accounted for in the stored DL-feature representations. As another example, the associated machine learning model may be trained to detect objects of one or more real-world object types (e.g., the object-type "posters" associated with the Cowboy RoboNinja poster in the image) and not specific AR targets (e.g., one that is associated with the particular Cowboy RoboNinja poster), and so the DL-feature representations may not be able to disambiguate different objects of an object-type (e.g., posters) with sufficient precision. In these cases, the local-feature comparison process may be able to disambiguate the potentially matching AR targets to accurately identify a matching AR target. While the local-feature comparison process may be accurate, it may be time-consuming and computationally challenging to perform on a large set. By narrowing the candidates down to a small subset (e.g., 20 AR targets), both time and computational resources may be conserved in finding a matching AR target using local-feature comparisons.
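The disambiguating local-feature stage can be sketched as descriptor matching over only the small candidate set. The descriptor dimensionality, the Lowe-style ratio test, and the pick-the-most-matches decision rule below are illustrative assumptions, not details taken from the disclosure.

```python
import numpy as np

def count_matches(query_desc, target_desc, ratio=0.8):
    """Count query descriptors whose nearest target descriptor is much
    closer than the second-nearest one (a ratio test), indicating a
    distinctive point-of-interest correspondence."""
    # Pairwise Euclidean distances between the two descriptor sets.
    d = np.linalg.norm(query_desc[:, None, :] - target_desc[None, :, :], axis=2)
    d.sort(axis=1)
    return int(np.sum(d[:, 0] < ratio * d[:, 1]))

def best_candidate(query_desc, candidate_descs):
    """Among the candidate AR targets, pick the index of the one with
    the most ratio-test descriptor matches."""
    counts = [count_matches(query_desc, t) for t in candidate_descs]
    return int(np.argmax(counts))

# Toy usage: 20 candidate targets, each with 50 local-feature descriptors
# of 32 dimensions; the query is a noisy view of candidate 7.
rng = np.random.default_rng(1)
candidates = [rng.normal(size=(50, 32)) for _ in range(20)]
query = candidates[7] + 0.01 * rng.normal(size=(50, 32))
match = best_candidate(query, candidates)
```

Because this pairwise comparison is quadratic in the number of descriptors per target, running it on 20 candidates rather than millions of stored targets is what keeps the second stage tractable.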

[0006] Embodiments of the invention may include or be implemented in conjunction with an artificial reality system. The embodiments disclosed herein are only examples, and the scope of this disclosure is not limited to them. Particular embodiments may include all, some, or none of the components, elements, features, functions, operations, or steps of the embodiments disclosed herein. Embodiments according to the invention are in particular disclosed in the attached claims directed to a method, a storage medium, a system and a computer program product, wherein any feature mentioned in one claim category, e.g. method, can be claimed in another claim category, e.g. system, as well. The dependencies or references back in the attached claims are chosen for formal reasons only. However any subject matter resulting from a deliberate reference back to any previous claims (in particular multiple dependencies) can be claimed as well, so that any combination of claims and the features thereof are disclosed and can be claimed regardless of the dependencies chosen in the attached claims. The subject-matter which can be claimed comprises not only the combinations of features as set out in the attached claims but also any other combination of features in the claims, wherein each feature mentioned in the claims can be combined with any other feature or combination of other features in the claims. Furthermore, any of the embodiments and features described or depicted herein can be claimed in a separate claim and/or in any combination with any embodiment or feature described or depicted herein or with any of the features of the attached claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] FIG. 1 illustrates an example of a real-world environment being viewed through the display of a client device.

[0008] FIG. 2 illustrates an example deep-learning (DL) feature extraction process.

[0009] FIG. 3 illustrates an example of an image including a region of interest.

[0010] FIG. 4 illustrates an example local-feature extraction process.

[0011] FIG. 5 illustrates an example of a feature-matching method.

[0012] FIG. 6 illustrates an example method for identifying a matching AR target in a real-world environment and rendering an associated AR effect.

[0013] FIG. 7 illustrates an example network environment associated with a social-networking system.

[0014] FIG. 8 illustrates an example social graph.

[0015] FIG. 9 illustrates an example computer system.

DESCRIPTION OF EXAMPLE EMBODIMENTS

[0016] Embodiments of the invention may include or be implemented in conjunction with an artificial reality system. In particular embodiments, a computing device may access an image of a real-world environment that may contain one or more depictions of real-world objects that are intended targets for AR effects. For example, this computing device may be a client computing device such as a smartphone that is viewing a real-world environment that includes a Cowboy RoboNinja movie poster for which an AR effect exists (e.g., on a server database). An initial feature map may be generated for the image using a first machine learning model. A separate machine learning model may then be used to identify likely regions of interest within the feature map and the corresponding image. Building on the previous example, if posters are targets of interest for an AR effect, a machine learning model residing on the smartphone may identify a region within the feature map corresponding to the portion of the image that includes the Cowboy RoboNinja movie poster as being a likely region of interest. The likely regions of interest in the feature map may then be sent to a second machine learning model that may be used to extract deep-learning (DL)-feature representations of the likely regions of interest. Local-feature descriptors may also be extracted from each of the likely regions of interest in the image. These local-feature descriptors may correspond to points of interest within a likely region of interest, and may be generated based on spatially bounded patches within the likely region of interest. For each likely region of interest, the extracted DL-feature representation may be compared against a DL-feature database to identify a set of potentially matching DL-feature representations.
Each of these potentially matching DL-feature representations may be associated with an AR target of an AR effect. Building on the previous example, a DL-feature representation of the image portion including the Cowboy RoboNinja movie poster may be sent from the smartphone to a server that includes the database of DL-feature representations, and a comparison at the server may yield a set of 20 potentially matching DL-feature representations, each of which is associated with a potentially matching AR target corresponding to a movie poster. For each likely region of interest, the extracted local-feature descriptors may be compared against local-feature descriptors, which may be stored within a local-feature database, associated with the set of potentially matching DL-feature representations. The comparison of local-feature descriptors may further narrow the potentially matching DL-feature representations, for example, to a single matching DL-feature representation that may correspond to a single matching AR target. Building on the previous example, the comparison of local-feature descriptors may narrow the 20 potentially matching AR targets down to a single matching AR target: the Cowboy RoboNinja movie poster. An AR effect associated with the matching AR target may then be caused to be rendered. As an example and not by way of limitation, a user viewing the Cowboy RoboNinja movie poster on a display of a smartphone may see on the display, near the poster, an avatar of Cowboy RoboNinja.
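The spatially bounded patches mentioned above suggest one simple way local-feature descriptors might be computed around points of interest. The keypoint locations, the 8x8 patch size, and the mean-subtraction/contrast-normalization scheme below are assumptions for illustration; the disclosure does not commit to a particular descriptor.

```python
import numpy as np

def patch_descriptor(image, y, x, size=8):
    """Extract a fixed-size patch centered on a point of interest and
    flatten it into a contrast-normalized descriptor vector."""
    half = size // 2
    patch = image[y - half:y + half, x - half:x + half].astype(float)
    patch -= patch.mean()              # invariance to brightness shifts
    norm = np.linalg.norm(patch)
    if norm > 0:
        patch /= norm                  # invariance to contrast scaling
    return patch.ravel()               # descriptor of length size * size

# Toy usage: descriptors at two points of interest inside a synthetic
# 32x32 region of interest (standing in for a cropped image region).
region = np.arange(32 * 32).reshape(32, 32) % 17
d1 = patch_descriptor(region, 16, 16)
d2 = patch_descriptor(region, 10, 20)
```

Descriptors built this way stay local to their patch, which is what lets the fine-grained stage distinguish two posters that a whole-region DL embedding might conflate.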

[0017] By (1) comparing the extracted DL-feature representations against stored DL-feature representations in the DL-feature database and by (2) further comparing the extracted local-feature descriptors against the local-feature database, the accuracy and performance of the matching process may be vastly improved. The two-stage approach also reduces the expenditure of computational resources, which may be particularly important for image comparison, a computationally challenging task. The DL-feature comparison may be efficient at quickly narrowing down a large number of AR targets to a small subset of potentially matching AR targets. As an example and not by way of limitation, out of 12 million AR targets on the DL-feature database, 20 AR targets may be identified as potential candidates because of matching DL-feature representations. However, the DL-feature comparison process may, in some cases, lead to ambiguities. For example, an extracted DL-feature may have slight variations that may not be adequately accounted for in the stored DL-feature representations. As another example, the associated machine learning model may be trained to detect objects of one or more real-world object types (e.g., the object-type "posters" associated with the Cowboy RoboNinja poster in the image) and not specific AR targets (e.g., one that is associated with the particular Cowboy RoboNinja poster), and so the DL-feature representations may not be able to disambiguate different objects of an object-type (e.g., posters) with sufficient precision. In these cases, the local-feature comparison process may be able to disambiguate the potentially matching AR targets to accurately identify a matching AR target. While the local-feature comparison process may be accurate, it may be time-consuming and computationally challenging to perform on a large set. By narrowing the candidates down to a small subset (e.g., 20 AR targets), both time and computational resources may be conserved in finding a matching AR target using local-feature comparisons.
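The cost argument above can be seen end to end in a miniature run of the two-stage search: a cheap embedding comparison over all stored targets, followed by an expensive descriptor comparison over only the surviving candidates. The target count, candidate count, and both similarity measures below are illustrative stand-ins for choices the disclosure leaves open.

```python
import numpy as np

rng = np.random.default_rng(2)
N_TARGETS, K = 1000, 20
embeddings = rng.normal(size=(N_TARGETS, 64))       # stored DL features
descriptors = rng.normal(size=(N_TARGETS, 30, 16))  # stored local features

true_id = 123
q_emb = embeddings[true_id] + 0.01 * rng.normal(size=64)
q_desc = descriptors[true_id] + 0.01 * rng.normal(size=(30, 16))

# Stage 1: one dot product per target narrows 1,000 targets to 20.
sims = (embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)) @ (
    q_emb / np.linalg.norm(q_emb))
candidate_ids = np.argsort(-sims)[:K]

# Stage 2: quadratic-cost descriptor matching runs on only 20 candidates.
def match_score(a, b):
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    d.sort(axis=1)
    return int(np.sum(d[:, 0] < 0.8 * d[:, 1]))  # ratio-test matches

scores = [match_score(q_desc, descriptors[c]) for c in candidate_ids]
match = int(candidate_ids[int(np.argmax(scores))])
```

Scaled up, stage 1 plays the role of the DL-feature database lookup over millions of AR targets, while stage 2 performs the accurate but costly local-feature disambiguation only on the short candidate list.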
