Patent: Computer vision based extraction and overlay for instructional augmented reality
Publication Number: 20210345016
Publication Date: 2021-11-04
Applicant: Google
Abstract
Systems and methods are described that utilize one or more processors to obtain a plurality of segments of a first media content item, extract, from a first segment in the plurality of segments, a plurality of image frames associated with a plurality of tracked movements of at least one object represented in the extracted image frames, and compare objects represented in the image frames extracted from the first segment to tracked objects in a second media content item. In response to detecting that at least one of the tracked objects is similar to at least one object in the plurality of extracted image frames, the systems and methods may generate virtual content depicting the plurality of tracked movements from the first segment being performed on the at least one tracked object in the second media content item and trigger rendering of the virtual content as an overlay on the at least one tracked object.
Claims
-
A computer-implemented method carried out by at least one processor, the method comprising: obtaining a plurality of segments of a first media content item; extracting, from a first segment in the plurality of segments, a plurality of image frames, the plurality of image frames being associated with a plurality of tracked movements of at least one object represented in the extracted image frames; comparing objects represented in the image frames extracted from the first segment to tracked objects in a second media content item; in response to detecting that at least one of the tracked objects in the second media content item is similar to at least one object in the plurality of extracted image frames, generating, based on the extracted plurality of image frames, virtual content depicting the plurality of tracked movements from the first segment being performed on the at least one tracked object in the second media content item; and triggering rendering of the virtual content as an overlay on the at least one tracked object in the second media content item.
-
The method of claim 1, further comprising: extracting, from the plurality of segments, a second segment from the first media content item, the second segment having a timestamp after the first segment; and generating, using the extracted at least one image frame from the second segment of the first media content item, virtual content that depicts the at least one image frame from the second segment on the at least one tracked object in the second media content item.
-
The method of claim 2, wherein the at least one image frame from the second segment depicts a visual result associated with the at least one object in the extracted image frames.
-
The method of claim 1, wherein a computer vision system is employed by the at least one processor to: analyze the first media content item to determine which of the plurality of segments to extract and which of the plurality of image frames to extract; and analyze the second media content item to determine which object corresponds to the at least one object in the plurality of extracted image frames of the first media content item.
-
The method of claim 1, wherein the detecting that the at least one tracked object in the second media content item is similar to the at least one object in the plurality of extracted image frames includes comparing a shape of the at least one tracked object to the shape of the at least one object in the plurality of extracted image frames, and wherein the generated virtual content is depicted on the at least one tracked object according to the shape of the at least one object in the plurality of extracted image frames.
-
The method of claim 1, wherein triggering rendering of the virtual content as an overlay on the at least one tracked object in the second media content item includes synchronizing the rendering of the virtual content on the second media content item with a timestamp associated with the first segment.
-
The method of claim 1, wherein: the plurality of tracked movements correspond to instructional content in the first media content item; and the plurality of tracked movements are depicted as the virtual content, the virtual content illustrating performance of the plurality of tracked movements on the at least one object in the plurality of extracted image frames in the second media content item.
-
A system comprising: an image capture device associated with a computing device; at least one processor; and memory storing instructions that, when executed by the at least one processor, cause the system to: obtain a plurality of segments of a first media content item; extract, from a first segment in the plurality of segments, a plurality of image frames, the plurality of image frames being associated with a plurality of tracked movements of at least one object represented in the extracted image frames; compare objects represented in the image frames extracted from the first segment to tracked objects in a second media content item; in response to detecting that at least one of the tracked objects in the second media content item is similar to at least one object in the plurality of extracted image frames, generate, based on the extracted plurality of image frames, virtual content depicting the plurality of tracked movements from the first segment being performed on the at least one tracked object in the second media content item; and trigger rendering of the virtual content as an overlay on the at least one tracked object in the second media content item.
-
The system of claim 8, further comprising: extracting, from the plurality of segments, a second segment from the first media content item, the second segment having a timestamp after the first segment; and generating, using the extracted at least one image frame from the second segment of the first media content item, virtual content that depicts the at least one image frame from the second segment on the at least one tracked object in the second media content item.
-
The system of claim 9, wherein the at least one image frame from the second segment depicts a visual result associated with the at least one object in the extracted image frames.
-
The system of claim 8, wherein the system further includes a computer vision system employed by the at least one processor to: analyze the first media content item to determine which of the plurality of segments to extract and which of the plurality of image frames to extract; and analyze the second media content item to determine which object corresponds to the at least one object in the plurality of extracted image frames of the first media content item.
-
The system of claim 8, wherein the detecting that the at least one tracked object in the second media content item is similar to the at least one object in the plurality of extracted image frames includes comparing a shape of the at least one tracked object to the shape of the at least one object in the plurality of extracted image frames, and wherein the generated virtual content is depicted on the at least one tracked object according to the shape of the at least one object in the plurality of extracted image frames.
-
The system of claim 8, wherein triggering rendering of the virtual content as an overlay on the at least one tracked object in the second media content item includes synchronizing the rendering of the virtual content on the second media content item with a timestamp associated with the first segment.
-
The system of claim 8, wherein: the plurality of tracked movements correspond to instructional content in the first media content item, and the plurality of tracked movements are depicted as the virtual content, the virtual content illustrating performance of the plurality of tracked movements on the at least one object in the plurality of extracted image frames in the second media content item.
-
A non-transitory computer-readable medium comprising instructions that, when executed, are configured to cause at least one processor to: obtain a plurality of segments of a first media content item; extract, from a first segment in the plurality of segments, a plurality of image frames, the plurality of image frames being associated with a plurality of tracked movements of at least one object represented in the extracted image frames; compare objects represented in the image frames extracted from the first segment to tracked objects in a second media content item; in response to detecting that at least one of the tracked objects in the second media content item is similar to at least one object in the plurality of extracted image frames, generate, based on the extracted plurality of image frames, virtual content depicting the plurality of tracked movements from the first segment being performed on the at least one tracked object in the second media content item; and trigger rendering of the virtual content as an overlay on the at least one tracked object in the second media content item.
-
The computer readable medium of claim 15, wherein the instructions, when executed, are configured to cause the at least one processor to perform the steps of claim 15 for each of the obtained plurality of segments of the first media content item.
-
The computer readable medium of claim 15, wherein a computer vision system is employed by the at least one processor to: analyze the first media content item to determine which of the plurality of segments to extract and which of the plurality of image frames to extract; and analyze the second media content item to determine which object corresponds to the at least one object in the plurality of extracted image frames of the first media content item.
-
The computer readable medium of claim 15, wherein the detecting that the at least one tracked object in the second media content item is similar to the at least one object in the plurality of extracted image frames includes comparing a shape of the at least one tracked object to the shape of the at least one object in the plurality of extracted image frames, and wherein the generated virtual content is depicted on the at least one tracked object according to the shape of the at least one object in the plurality of extracted image frames.
-
The computer readable medium of claim 15, wherein triggering rendering of the virtual content as an overlay on the at least one tracked object in the second media content item includes synchronizing the rendering of the virtual content on the second media content item with a timestamp associated with the first segment.
-
The computer readable medium of claim 15, wherein: the plurality of tracked movements correspond to instructional content in the first media content item; and the plurality of tracked movements are depicted as the virtual content, the virtual content illustrating performance of the plurality of tracked movements on the at least one object in the plurality of extracted image frames in the second media content item.
Description
TECHNICAL FIELD
[0001] This disclosure relates to Virtual Reality (VR) and/or Augmented Reality (AR) experiences and the use of computer vision to extract content.
BACKGROUND
[0002] Users increasingly rely on digitally formatted content to learn new skills and techniques. However, when learning, it may be difficult to translate an instructor’s physical world aspects to physical world aspects of a user accessing the digitally formatted content. For example, if an instructional video is shown for exercising a particular body part, it may be difficult for the user to translate the body part depicted in the digitally formatted content to the user’s own body part in order to properly and safely carry out the exercise. Thus, improved techniques for providing instructional content within digitally formatted content may benefit a user attempting to apply techniques shown in such content.
SUMMARY
[0003] The techniques described herein may provide an application that employs computer vision (CV) analysis to find instructional content in images and generate AR content for the instructional content. The AR content may be generated and adapted to a shape or element in a specific content feed such that the AR content may be overlaid onto a user, an object, or another element in the content feed. The overlay of the AR content may function to assist users in learning new skills by viewing the AR content on the user, object, or other element in the content feed. A content feed may be captured live, accessed online after capture, or accessed during capture of the feed with a delay.
[0004] A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that, in operation, causes the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
[0005] In a first general aspect, a computer-implemented method is described. The method is carried out by at least one processor, which may execute steps including at least obtaining a plurality of segments of a first media content item, extracting, from a first segment in the plurality of segments, a plurality of image frames, the plurality of image frames being associated with a plurality of tracked movements of at least one object represented in the extracted image frames, and comparing objects represented in the image frames extracted from the first segment to tracked objects in a second media content item. In response to detecting that at least one of the tracked objects in the second media content item is similar to at least one object in the plurality of extracted image frames, the method may include generating, based on the extracted plurality of image frames, virtual content depicting the plurality of tracked movements from the first segment being performed on the at least one tracked object in the second media content item. The method may further include triggering rendering of the virtual content as an overlay on the at least one tracked object in the second media content item.
[0006] Particular implementations of the computer-implemented method may include any or all of the following features. In some implementations, the method may use one or more image capture devices. The method may include extracting, from the plurality of segments, a second segment from the first media content item, the second segment having a timestamp after the first segment and generating, using the extracted at least one image frame from the second segment of the first media content item, virtual content that depicts the at least one image frame from the second segment on the at least one tracked object in the second media content item. In some implementations, the at least one image frame from the second segment depicts a visual result associated with the at least one object in the extracted image frames.
[0007] In some implementations, a computer vision system is employed by the at least one processor to analyze the first media content item to determine which of the plurality of segments to extract and which of the plurality of image frames to extract and to analyze the second media content item to determine which object corresponds to the at least one object in the plurality of extracted image frames of the first media content item.
[0008] In some implementations, detecting that the at least one tracked object in the second media content item is similar to the at least one object in the plurality of extracted image frames includes comparing a shape of the at least one tracked object to the shape of the at least one object in the plurality of extracted image frames. In some implementations, the generated virtual content is depicted on the at least one tracked object according to the shape of the at least one object in the plurality of extracted image frames. In some implementations, triggering rendering of the virtual content as an overlay on the at least one tracked object in the second media content item includes synchronizing the rendering of the virtual content on the second media content item with a timestamp associated with the first segment.
[0009] In some implementations, the plurality of tracked movements correspond to instructional content in the first media content item and the plurality of tracked movements are depicted as the virtual content, the virtual content illustrating performance of the plurality of tracked movements on the at least one object in the plurality of extracted image frames in the second media content item.
[0010] In some implementations, the plurality of tracked movements correspond to instructional content in the first media content item and the plurality of tracked movements are depicted as the virtual content. The virtual content may illustrate performance of the plurality of tracked movements on the at least one object in the plurality of extracted image frames in the second media content item.
[0011] Implementations of the described techniques may include systems, hardware, a method or process, and/or computer software on a computer-accessible medium. The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] FIG. 1 illustrates an example of instructional content accessed by a user utilizing an example electronic device, according to example implementations.
[0013] FIG. 2 is a block diagram of an example computing device with framework for extracting and modifying instructional content for overlay onto image content presented in an AR experience, according to example implementations.
[0014] FIGS. 3A-3D depict an example illustrating extraction and modification of instructional content for overlay onto live image content presented in an AR experience, according to example implementations.
[0015] FIGS. 4A-4B depict another example illustrating extraction and modification of instructional content for overlay onto live image content presented in an AR experience, according to example implementations.
[0016] FIGS. 5A-5B depict yet another example illustrating extraction and modification of instructional content for overlay onto live image content presented in an AR experience, according to example implementations.
[0017] FIG. 6 is an example process to analyze image content for use in generating layered augmented reality content, according to example implementations.
[0018] FIG. 7 illustrates an example of a computer device and a mobile computer device, which may be used with the techniques described herein.
[0019] The use of similar or identical reference numbers in the various drawings is intended to indicate the presence of a similar or identical element or feature.
DETAILED DESCRIPTION
[0020] This disclosure relates to Virtual Reality (VR) and/or Augmented Reality (AR) experiences and the use of computer vision (CV) techniques to enable users to view and experience immersive media content items (e.g., instructional content including, but not limited to images, image frames, videos, video clips, video or image segments, etc.). For example, the CV techniques may detect, analyze, modify, and overlay AR content (representing instructional video content) onto an image/video feed belonging to a user to visually assist the user in carrying out instructions from the instructional video.
[0021] The techniques described herein may provide an application that employs CV analysis to find instructional content in images and generate AR content for the instructional content. The AR content may be generated and adapted to a shape or element in a specific live feed such that the AR content may be overlaid onto a user, an object, or another element in the live feed. The overlay of the AR content may function to assist users in learning new skills by viewing the AR content on the user, object, or other element recognized in the live feed.
[0022] The techniques described herein may provide an advantage of improved learning, because the systems described herein can adapt and fit an AR overlay representing the instructional video content onto video and/or images of a user attempting to carry out the instructions of the instructional video, which can help guide the user using elements captured in the video and/or images of the user (and/or content/objects with which the user is interacting). In some implementations, the instructional content may be adapted to video and/or images of the user accessing the instructional content to improve user learning while providing product information and shopping opportunities related to products and content in the instructional content.
[0023] The systems and methods described herein leverage CV techniques to extract, modify, and overlay AR content (e.g., user interface (UI) elements, virtual objects, brushstrokes, etc.) onto image content. The overlaid AR content may provide the advantage of improved understanding of instructional content by providing visual instructions (e.g., content, motions, movement, etc.) that pertain to a specific user accessing the instructional content. For example, the systems and methods described herein may employ CV technology to extract instructional content (e.g., image frames, objects, movements, etc.) from a video.
[0024] In some implementations, image frames from the instructional content in the video can be preprocessed, and objects within the content may be tracked to identify relevant visual steps from the instructions provided in the video. Segmentation techniques may then be applied to extract such objects (or portions of the objects) for use in generating AR content and objects to be depicted (e.g., overlaid) on a camera feed associated with the user accessing the instructional content on an electronic device, for example.
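As a rough illustration of this preprocessing step, the sketch below pulls frames from one segment of a video and computes a simple frame-difference motion mask. It is a minimal example using OpenCV, not the patent's specific implementation; the frame indices, step size, and threshold are illustrative assumptions.

```python
# Minimal sketch: extract image frames from one segment of an instructional
# video and build a coarse motion mask between consecutive frames.
import cv2


def extract_segment_frames(video_path, start_frame, end_frame, step=2):
    """Return every `step`-th frame of the segment [start_frame, end_frame)."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
    frames = []
    for idx in range(start_frame, end_frame):
        ok, frame = cap.read()
        if not ok:
            break
        if (idx - start_frame) % step == 0:
            frames.append(frame)
    cap.release()
    return frames


def motion_mask(prev_frame, cur_frame, thresh=25):
    """Binary mask of pixels that changed between two frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev_gray, cur_gray)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return mask
```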
[0025] In some implementations, the image frames from the instructional content can be processed during play (e.g., live streaming, streaming, online access), and objects within the content may be tracked to identify visual steps from the instructions. Such image frames may be provided as AR content overlaid onto a live feed of a user carrying out the instructions of the instructional content (e.g., video).
[0026] In some implementations, the CV techniques employed by the systems and methods described herein can detect and/or otherwise assess movements carried out in an instructional video, and the results of such movements can be extracted, modified, and overlaid onto a live feed of a user carrying out the instructions on elements (e.g., a face, a craft project, a body part, etc.) shown in the live feed. In a non-limiting example, an instructional video showing how to shape an eyebrow may be playing on an electronic device while a camera of such a device captures an image (e.g., a live feed) of the user operating the electronic device. The visually instructional portions (e.g., brushstrokes, makeup application, etc.) of the instructional video may be extracted and modified so that such portions can be overlaid to appear as if the instructions are being carried out on the eyebrow of the user in the live feed.
[0027] In such an example, the systems and methods described herein may determine how to modify the extracted content by analyzing content in the instructional video and content in a video feed. For example, the systems and methods described herein may assess the shape of the eye, facial features, and/or eyebrow in the instructional video and the shape of the eye and/or eyebrow of the user in the live feed. The assessment may apply one or more algorithms to ensure the outcome of the eyebrow on the live feed follows guidelines for shaping eyebrows for a particular eye shape, facial feature, etc. For example, the instructional video may ensure that the shaping and makeup application on the eyebrow begins at a starting point associated with an inner eye location and ends at an ending point associated with an outer eye location. Such locations may be mapped to fit the shape of the eye, eyebrow, face, etc., of the user in the live feed such that the look (e.g., shape, color, movement) in the instructional video is appropriately fitted to the images of the user in the live feed. Such assessment may ensure that the user is provided a realistic approach to eyebrow shaping and associated makeup application for the eyebrow belonging to the user in the live feed. The instructions can also include feedback: if the user is not following the instructions properly, the systems and methods described herein can provide specific guidance, via textual or visual feedback, on how to modify what the user is doing.
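One way such a mapping could be implemented is sketched below: a similarity transform is estimated between corresponding anchor landmarks (e.g., inner and outer eye corners) on the instructor's face and the user's face, and the tracked path is re-projected through it. The anchor choice, example coordinates, and use of OpenCV's estimateAffinePartial2D are illustrative assumptions, not the patent's prescribed method.

```python
# Minimal sketch: remap a tracked brushstroke path from the instructor's face
# to the user's face via a similarity transform fit on corresponding landmarks.
import cv2
import numpy as np


def remap_path(path_pts, instructor_anchors, user_anchors):
    """Fit a 2D similarity transform from instructor anchors to user anchors
    and apply it to every point of the tracked path."""
    src = np.asarray(instructor_anchors, dtype=np.float32)
    dst = np.asarray(user_anchors, dtype=np.float32)
    M, _ = cv2.estimateAffinePartial2D(src, dst)   # 2x3 rotation/scale/translation
    pts = np.asarray(path_pts, dtype=np.float64)
    ones = np.ones((pts.shape[0], 1))
    return np.hstack([pts, ones]) @ M.T            # apply the affine transform


# Hypothetical example: a stroke near the instructor's eyebrow, anchored at the
# inner eye corner, outer eye corner, and brow peak (coordinates are made up).
stroke = [(120, 80), (135, 74), (152, 70), (170, 72)]
instructor_anchors = [(110, 95), (180, 93), (145, 70)]
user_anchors = [(300, 260), (380, 255), (340, 228)]
user_stroke = remap_path(stroke, instructor_anchors, user_anchors)
```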
[0028] In some implementations, the techniques described herein can be used to provide AR content to assist with instructional content for makeup application. For example, CV and object tracking can be used to detect and track movement of makeup tools and to segment makeup around the object (e.g., an eye, face, lips, etc.). For example, the techniques described herein can identify an eye category or area within an instructional video upon identifying that an eyeliner tool in the instructional video is the object moving above a threshold level (i.e., more than other objects in the video). Upon identifying that the eyeliner tool is moving, the techniques can extract the path (e.g., brushstroke) of the eyeliner tool. The extracted path may be modified to appropriately fit the eyeliner application to an eye of the user, which is captured in a camera feed directed at the face of the user.
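A minimal sketch of identifying the moving tool and extracting its path is shown below, using sparse Lucas-Kanade optical flow and treating the fastest-moving tracked feature as the tool tip; this heuristic and the parameter values are assumptions for illustration only.

```python
# Minimal sketch: track sparse features with Lucas-Kanade optical flow and keep
# the point with the largest per-frame displacement as an estimate of the
# moving tool tip (e.g., an eyeliner brush).
import cv2
import numpy as np


def track_tool_path(frames):
    prev_gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:
        return []
    path = []
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        keep = status.flatten() == 1
        good_new, good_old = nxt[keep], pts[keep]
        if len(good_new) == 0:
            break
        disp = np.linalg.norm(good_new - good_old, axis=-1)
        # Heuristic: the fastest-moving feature is taken as the tool tip.
        tip = good_new[np.argmax(disp)].ravel()
        path.append((float(tip[0]), float(tip[1])))
        prev_gray, pts = gray, good_new.reshape(-1, 1, 2)
    return path
```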
[0029] After particular relevant content is extracted using CV techniques, the content may be applied to the live feed as augmented reality (AR) content. For example, the content may be morphed to properly fit an object in the user’s live feed. For instance, the eyebrow shaping path may be extracted from an instructor’s face mesh in the instructional video and modified to fit the shape of the user’s facial features in AR. In some implementations, additional UI content may be displayed. For example, an AR application may display a dot element on top of makeup content to highlight particular instructions such as a brushstroke path, a current position of the brush, and the like. Additional elements including motion paths, joint positioning, and/or other instructional content can be provided as AR content overlaid onto the live feed.
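The UI overlay portion could look like the following sketch, which draws a remapped brushstroke path and a dot at the current brush position onto a live-feed frame; the colors, sizes, and function name are arbitrary illustrative choices, not part of the described system.

```python
# Minimal sketch: render the instructional path and a "current position" dot
# onto a live-feed frame as simple 2D overlay graphics.
import cv2
import numpy as np


def draw_instruction_overlay(frame, path_pts, current_idx):
    out = frame.copy()
    pts = np.asarray(path_pts, dtype=np.int32).reshape(-1, 1, 2)
    # The full brushstroke path, drawn as a polyline guide.
    cv2.polylines(out, [pts], isClosed=False, color=(0, 200, 255), thickness=2)
    # A dot highlighting the brush position at the current instruction step.
    x, y = path_pts[current_idx]
    cv2.circle(out, (int(x), int(y)), radius=6, color=(0, 0, 255), thickness=-1)
    return out
```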
[0030] A number of extraction methodologies and content segmentation techniques may be employed, and thus the scale of generated content (e.g., using face mesh analysis, body pose analysis, optical flow techniques, etc.) may vary depending on the type of instructional content. Similar extraction and content segmentation techniques may be applied to other instructional content examples including, but not limited to, crafts, exercise, sports, interior design, repair, hobbies, and/or other accessible instructional content.
[0031] In some implementations, the systems and methods described herein may utilize machine learning models with the CV techniques to improve tracking and segmentation results. Machine learning models that utilize neural networks may receive images as input in order to provide any number of types of output. One such example output includes image classification, in which the machine learning model is trained to indicate a class associated with an object in an image. Another example includes object detection, in which the machine learning model is trained to output the specific location of an object in the image. Yet another example includes image-to-image translation, in which the input is an image and the output is a stylized version of the original input image. Other examples can include, but are not limited to, facial feature tracking and segmentation for Augmented Reality (AR) (e.g., localizing 2D facial features from an input image or video), facial mesh generation for AR (e.g., inferring a 3D face mesh from an input image or video), hand, body, and/or pose tracking, lighting estimation for AR (e.g., estimating scene illumination from an input image to use for realistically rendering virtual assets into the image or video feed), and translation of on-screen text (e.g., product names, instructions, etc.). In some implementations, instructional content may be improved by audio inputs using speech-to-text algorithms.
[0032] In general, the techniques described herein may provide an application to find relevant content in an instructional content video and to experience the relevant content in an immersive way. Particular implementations may utilize computer vision (CV) based techniques to extract instructional content (e.g., tutorials, instructions, movements, etc.). The content may be associated with a particular timestamp from a video depicting the instructional content. The extracted content may be overlaid onto a live camera feed belonging to a user operating a device streaming (e.g., executing) the instructional media content items (e.g., videos, clips, segments, images, frames, etc.). The overlay may provide instructional guidance to the user on the live feed.
[0033] In some implementations, an object tracker is used to identify a relevant step in the instructional content. The object tracker uses an optical flow CV technique to track the relevant step. In some implementations, relevant content (e.g., makeup texture) may be extracted using segmentation techniques. The extracted content may be morphed (e.g., using face mesh or body mesh algorithms) before being overlaid onto a live camera feed. In some implementations, the extraction is pre-processed and uses a fixed number of frames in the instructional content video (e.g., before and after the current timestamp).
[0034] After the relevant content is extracted using computer vision, the content is applied to the live camera feed in AR. For example, an eyeliner path may be extracted from the instructor’s face mesh in the instructional video and modified to fit particular face portions (or shapes) belonging to the user in the live feed. In some implementations, the extraction is performed in real time and applied as AR content to the live feed of the user in near real time.
[0035] In some implementations, the techniques described herein may also use additional UI element(s) along with the extracted content to highlight the instructions. For example, a location dot element may be used on top of the makeup content to highlight the instructions, which may show a particular brushstroke path, the current position of the brush, etc. In the case of a sports-based instructional video, the AR experience may utilize such additional UI elements to indicate motion paths and proper joint positions to teach the user in the live feed to properly carry out the instructions.
[0036] In some implementations, particular extracted frames from the instructional content may be preprocessed. Such preprocessing may include the use of a vision service and/or Optical Character Recognition (OCR) to enable the systems described herein to determine and suggest a particular product or other instructional content.
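A minimal sketch of such OCR-based preprocessing is shown below; it assumes the pytesseract wrapper (and a local Tesseract installation), since the patent does not name a specific vision service or OCR engine.

```python
# Minimal sketch: run OCR on a selected frame so on-screen product names can be
# detected and later used for suggestions. Binarization settings are arbitrary.
import cv2
import pytesseract


def detect_onscreen_text(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Light Otsu binarization tends to help OCR on video text overlays.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary)
```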
[0037] FIG. 1 illustrates an example media content item 100 accessed by an example electronic device 102, according to example implementations. The electronic device 102 depicts an instructional video (e.g., content item 100) and a live feed 104 (e.g., a live video feed) of a user 106 (shown as user 106a and captured user 106b). Here, the user 106a may use device 102 to capture the live feed 104 from a front-facing camera of the electronic device 102. In some implementations, the live feed may be provided by a rear-facing camera and/or otherwise presented within an AR application 108.
[0038] In this example, the user 106a may access a camera 110 in sensor system 112, a computer vision (CV) system 114, and a tracking system 116, which may work together to provide software and algorithms that track, generate, and place AR content around the captured image feed 104 (e.g., live and in real time). For example, the computing device 102 can detect that instructional content item 100 is being accessed and that the user is capturing a feed 104. In some implementations, computations can be performed in the cloud (e.g., pre-processed or live computer vision algorithms on the video and camera feed), while the device is used to render content. Both content 100 and feed 104 may be depicted on device 102 to allow the user to learn the instructional content 100 using the face belonging to the user (106a), as shown by captured user face 106b, in this example. The device 102 may include or have access to a computer vision system 114, which can detect elements, objects, or other details in content item 100 and/or feed 104. The detected elements may represent portions of content item 100 to be modified for use in generating AR content 118, which may be overlaid onto feed 104. The tracking system 116 can assist the computer vision system 114 to extract and modify particular content 100. The AR application 108 may assist in modifying and rendering AR content 118 on device 102.
[0039] For example, the computing device 102 can detect (or be provided with indications) that instructional content item 100 is being accessed and that the user 106a is capturing the feed 104. Both content 100 and feed 104 may be depicted on device 102 to allow the user to learn the instructional content 100 on the face belonging to the user (106a), as shown by captured user face 106b, in this example. Here, the instructional content 122 includes the user 120 applying makeup to her cheek, as shown by moving hands near the cheek of user 120. The device 102 may include or have access to the computer vision system 114, which can detect the instructional content 122 (e.g., actions, movements, color application, modification of objects, facial features, etc.) or other details in content item 100. The instructional content (e.g., makeup application movements) and the resulting output of such content (e.g., makeup color application on the cheek of user 120) may be detected, extracted, and/or otherwise analyzed. In some implementations, the instructional content and resulting output can be modified (e.g., segmented, morphed, etc.) to be properly aligned to portions of the live feed 104. In this example, the instructional content and resulting output can be tracked with respect to movements (e.g., fingers/brush applying blush) and may then be modified for placement (as AR content) on the cheek of user 106b, as shown by blush content 124 in live feed 104. The AR content may be applied using the same motions as the instructional content, based on the tracked movements from the instructional content. In this example, finger-based application of cheek color can be simulated to appear over time as if the cheek color is being applied to the user 106b in the same fashion as in the instructional content. The resulting AR content 124 may appear in a determined location corresponding to the location of user 120, as retrieved from a face location of user 120, shown by content 122.
[0040] FIG. 2 is a block diagram of an example computing device 202 with framework for extracting and modifying instructional content for overlay onto image content presented in an AR experience, according to example implementations. In some implementations, the framework may extract image content from media content items (e.g., images, image frames, videos, video clips, video or image segments, etc.) for use in generating virtual content for presentation in the AR experience. In some implementations, the framework may be used to generate virtual content that may be overlaid onto other media content items (e.g., a live feed of a user) to provide an AR experience that assists the user in learning how to apply instructional content from the media content item to a face, an object, or other element captured in the live feed of the user.
[0041] In operation, the system 200 provides a mechanism to use CV to determine how to modify extracted content by analyzing content in instructional images or videos and content in a live (video) feed. In some implementations, the system 200 may use machine learning to generate virtual content from extracted content from such instructional images or videos. The virtual content may be overlaid onto a live video feed. In some implementations, the system 200 may also use machine learning to estimate high dynamic range (HDR) lighting and/or illumination for lighting and rendering the virtual content into the live feed.
[0042] As shown in FIG. 2, the computing device 202 may receive and/or access instructional content 204 via network 208, for example. The computing device 202 may also receive or otherwise access virtual content from AR content source 206 via network 208, for example.
[0043] The example computing device 202 includes memory 210, a processor assembly 212, a communication module 214, a sensor system 216, and a display device 218. The memory 210 may include an AR application 220, AR content 222, an image buffer 224, an image analyzer 226, a computer vision system 228, and a render engine 230. The computing device 202 may also include various user input devices 232 such as one or more controllers that communicate with the computing device 202 using a wireless communications protocol. In some implementations, the input device 232 may include, for example, a touch input device that can receive tactile user inputs, a microphone that can receive audible user inputs, and the like. The computing device 202 may also include one or more output devices 234. The output devices 234 may include, for example, a display for visual output, a speaker for audio output, and the like.
[0044] The computing device 202 may also include any number of sensors and/or devices in sensor system 216. For example, the sensor system 216 may include a camera assembly 236 and a 3-DoF and/or 6-DoF tracking system 238. The tracking system 238 may include (or have access to), for example, light sensors (not shown), inertial measurement unit (IMU) sensors 240, audio sensors 242, image sensors 244, distance/proximity sensors (not shown), positional sensors (not shown), haptic sensors (not shown), and/or other sensors and/or different combination(s) of sensors. Some of the sensors included in the sensor system 216 may provide for positional detection and tracking of the device 202. Some of the sensors of system 216 may provide for the capture of images of the physical environment for display on a component of a user interface rendering the AR application 220. Some of the sensors included in sensor system 216 may track content within instructional content 204 and/or one or more image and/or video feeds captured by camera assembly 236. Tracking content within both instructional content 204 (e.g., a first media content item) and feeds (e.g., a second content item) captured by assembly 236 may provide a basis for correlating objects between the two media content items for purposes of generating additional content to assist the user in learning how to carry out instructions from the instructional content 204.
[0045] The computing device 202 may also include a tracking stack 245. The tracking stack 245 may represent movement changes over time for a computing device and/or for an AR session. In some implementations, the tracking stack 245 may include the IMU sensor 240 (e.g., gyroscopes, accelerometers, magnetometers). In some implementations, the tracking stack 245 may perform image-feature movement detection. For example, the tracking stack 245 may be used to detect motion by tracking features (e.g., objects) in an image or number of images. For example, an image may include or be associated with a number of trackable features that may be tracked from frame to frame in a video including the image (or number of images), for example. Camera calibration parameters (e.g., a projection matrix) are typically known as part of an onboard device camera and thus, the tracking stack 245 may use image feature movement along with the other sensors to detect motion and changes within the image(s). The detected motion may be used to generate virtual content (e.g., AR content) using the images to fit an overlay of such images onto a live feed from camera assembly 236, for example. In some implementations, the original images and/or the AR content may be provided to neural networks 256, which may use such images and/or content to further learn and provide lighting, additional tracking, or other image changes. The output of such neural networks may be used to train AR application 220, for example, to accurately generate and render particular AR content onto live feeds.
[0046] As shown in FIG. 2, the computer vision system 228 includes a content extraction engine 250, a segment detector 252, and a texture mapper 254. The content extraction engine 250 may include a content detector 251 to analyze content within media content items (e.g., image frames, images, video, etc.) and identify particular content or content areas of the media content items in which to extract. For example, the content extraction engine may employ computer vision algorithms to identify features (e.g., objects) within particular image frames and may determine relative changes in features (e.g., objects) in the image frames relative to similar features (e.g., objects) in another set of image frames (e.g., another media content item). In some implementations, the content extraction engine 250 may recognize features and/or changes in such features between two media content items and may extract portions of a first media content item in order to enable render engine 230 to render content from the first media content item over objects and/or content in a second media content item.
[0047] Similarity may be based on performing computer vision analysis (using system 228) on both a first media content item and a second media content item. The analysis may compare particular content including, but not limited to, objects, tracked objects, object shapes, movements, tracked movements, etc. to determine a degree of similarity between such content. For example, an eye may be detected in a first media content item while another eye may be detected in a second media content item. The similarity may be used to apply movements being shown near the eye in the first media content item as an overlay on the eye in the second media content item to mimic a result from the first media content item in the second media content item.
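As one hedged example of such a similarity comparison, the sketch below matches ORB descriptors between object crops taken from the two media content items and applies a simple match-count threshold; the descriptor choice and threshold values are assumptions, not the comparison the patent specifies.

```python
# Minimal sketch: compare an object crop from the instructional video against
# an object crop from the live feed using ORB features and brute-force matching.
import cv2


def objects_look_similar(crop_a, crop_b, min_matches=25):
    orb = cv2.ORB_create(nfeatures=500)
    gray_a = cv2.cvtColor(crop_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(crop_b, cv2.COLOR_BGR2GRAY)
    _, des_a = orb.detectAndCompute(gray_a, None)
    _, des_b = orb.detectAndCompute(gray_b, None)
    if des_a is None or des_b is None:
        return False
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)
    # Keep only reasonably close descriptor matches (distance cutoff is arbitrary).
    good = [m for m in matches if m.distance < 60]
    return len(good) >= min_matches
```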
[0048] In some implementations, the computer vision system 228 may perform multiple passes of CV algorithms (e.g., techniques). In an example media content item (e.g., video), the computer vision system 228 may perform a first pass to assess areas of the video that include moving elements. For example, the computer vision system 228 may detect movement in an area where a makeup brush is moving over a face of the user. Such an area may be extracted for further processing. For example, the computer vision system 228 may perform the further processing by performing a second pass over the extracted content to target the area upon which the makeup tool is working. The targeted area in this example may include an eyeliner path, and as such, the system 228 may extract the eyeliner path for application as an overlay on another video showing a live feed of a user, for example. In some implementations, the computer vision system 228 may be used on a digital model in which a content author generates media content items using digital models instead of using themselves as the model in the media content item (e.g., video). The system 228 may extract the content and movements applied to the digital model for use as an overlay on another video showing a live feed of a user, for example.
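The two-pass idea could be prototyped as in the sketch below: a first pass accumulates frame differences to localize where motion concentrates (e.g., the region where the makeup brush moves), and a second pass restricts further analysis to that region. The thresholds and bounding-box heuristic are illustrative assumptions.

```python
# Minimal sketch of a two-pass approach: coarse motion localization, then a
# cropped region handed to a second, finer pass.
import cv2
import numpy as np


def locate_active_region(frames, thresh=25):
    """First pass: accumulate motion over the segment, return (x, y, w, h)."""
    acc = np.zeros(frames[0].shape[:2], dtype=np.float32)
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        acc += (cv2.absdiff(prev, gray) > thresh).astype(np.float32)
        prev = gray
    ys, xs = np.nonzero(acc > acc.max() * 0.25)
    if len(xs) == 0:
        h, w = acc.shape
        return 0, 0, w, h
    return xs.min(), ys.min(), xs.max() - xs.min() + 1, ys.max() - ys.min() + 1


def crop_for_second_pass(frames, box):
    """Second-pass input: the same frames restricted to the active region."""
    x, y, w, h = box
    return [f[y:y + h, x:x + w] for f in frames]
```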
[0049] The content detector 251 may identify particular edges, bounding boxes, or other portions within media content items (e.g., within image frames of media content items). For example, the content detector 251 may identify all or a portion of edges of elements (e.g., features, objects, face portions, tools, body portions, etc.) within a particular image frame (or set of image frames). In some implementations, the content detector 251 may identify edges of a tool being used in the content item (or identify bounding boxes around such tools) in order to determine that the edges (or bounded tool) represent particular unique images for a given location (e.g., a reference position for using a painting tool). In some implementations, the identified edges and bounding boxes may be provided by an author of a particular content item using timestamps, reference positions, and/or other location representation for identifying content in a media content item. In some implementations, the content detector 251 may identify features within media content items. The features may be detected when segments of the media content items are provided as part of a process to place virtual content (e.g., VR content, AR content) onto objects identified within additional media content items.
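A minimal, assumption-laden sketch of edge and bounding-box detection for a tool in a single frame follows; treating the largest contour as the tool is a simplification used only to illustrate the idea.

```python
# Minimal sketch: Canny edges, contour extraction, and a bounding box around
# the largest contour as a rough stand-in for "the tool" in a frame.
import cv2


def tool_bounding_box(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    return cv2.boundingRect(largest)   # (x, y, w, h)
```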
[0050] In some implementations, the content detector 251 may use landmark detection techniques, face mesh overlay techniques (e.g., using feature points), masking techniques, and mesh blending techniques to detect and extract particular content from media content items.
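As one possible landmark-detection building block, the sketch below uses the MediaPipe Face Mesh solution to obtain pixel-space facial landmarks from a frame; the patent does not name a specific library, so this choice is an assumption.

```python
# Minimal sketch: detect face-mesh landmarks for the first face in a frame and
# return them as pixel coordinates.
import cv2
import mediapipe as mp


def face_landmarks(frame_bgr):
    h, w = frame_bgr.shape[:2]
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                         max_num_faces=1) as mesh:
        result = mesh.process(rgb)
    if not result.multi_face_landmarks:
        return []
    # Landmarks come back normalized to [0, 1]; scale to pixel space.
    return [(lm.x * w, lm.y * h)
            for lm in result.multi_face_landmarks[0].landmark]
```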
[0051] The segment detector 252 may detect video segments within media content items. In some implementations, the segment detector 252 may be configured to detect preconfigured segments generated by a media content item author. For example, a user that generates instructional media content items may preconfigure (e.g., label, group, etc.) segments of the content item (e.g., video). The segment detector 252 may use the preconfigured segments to perform comparisons between segments of a first content item and objects in other content items. In some implementations, the segment detector 252 may generate segments using the content detector 251, for example.
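Author-preconfigured segments might be represented and looked up by timestamp as in the sketch below; the field names and example labels are hypothetical.

```python
# Minimal sketch: a hypothetical data structure for author-labeled segments and
# a lookup that returns the segment covering a given playback timestamp.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    label: str       # e.g., "apply eyeliner"
    start_s: float   # segment start time in seconds
    end_s: float     # segment end time in seconds


def segment_at(segments: List[Segment], t: float) -> Optional[Segment]:
    """Return the preconfigured segment covering timestamp t, if any."""
    for seg in segments:
        if seg.start_s <= t < seg.end_s:
            return seg
    return None


steps = [Segment("prep brow", 0.0, 12.5), Segment("shape brow", 12.5, 41.0)]
current = segment_at(steps, 20.0)   # -> the "shape brow" segment
```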
[0052] In some implementations, the texture mapper 254 may be used to extract texture (rather than a full face mesh) from particular image frames, objects, etc. The texture mapper 254 may map image detail, surface texture, and/or color information onto three-dimensional AR objects, for example. Such content may be mapped and used as an overlay onto objects within a media content item.
[0053] In some implementations, the computer vision system 228 also includes a lighting estimator 258 with access to neural networks 256. The lighting estimator 258 may include or have access to texture mapper 254 in order to provide proper lighting for the virtual content (e.g., VR and/or AR content) being overlaid onto objects or features within media content items. In some implementations, the lighting estimator 258 may be used to generate lighting estimations for an AR environment. In general, the computing device 202 can generate the lighting conditions to illuminate content which may be overlaid on objects in a media content item. In addition, the device 202 can generate the AR environment for a user of the system 200 to trigger rendering of the AR scene with the generated lighting conditions on device 202, or another device. The lighting estimator 258 can also be used to remove lighting information and extract material information from the original content so that the content can be properly applied as AR content overlaid on top of the camera feed.
[0054] As shown in FIG. 2, the render engine 230 includes a UI content generator 260 and an AR content generator 262. The UI content generator 260 may use extracted content (e.g., from engine 250) to generate and/or modify image frames representing the extracted content. Such image frames may be used by AR content generator 262 to generate the AR content for overlay onto objects within media content items. In some implementations, the UI content generator 260 may generate elements to display advertising and purchasing options for products that are described within instructional content 204, for example. In some implementations, the UI content generator 260 may additionally generate suggestions for additional media content items related to particular accessed media, instructional content, and/or products.
[0055] The computing device 202 may also include face tracking software 264. The face tracking software 264 may include (or have access to) one or more face cue detectors (not shown), smoothing algorithms, pose detection algorithms, computer vision algorithms (via computer vision system 228), optical flow algorithms, and/or neural networks 256. The face cue detectors may operate on or with one or more camera assemblies 236 to determine a movement in the position of particular facial features, head, or body of the user. For example, the face tracking software 264 may detect or obtain an initial three-dimensional (3D) position of computing device 202 in relation to facial features or body features (e.g., image features) captured by the one or more camera assemblies 236. In some implementations, one or more camera assemblies 236 may function with software 264 to retrieve particular facial features captured in a live feed, for example, by camera assemblies 236 in order to enable placement of AR content upon the facial features captured in the live feed. In addition, the tracking system 238 may access the onboard IMU sensor 240 to detect or obtain an initial orientation associated with the computing device 202, if, for example, the user is moving (or moving the device 202) during capture.
[0056] The computing device 202 may also include object tracking software 266. The object tracking software 266 may include (or have access to) one or more object detectors (e.g., object trackers, not shown), smoothing algorithms, pose detection algorithms, computer vision algorithms (via computer vision system 228), optical flow algorithms, and/or neural networks 256. The object detectors may operate on or with one or more camera assemblies 236 to determine a movement in the position of particular objects within a scene. For example, the object tracking software 266 may detect or obtain an initial three-dimensional (3D) position of computing device 202 in relation to objects (e.g., image features) captured by the one or more camera assemblies 236. In some implementations, one or more camera assemblies 236 may function with software 266 to retrieve particular object features captured in a live feed, for example, by camera assemblies 236 in order to enable placement of AR content upon the tracked objects captured in the live feed.
[0057] In some implementations, the computing device 202 is a mobile computing device (e.g., a cellular device, a tablet, a laptop, an HMD device, AR glasses, a smart watch, smart display, etc.) which may be configured to provide or output AR content to a user via the device and/or via an HMD device.
[0058] The memory 210 can include one or more non-transitory computer-readable storage media. The memory 210 may store instructions and data that are usable to generate an AR environment for a user.
[0059] The processor assembly 212 includes one or more devices that are capable of executing instructions, such as instructions stored by the memory 210, to perform various tasks associated with the systems and methods described herein. For example, the processor assembly 212 may include a central processing unit (CPU) and/or a graphics processing unit (GPU). For example, if a GPU is present, some image/video rendering tasks, such as shading content based on determined lighting parameters, may be offloaded from the CPU to the GPU.
[0060] The communication module 214 includes one or more devices for communicating with other computing devices, such as the instructional content 204 and the AR content source 206. The communication module 214 may communicate via wireless or wired networks, such as the network 208.
[0061] The IMU 240 detects motion, movement, and/or acceleration of the computing device 202 and/or the HMD. The IMU 240 may include various different types of sensors such as, for example, an accelerometer, a gyroscope, a magnetometer, and other such sensors. A position and orientation of the device 202 may be detected and tracked based on data provided by the sensors included in the IMU 240. The detected position and orientation of the device 202 may allow the system, in turn, to detect and track the user’s gaze direction and head movement. Such tracking may be added to a tracking stack 245 that may be polled by the computer vision system 228 to determine changes in device and/or user movement and to correlate times associated with such changes in movement. In some implementations, the AR application 220 may use the sensor system 216 to determine a location and orientation of a user within a physical space and/or to recognize features or objects within the physical space.
[0062] The camera assembly 236 captures images and/or videos of the physical space around the computing device 202. The camera assembly 236 may include one or more cameras. The camera assembly 236 may also include an infrared camera or time of flight sensors (e.g., used to capture depth).
[0063] The AR application 220 may present or provide virtual content (e.g., AR content) to a user via the device 202 and/or one or more output devices 234 of the computing device 202 such as the display device 218, speakers (e.g., using audio sensors 242), and/or other output devices (not shown). In some implementations, the AR application 220 includes instructions stored in the memory 210 that, when executed by the processor assembly 212, cause the processor assembly 212 to perform the operations described herein. For example, the AR application 220 may generate and present an AR environment to the user based on, for example, AR content 222 (e.g., AR content 124), and/or AR content received from the AR content source 206.
[0064] In some implementations, advertisement content 126 may be provided to the user. Such content 126 may include UI content that includes products accessed in item 100, media content items related to the particular accessed media content item 100, instructional content, and/or related products. In some implementations, the system 200 may use the computer vision system 228 or speech-to-text technology (if the content creator mentions products in the video) to automatically detect which products are used within a particular instructional content 204. The automatically detected products can be used as input to a search to generate advertisement content for display to users accessing the instructional content 204. This may provide an advantage of allowing the content item author to automatically embed (or otherwise provide) advertising content and informational content for products being used without having to manually provide the information alongside (or within) the executing content item.
[0065] The AR content 222 herein may include AR, VR, and/or mixed reality (MR) content such as images or videos that may be displayed on a display 218 associated with the computing device 202, or other display device (not shown). For example, the AR content 222 may be generated with instructional content, UI content, lighting (using lighting estimator 258) that substantially matches the physical space in which the user is located. The AR content 222 may include objects that overlay various portions of the physical space. The AR content 222 may be rendered as flat images or as three-dimensional (3D) objects. The 3D objects may include one or more objects represented as polygonal meshes. The polygonal meshes may be associated with various surface textures, such as colors and images. The polygonal meshes may be shaded based on various lighting and/or texture parameters generated by the AR content source 206 and/or computer vision system 228 and/or render engine 230.
[0066] In some implementations, a number of mesh algorithms may be used including, but not limited to, face alignment techniques involving facial landmark detection (e.g., a Haar cascade face detector or a Dlib Histogram of Oriented Gradients (HOG)-based face detector), finding a convex hull, Delaunay triangulation, and affine warping of triangles. In some implementations, seamless cloning can be used to extract, morph, and overlay content from the video onto a camera feed. Semantic segmentation can be used to extract objects from the video content. In some implementations, a neural network approach using autoencoders can be used to extract and apply content. In addition, machine learning approaches may be used to interpolate and approximate the mesh when a frame in the original video is not sufficient to extract the mesh. For example, if a person looked away or down for one second and the mesh extraction algorithm was not able to detect a human face in the frame, the preceding and succeeding frames can be used to estimate the mesh at that point in the instructions.
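A sketch of the affine-warp-triangles step mentioned above is shown below: one triangular patch from the instructional frame is warped onto the corresponding triangle of the live-feed frame. The triangle correspondences would come from Delaunay triangulation of matched landmarks (not shown), and the helper name and blending details are illustrative assumptions.

```python
# Minimal sketch: warp one triangular patch from a source frame onto the
# corresponding triangle in a destination frame (classic face-warp building block).
import cv2
import numpy as np


def warp_triangle(src_img, dst_img, tri_src, tri_dst):
    r1 = cv2.boundingRect(np.float32([tri_src]))
    r2 = cv2.boundingRect(np.float32([tri_dst]))
    # Triangle coordinates relative to their bounding rectangles.
    t1 = [(p[0] - r1[0], p[1] - r1[1]) for p in tri_src]
    t2 = [(p[0] - r2[0], p[1] - r2[1]) for p in tri_dst]
    src_patch = src_img[r1[1]:r1[1] + r1[3], r1[0]:r1[0] + r1[2]]
    M = cv2.getAffineTransform(np.float32(t1), np.float32(t2))
    warped = cv2.warpAffine(src_patch, M, (r2[2], r2[3]),
                            flags=cv2.INTER_LINEAR,
                            borderMode=cv2.BORDER_REFLECT_101)
    # Blend the warped patch into the destination frame inside the triangle only.
    mask = np.zeros((r2[3], r2[2], 3), dtype=np.float32)
    cv2.fillConvexPoly(mask, np.int32(t2), (1.0, 1.0, 1.0), 16, 0)
    roi = dst_img[r2[1]:r2[1] + r2[3], r2[0]:r2[0] + r2[2]]
    blended = roi * (1 - mask) + warped * mask
    dst_img[r2[1]:r2[1] + r2[3], r2[0]:r2[0] + r2[2]] = blended.astype(dst_img.dtype)
```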
[0067] The AR application 220 may use the image buffer 224, image analyzer 226, lighting estimator 258, and render engine 230 to generate images for display based on the AR content 222. For example, one or more images captured by the camera assembly 236 may be stored in the image buffer 224. The AR application 220 may use the computer vision system 228 to determine a location within a media content item in which to insert content. For example, the AR application 220 may determine a tracked object on which to overlay the AR content 222. In some implementations, the location may also be determined based on a location that was determined for the content in a previous image captured by the camera assembly (e.g., the AR application 220 may cause the content to move across a surface that was identified within the physical space captured in the image).
……
……
……