Google Patent | Volumetric Capture Of Objects With A Single RGBD Camera
Patent: Volumetric Capture Of Objects With A Single RGBD Camera
Publication Number: 20200349772
Publication Date: 20201105
Applicants: Google
Abstract
A method includes receiving a first image including color data and depth data, determining a viewpoint associated with an augmented reality (AR) and/or virtual reality (VR) display displaying a second image, receiving at least one calibration image including an object in the first image, the object being in a different pose as compared to a pose of the object in the first image, and generating the second image based on the first image, the viewpoint and the at least one calibration image.
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims priority to, and the benefit of, U.S. Provisional Patent Application No. 62/840,905, filed on Apr. 30, 2019, entitled “VOLUMETRIC CAPTURE OF HUMANS WITH A SINGLE RGBD CAMERA VIA SEMI-PARAMETRIC LEARNING”, the disclosure of which is incorporated by reference herein in its entirety.
FIELD
[0002] Embodiments relate to displaying images in a virtual environment and/or in an augmented reality environment (e.g., on a head mount display (HMD)).
BACKGROUND
[0003] Complex capture rigs can be used to generate very high-quality volumetric reconstructions (e.g., images). These systems rely on high-end, costly infrastructure to process the high volume of data that the rigs capture. The required computational time of several minutes per frame makes current techniques unsuitable for real-time applications. Another way to capture humans is to extend real-time non-rigid fusion pipelines to multi-view capture setups. However, the results suffer from distorted geometry, poor texturing and inaccurate lighting, making it difficult to reach the level of quality required in augmented reality (AR)/virtual reality (VR) applications.
SUMMARY
[0004] In a general aspect, a device, a system, a non-transitory computer-readable medium (having stored thereon computer executable program code which can be executed on a computer system), and/or a method can perform a process with a method including receiving a first image including color data and depth data, determining a viewpoint associated with an augmented reality (AR) and/or virtual reality (VR) display displaying a second image, receiving at least one calibration image including an object in the first image, the object being in a different pose as compared to a pose of the object in the first image, and generating the second image based on the first image, the viewpoint and the at least one calibration image.
[0005] Implementations can include one or more of the following features. For example, the first image can be received from a single camera configured to capture the color data as red, green, blue (RGB) data and at least one of capture the depth data and generate the depth data based on the color data. The viewpoint associated with the AR and/or VR display can be different than a viewpoint associated with the first image. The at least one calibration image can be a silhouette image of the object. The generating of the second image can include determining a target pose of the object by mapping two dimensional (2D) keypoints to corresponding three dimensional (3D) points of depth data associated with the at least one calibration image, and generating the second image by warping the object in the at least one calibration image using a convolutional neural network that takes the at least one calibration image and the target pose of the object as input.
[0006] For example, the generating of the second image can include generating at least one part-mask in a first pass of a convolutional neural network having the at least one calibration image as an input, generating at least one part-image in the first pass of the convolutional neural network, and generating the second image in a second pass of the convolutional neural network having the at least one part-mask and the at least one part-image as input. The generating of the second image can include using two passes of a convolutional neural network that is trained by minimizing at least two losses associated with warping the object. The second image can be blended using a neural network to generate missing portions of the second image. The second image can be a silhouette image of the object, the method further comprising merging the second image with a background image.
[0007] For example, the method can further include a pre-processing stage in which a plurality of images can be captured while the pose of the object is changed, storing the plurality of images as the at least one calibration image, generating a similarity score for each of the at least one calibration image based on a target pose, and selecting the at least one calibration image from the at least one calibration image based on the similarity score. The method can further include a pre-processing stage in which a plurality of images can be captured while the pose of the object is changed, storing the plurality of images as the at least one calibration image, capturing an image, during a communications event, the image including the object in a new pose, and adding the image to the stored plurality of images. In addition, a non-transitory computer-readable storage medium may have stored thereon computer executable program code which, when executed on a computer system, causes the computer system to perform a method according to any of the method claims. Also, an augmented reality (AR) and/or virtual reality (VR) system may comprise a sensor configured to capture color data and depth data and a processor configured to perform a method according to any of the method claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] Example embodiments will become more fully understood from the detailed description given herein below and the accompanying drawings, wherein like elements are represented by like reference numerals, which are given by way of illustration only and thus are not limiting of the example embodiments and wherein:
[0009] FIG. 1A illustrates a diagram of a system according to an example implementation.
[0010] FIG. 1B illustrates a block diagram of a signal flow according to an example implementation.
[0011] FIG. 2 illustrates a block diagram of a signal flow according to an example implementation.
[0012] FIG. 3 illustrates another block diagram of a signal flow according to an example implementation.
[0013] FIG. 4 illustrates still another block diagram of a signal flow according to an example implementation.
[0014] FIG. 5 illustrates yet another block diagram of a signal flow according to an example implementation.
[0015] FIG. 6 illustrates a method for generating an image according to an example implementation.
[0016] FIG. 7 illustrates a method for generating a normal map according to an example implementation.
[0017] FIG. 8 illustrates a method for selecting an image according to an example implementation.
[0018] FIG. 9 illustrates a method for warping an image according to an example implementation.
[0019] FIG. 10 illustrates a method for generating an image according to an example implementation.
[0020] FIG. 11 shows an example of a computer device and a mobile computer device according to at least one example embodiment.
[0021] It should be noted that these Figures are intended to illustrate the general characteristics of methods, structure and/or materials utilized in certain example embodiments and to supplement the written description provided below. These drawings are not, however, to scale and may not reflect the precise structural or performance characteristics of any given embodiment, and should not be interpreted as defining or limiting the range of values or properties encompassed by example embodiments. For example, the relative thicknesses and positioning of molecules, layers, regions and/or structural elements may be reduced or exaggerated for clarity. The use of similar or identical reference numbers in the various drawings is intended to indicate the presence of a similar or identical element or feature.
DETAILED DESCRIPTION
[0022] Techniques used to capture and/or generate multi-view images for augmented reality (AR)/virtual reality (VR) (e.g., AR and/or VR) applications can be problematic in that they rely on expensive multi-view capture systems that can use complex capture rigs including several (4-8) calibrated RGBD sensors in order to generate the very high-quality volumetric reconstructions necessary in AR and/or VR applications. For example, real-time capture and/or generation of multi-view images may be possible; however, the complications of setting up such a multi-view system and the associated cost remain high. Further, real-time multi-view systems can also have reduced quality compared to their non-real-time counterparts. Example implementations can solve the problem of these expensive systems by using a single camera (e.g., RGBD sensor) in AR and/or VR applications. The single camera can capture and store images, as calibration images, that can be used by AR and/or VR applications to generate high-quality volumetric reconstructions.
[0023] The technical solution can include determining a pose of an object (e.g., a user) captured by the single camera. One of the calibration images can then be selected based on the pose and a viewpoint. For example, a calibration image that includes the object in a pose that most closely (but not likely exactly) matches the determined pose at the viewpoint may be selected. The object in the calibration image is then warped to match the pose of the object in the captured image. The image (including the warped object and a background) is then output as the volumetric reconstruction. Further, an augmented reality (AR) and/or virtual reality (VR) system may comprise a sensor configured to capture color data and depth data as well as a processor configured to (i) determine a viewpoint associated with an augmented reality (AR) and/or virtual reality (VR) display displaying a second image, (ii) receive at least one calibration image including an object in the first image, the object being in a different pose as compared to a pose of the object in the first image, and (iii) generate the second image based on the first image, the viewpoint and the at least one calibration image. The object in the at least one calibration image is warped to attain a different pose as compared to a pose of the object in the first image.
[0024] The use of the single camera with the calibration images can be beneficial in that the single camera can simplify (e.g., simplify setup and operation) AR and/or VR systems and make the AR and/or VR systems available to more users. Further, the use of the single camera with the calibration images can reduce the cost associated with the AR and/or VR systems. In other words, the benefit of example implementations can be the ability to use a multi-view setup to capture ground truth data, and to train a model that enables free viewpoint rendering using only a single RGBD sensor. The ability to render portions of an object (e.g., a human) that cannot be seen in the current view comes from previously captured calibration images and information associated with the object’s shape (e.g., the human shape) and color learned by the model through training on multiple objects.
[0025] FIG. 1A illustrates a diagram of a system according to an example implementation. As shown in FIG. 1A, the system 100-A includes a camera 105, a server 140, a user 145, and a user 130. The system 100-A can be, at least, a portion of an augmented reality (AR) and/or virtual reality (VR) system. Therefore, the user 145 and the user 130 can be communicating using an AR and/or VR application. The user 130 can be communicating through use of the camera 105. The user 145 can be communicating via an AR and/or VR display (e.g., a head mount display (HMD), a display associated with a mobile device).
[0026] The camera 105 can capture image data, video data and/or video frame data (hereinafter image data) of the user 130 and communicate the image data to the server 140. The image data can include color (e.g., pixel) data and depth data. For example, the image data can be RGBD data. According to example implementations, a single (e.g., one) camera 105 is used. A camera can sometimes be called a sensor. Therefore, the camera 105 can be a single sensor (both a camera and a sensor may be referred to herein). The single camera 105 can be a conventional (e.g., readily available in commerce) camera configured to capture and communicate color data and depth data. The apparatus and methods associated with this system 100-A are advantageous over techniques that involve the use of, for example, expensive multi-view capture rigs with several (4-8) calibrated RGBD sensors.
[0027] The camera 105 can be used to capture image(s) that are communicated to the server 140 and stored by the server 140. The image(s) can be called calibration image(s). The calibration image(s) can be captured as an initialization process (e.g., a pre-processing stage) of the AR and/or VR application.
[0028] The server 140 can be configured to use the calibration image(s) and/or the image data to generate and communicate modified image data 125 to the AR and/or VR display used by user 145. The user 130 can be in a first position and pose 130-1. The user 145 can move (e.g., virtually move) around the user 130, resulting in the modified image data 125 being a rendering of the user 130 in a second position and pose 130-2 on the AR and/or VR display. The second position and pose 130-2 is different from the first position and pose 130-1. Accordingly, the server 140 can generate the modified image data 125 in response to receiving a viewpoint 120 associated with the AR and/or VR display.
[0029] In an example implementation, the camera 105 is in a fixed position. In other words, camera 105 does not move in response to the server 140 receiving the viewpoint 120. Accordingly, the server 140 can be configured to generate the modified image data 125 with the camera 105 in the fixed position. In other words, a rendered image can include at least one object (e.g., a human) in a different position and/or pose than the corresponding captured image.
[0030] In an example implementation, a viewpoint can refer to a virtual point from which the second image is to be observed. A pose of an object can include information on the object's spatial orientation (and, possibly, on the relative position of different portions of the object). A calibration image can be an image that is captured and stored prior to generating the modified image. The calibration image can be filtered to remove a background so that the calibration image only includes an object of interest (e.g., a video call participant); such an image is sometimes called a silhouette image. A plurality of calibration images can include images (and/or silhouette images) with the object of interest in different poses and viewpoints.
[0031] FIG. 1B illustrates a block diagram of a signal flow according to an example implementation. As shown in FIG. 1B, the signal flow 100-B includes the camera 105, a modify block 110, image(s) block 115, the viewpoint 120 and a modified image 125. In FIG. 1B, modify block 110 receives image data from camera 105, images (or calibration images) from image(s) block 115 and the viewpoint associated with an AR and/or VR display from the viewpoint 120, and generates the modified image 125. In other words, the modified image 125 can be generated based on captured image data, stored image data (e.g., the calibration image(s)) and a viewpoint. The captured image data and the stored image data (e.g., the calibration image(s)) can be captured by a single camera (e.g., camera 105), which is advantageous over techniques that involve the use of, for example, expensive multi-view capture rigs with several (4-8) calibrated RGBD sensors.
[0032] The modify block 110 can be program code stored in a memory of the server 140 which is executed by a processor of the server 140. The image(s) 115 can be at least one calibration image stored in a memory of server 140. The at least one calibration image can be captured using the camera 105 prior to initiating communication using an AR and/or VR application. The at least one calibration image can be captured using the camera 105 during a calibration process as an initial or first phase (or stage or step) after initiating communication using an AR and/or VR application. The at least one calibration image can be captured using the camera 105 while communicating using an AR and/or VR application (e.g., as a user moves, rotates, changes a position and/or changes a pose). Example implementations can utilize the at least one calibration image in place of images captured by a second camera (or multiple cameras) included in expensive multi-view capture rigs with several (4-8) calibrated RGBD sensors.
[0033] Modify block 110 can modify an image (e.g., generate modified image 125) by selecting an image from image(s) 115 based on the image received from camera 105. An object (e.g., a human) in the selected image is warped based on the viewpoint 120, a position of the object in the captured image and a pose of the object in the captured image (sometimes called a position and pose). Warping (sometimes called distorting) an object of an image can include changing a position and pose of the object. Changing the position and pose of the object can include manipulating pixels (e.g., moving or mapping to a different location (x, y)) associated with the object in the image.
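As a concrete, intentionally simplified illustration of warping by pixel remapping, the sketch below moves each pixel of an image to a new (x, y) location given a per-pixel destination map. It is only a minimal NumPy sketch with illustrative names; the learned warping described with respect to FIG. 4 and Equation 3 below is considerably more involved.

```python
import numpy as np

def remap_pixels(image, map_x, map_y):
    """Warp an image by moving each source pixel (x, y) to the destination
    (map_x[y, x], map_y[y, x]).

    image : (H, W, 3) array; map_x, map_y : (H, W) integer arrays.
    Destinations are clamped to the frame; overlapping pixels simply
    overwrite each other (no blending) in this simplified sketch.
    """
    h, w = image.shape[:2]
    warped = np.zeros_like(image)
    src_y, src_x = np.mgrid[0:h, 0:w]
    dst_x = np.clip(map_x, 0, w - 1)
    dst_y = np.clip(map_y, 0, h - 1)
    warped[dst_y, dst_x] = image[src_y, src_x]
    return warped
```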
[0034] Therefore, according to example implementations AR and/or VR applications can generate high-quality volumetric reconstructions using a single camera and calibration images. For example, a device, a system, a non-transitory computer-readable medium (having stored thereon computer executable program code which can be executed on a computer system), and/or a method can implement a technique with a method including receiving a first image including color data and depth data, determining a viewpoint associated with an augmented reality (AR) and/or virtual reality (VR) display displaying a second image, receiving at least one calibration image including an object in the first image, the object being in a different pose as compared to a pose of the object in the first image, and generating the second image based on the first image, the viewpoint and the at least one calibration image. The object in the at least one calibration image is warped to attain a different pose as compared to a pose of the object in the first image.
[0035] Also, for example, the first image can be received from a single camera configured to capture the color data as red, green, blue (RGB) data and at least one of capture the depth data and generate the depth data based on the color data. The viewpoint associated with the AR and/or VR display can be different than a viewpoint associated with the first image. The at least one calibration image can be a silhouette image of the object. The generating of the second image can include determining a target pose of the object by mapping two dimensional (2D) keypoints to corresponding three dimensional (3D) points of depth data associated with the at least one calibration image, and generating the second image by warping the object in the at least one calibration image using a convolutional neural network that takes the at least one calibration image and the target pose of the object as input.
[0036] Also, for example, the generating of the second image can include generating at least one part-mask in a first pass of a convolutional neural network having the at least one calibration image as an input, generating at least one part-image in the first pass of the convolutional neural network, and generating the second image in a second pass of the convolutional neural network having the at least one part-mask and the at least one part-image as input. The generating of the second image can include using two passes of a convolutional neural network that is trained by minimizing at least two losses associated with warping the object. The second image can be blended using a neural network to generate missing portions of the second image. The second image can be a silhouette image of the object, the method further comprising merging the second image with a background image.
[0037] Also, for example, the method can further include a pre-processing stage in which a plurality of images can be captured while the pose of the object is changed, storing the plurality of images as the at least one calibration image, generating a similarity score for each of the at least one calibration image based on a target pose, and selecting the at least one calibration image from the at least one calibration image based on the similarity score. The method can further include a pre-processing stage in which a plurality of images can be captured while the pose of the object is changed, storing the plurality of images as the at least one calibration image, capturing an image, during a communications event, the image including the object in a new pose, and adding the image to the stored plurality of images.
[0038] In a first stage, image data (I_cloud), a normal map (N), a pose (k̂), and a confidence (c) are generated from image data (I_v) captured by a camera at viewpoint (v) (e.g., an RGBD image captured by camera 105). For example, a colored depthmap can be re-rendered from a viewpoint (ν) to generate the image (I_cloud) and to generate the approximate normal map (N). In an example implementation, only the foreground of the image is re-rendered, by using a fast background subtraction technique that is based on depth and color (e.g., RGB). Further, the pose (k̂) of an object (e.g., a user of the VR/AR application) can be determined by generating keypoints in the coordinate frame of the viewpoint (ν). Additionally, the confidence (c) (e.g., as a scalar value) can be determined by measuring the divergence between the viewpoints (v, ν). Equation 1, in which the first-stage processing is denoted $\mathcal{D}$, can represent this technique.
$$I_{\text{cloud}},\ \hat{k},\ N,\ c = \mathcal{D}(I_v,\ v,\ \nu) \qquad (1)$$
where,
[0039] I_cloud is the re-rendered image data,
[0040] N is the normal map,
[0041] k̂ is the pose,
[0042] c is a (scalar) confidence,
[0043] I_v is the captured image data,
[0044] v is the viewpoint of the camera, and
[0045] ν is the viewpoint of the AR and/or VR display.
[0046] FIG. 2 illustrates a block diagram of a signal flow according to an example implementation. The signal flow illustrated in FIG. 2 can be of an example implementation of the aforementioned first stage. As shown in FIG. 2, a captured image attribute 205 block includes a detector 210 block, an image 215 block, a normal map 220 block, a pose block 225, and a confidence 230 block. The detector 210 block receives image data from the camera 105. The image data can include color data and depth data. In the signal flow of FIG. 2, the intrinsic parameters of the camera (e.g., camera 105), namely the optical center o and focal length f, are known. Therefore, the function $\Pi^{-1}(p, z \mid o, f): \mathbb{R}^3 \to \mathbb{R}^3$ maps a 2D pixel p=(x, y) with associated depth z to a 3D point in the local camera coordinate frame.
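For illustration, a minimal NumPy sketch of this back-projection is shown below, assuming a standard pinhole camera model; the function and argument names are illustrative and not taken from the patent.

```python
import numpy as np

def backproject(p, z, o, f):
    """Pi^{-1}(p, z | o, f): map a 2D pixel p = (x, y) with depth z to a 3D
    point in the local camera coordinate frame (pinhole camera model)."""
    x, y = p
    ox, oy = o            # optical center (principal point), in pixels
    fx, fy = f            # focal lengths, in pixels
    return np.array([(x - ox) * z / fx, (y - oy) * z / fy, z])
```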
[0047] The detector 210 can be configured to generate the image 215, the normal map 220, the pose 225 and the confidence 230 based on the image data received from camera 105. In an example implementation, image 215 can be rendered from the image data using the function $\Pi^{-1}$.
[0048] To do so, the depth channel of I_v is converted into a point cloud of size M in matrix form as $P \in \mathbb{R}^{4 \times M}$. The point cloud is then rotated and translated into the novel viewpoint coordinate frame as $\tilde{P} = TP$, where $T \in \mathbb{R}^{4 \times 4}$ is a homogeneous transformation representing the relative transformation between v and ν. $\tilde{P}$ is then rendered to a two-dimensional (2D) image I_cloud by inserting each point with a 3×3 kernel to reduce re-sampling artifacts. Inserting each point is sometimes called point-based rendering or splatting (e.g., using a function call in OpenGL).
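A rough NumPy sketch of this re-rendering step is given below. It back-projects the foreground depth pixels into a 4×M homogeneous point cloud, applies the 4×4 transform T, projects into the novel view, and splats each point with a 3×3 kernel. It assumes the same pinhole intrinsics for both views and omits z-buffering, so it is only an illustrative approximation of the splatting described above.

```python
import numpy as np

def render_point_cloud(depth, color, T, o, f):
    """Splat an RGBD frame into a novel viewpoint.

    depth : (H, W) depth map (0 marks background), color : (H, W, 3) image,
    T : (4, 4) homogeneous transform between the two viewpoints,
    o = (ox, oy), f = (fx, fy) : pinhole intrinsics (assumed shared).
    Returns the re-rendered color image I_cloud.
    """
    h, w = depth.shape
    ys, xs = np.nonzero(depth > 0)                    # foreground pixels
    z = depth[ys, xs]
    X = (xs - o[0]) * z / f[0]
    Y = (ys - o[1]) * z / f[1]
    P = np.stack([X, Y, z, np.ones_like(z)])          # 4 x M point cloud
    Pt = T @ P                                        # rotate + translate
    front = Pt[2] > 1e-6                              # keep points in front
    Pt, ys, xs = Pt[:, front], ys[front], xs[front]
    col = np.round(f[0] * Pt[0] / Pt[2] + o[0]).astype(int)
    row = np.round(f[1] * Pt[1] / Pt[2] + o[1]).astype(int)
    out = np.zeros((h, w, 3), dtype=color.dtype)
    for dy in (-1, 0, 1):                             # 3x3 splat kernel
        for dx in (-1, 0, 1):
            cc, rr = col + dx, row + dy
            ok = (cc >= 0) & (cc < w) & (rr >= 0) & (rr < h)
            out[rr[ok], cc[ok]] = color[ys[ok], xs[ok]]
    return out
```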
[0049] In an example implementation, the detector 210 can detect the pose (k̂) of the object (e.g., the user) by determining (e.g., calculating, computing, and/or the like) 2D keypoints as $\hat{k}_{2D} = K_\gamma(I_v)$, where $K_\gamma$ is a pre-trained feed-forward network. The 2D keypoints can then be mapped to their 3D counterparts by using the depth channel of I_v, and the keypoints can be transformed into the coordinate frame of the viewpoint ν as k̂. Missing keypoints can be extrapolated based on object features. For example, if the object is a human, keypoints can be extrapolated based on a rigidity of the limbs, torso and/or face.
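The lifting of 2D keypoints to 3D can be sketched as follows; the keypoint detector $K_\gamma$ is treated as a black box, and the helper mirrors the back-projection above (names are illustrative).

```python
import numpy as np

def lift_keypoints(kp_2d, depth, T, o, f):
    """Lift 2D keypoints (N, 2) to 3D with the depth channel and transform
    them into the novel viewpoint frame with the 4x4 transform T.

    Keypoints with missing depth are returned as NaN so the caller can
    extrapolate them (e.g., from limb/torso/face rigidity)."""
    pts = []
    for x, y in np.round(kp_2d).astype(int):
        z = depth[y, x]
        if z <= 0:
            pts.append([np.nan, np.nan, np.nan, 1.0])
        else:
            pts.append([(x - o[0]) * z / f[0], (y - o[1]) * z / f[1], z, 1.0])
    k3d = np.asarray(pts).T                 # 4 x N homogeneous keypoints
    return (T @ k3d)[:3].T                  # N x 3 keypoints in the new frame
```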
[0050] In some implementations, extrapolating keypoints may fail. In this situation the current image (e.g., as a frame of a video) can be discarded and a previous pose (k̂) is used. In some implementations, in order to use (e.g., to communicate data if necessary) the keypoints of the pose (k̂) in the networks of equations (3) and (4), described below, each keypoint can be encoded, in an image channel having the resolution of the image I_cloud (e.g., a grayscale image), as a Gaussian centered around the keypoint with a fixed variance.
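This Gaussian encoding can be sketched as below: each keypoint gets its own heatmap channel, with a fixed variance whose value here is an arbitrary placeholder.

```python
import numpy as np

def encode_keypoints(kp_2d, height, width, sigma=4.0):
    """Encode 2D keypoints as Gaussian heatmaps (one channel per keypoint).

    kp_2d : (N, 2) array of (x, y) coordinates; sigma is a fixed standard
    deviation chosen here as a placeholder.
    Returns an (N, H, W) float32 array.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((len(kp_2d), height, width), dtype=np.float32)
    for i, (kx, ky) in enumerate(kp_2d):
        d2 = (xs - kx) ** 2 + (ys - ky) ** 2
        maps[i] = np.exp(-d2 / (2.0 * sigma ** 2))
    return maps
```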
[0051] In an example implementation, the detector 210 can generate a normal map 220 (or N). The normal map can be used to determine whether a pixel in the image data (I_v) can be observed sufficiently in reference to the viewpoint of the camera v. The normal map can be generated using the techniques used to generate the image data (I_cloud) described above. The normal map (N) color components (e.g., RGB) can correspond to the x, y, z coordinates of the surface normal.
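The patent does not detail how the normals are computed; one common approximation, sketched below, back-projects the depth map and normalizes the cross product of its local image-axis gradients, packing the x, y, z components into the three color channels.

```python
import numpy as np

def normal_map_from_depth(depth, o, f):
    """Approximate a per-pixel normal map N from a depth map.

    Back-projects every pixel to 3D, differentiates along the image axes,
    and normalizes the cross product of the two tangent vectors.
    Returns an (H, W, 3) array whose x, y, z components map to RGB."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    X = (xs - o[0]) * depth / f[0]
    Y = (ys - o[1]) * depth / f[1]
    points = np.dstack([X, Y, depth])            # (H, W, 3) point map
    du = np.gradient(points, axis=1)             # tangent along columns
    dv = np.gradient(points, axis=0)             # tangent along rows
    n = np.cross(du, dv)
    norm = np.linalg.norm(n, axis=2, keepdims=True)
    return n / np.clip(norm, 1e-8, None)
```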
[0052] In an example implementation, the detector 210 can generate a confidence 230 (or c). The confidence (c) can be determined (e.g., calculated, computed, and/or the like) as the dot product between the cameras' view vectors: $c = [0, 0, 1] \cdot r_z / \lVert r_z \rVert$, where the viewpoint of the camera (v) is assumed to be the origin and $r_z$ is the third column of the rotation matrix for the viewpoint of the AR and/or VR display (ν). The relationship between v and ν, as a function of c, can be used to infer whether the viewpoint of the AR and/or VR display (ν) is back-facing (c<0) or front-facing (c>0).
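In code this reduces to a single dot product; the sketch below assumes the display viewpoint is given as a rotation matrix relative to the camera.

```python
import numpy as np

def view_confidence(R_display):
    """c = [0, 0, 1] . r_z / ||r_z||, where r_z is the third column of the
    display viewpoint's rotation matrix and the camera sits at the origin.
    c > 0 indicates a front-facing viewpoint, c < 0 a back-facing one."""
    r_z = R_display[:, 2]
    return float(np.dot([0.0, 0.0, 1.0], r_z) / np.linalg.norm(r_z))
```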
[0053] In a pre-processing (e.g., before the first) stage, a set of calibration images {I_calib^n} can be captured and stored. The calibration images can be taken of the object (e.g., the user) in any number of poses and/or positions {k̂_calib^n}. For example, the AR and/or VR application can include a routine configured to instruct the object (e.g., the user) to move, rotate, change position, change pose, and/or the like in front of the camera (e.g., camera 105) before the AR and/or VR communication starts. Example implementations may not capture a set of calibration images {I_calib^n} large enough to contain the object in every possible pose that could be observed from the viewpoint of the AR and/or VR display (ν). However, the set of calibration images {I_calib^n} can include a sufficient number of images to extrapolate the appearance of the object from most, if not all, possible viewpoints of the AR and/or VR display (ν).
[0054] According to an example embodiment, an image that best resembles the object in the new, target or desired pose (k̂) in the viewpoint of the AR and/or VR display (ν) can be selected from the set of calibration images and poses {I_calib^n, k̂_calib^n}. Equation 2 can represent this technique.
$$I_{\text{calib}},\ \hat{k}_{\text{calib}} = S(\{I_{\text{calib}}^n,\ \hat{k}_{\text{calib}}^n\},\ \hat{k}) \qquad (2)$$
[0055] where,
[0056] I_calib is the selected calibration image,
[0057] k̂_calib is the selected calibration image pose,
[0058] {I_calib^n} is the set of calibration images,
[0059] {k̂_calib^n} is the set of calibration image poses, and
[0060] k̂ is the target or desired pose.
[0061] FIG. 3 illustrates a block diagram of a signal flow according to an example implementation. The signal flow illustrated in FIG. 3 can be of an example implementation of selecting a calibration image stored during the aforementioned pre-processing stage. As shown in FIG. 3, a calibration image 305 block includes a pose detector 310 block, a selector 315 block, an image 320 block and a pose 325 block. The selector 315 block uses the image(s) 115 block, the pose 225 block, the viewpoint 120 block and the pose detector 310 block as input. Further, the pose detector 310 block uses the image(s) 115 block as input.
[0062] The pose detector 310 is configured to detect a pose of each calibration image (e.g., each of the set of calibration images {I_calib^n}) received from the image(s) 115 block. Each pose is sent to the selector 315 block and associated with its corresponding image of the image(s) 115. The selector 315 block uses the pose corresponding to each image of the image(s) 115, the pose 225 and the viewpoint 120 to select one of the image(s) 115 as the calibration image (I_calib).
[0063] The selected image can be the image that has the pose that most closely matches the pose 225 at the viewpoint 120. In other words, the selected image can be the image that, when warped using equation (3) (see below), can provide sufficient information to equation (4) (see below) to produce the image to be displayed on the VR/AR display. Image 320 can be the calibration image (I_calib), and the pose of the selected calibration image (k̂_calib), as determined by the pose detector 310, can be the pose 325.
[0064] A score for each of the image(s) 115 can be determined (e.g., calculated, computed, and/or the like). In an example implementation, the image with the highest score can be selected as the calibration image. Alternatively, the image with the lowest score can be selected as the calibration image. Alternatively, the image with a score that satisfies some criterion (e.g., equal to or above a threshold number, below a threshold number, and the like) can be selected as the calibration image. The score can be calculated based on weighted scores for elements of the object. For example, the score for an object that is a human can be computed as follows:
$$S^n = \omega_{\text{head}} S_{\text{head}}^n + \omega_{\text{torso}} S_{\text{torso}}^n + \omega_{\text{sim}} S_{\text{sim}}^n \qquad (5)$$
[0065] where,
[0066] ω_head is a weight variable for a head score,
[0067] S_head^n is the head score,
[0068] ω_torso is a weight variable for a torso score,
[0069] S_torso^n is the torso score,
[0070] ω_sim is a weight variable for a similarity score, and
[0071] S_sim^n is the similarity score.
[0072] A 3D unit vector d representing the forward-looking direction of the user's head can be computed using the 3D keypoints k̂. The vector can be computed by creating a local coordinate system from the keypoints of the eyes and nose. 3D unit vectors {d_calib^n} can be determined (e.g., calculated, computed, and/or the like) from the calibration images' keypoints {k̂_calib^n}. The head score can be determined (e.g., calculated, computed, and/or the like) as the dot product $S_{\text{head}}^n = d \cdot d_{\text{calib}}^n$, and a similar process can be used to determine (e.g., calculate, compute, and/or the like) $S_{\text{torso}}^n$, where the coordinate system can be created from the left/right shoulder and the left hip keypoints.
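A minimal sketch of the head-direction score follows. The forward axis is built here as the cross product of the eye-to-eye and eyes-to-nose vectors; this particular construction (and the handedness it assumes) is an illustrative choice, not taken from the patent.

```python
import numpy as np

def head_direction(left_eye, right_eye, nose):
    """Unit forward-looking direction from a local coordinate system
    spanned by the 3D eye and nose keypoints (one plausible construction)."""
    across = right_eye - left_eye                    # left-to-right axis
    down = nose - 0.5 * (left_eye + right_eye)       # eyes-to-nose axis
    forward = np.cross(across, down)                 # out-of-face axis
    return forward / np.linalg.norm(forward)

def head_score(d, d_calib):
    """S_head^n = d . d_calib^n for unit 3D direction vectors."""
    return float(np.dot(d, d_calib))
```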
[0073] These two scores are already sufficient to accurately select a calibration image from the desired novel viewpoint. However, they do not take into account the configuration of the limbs. Therefore, S_sim^n can be used to determine (e.g., calculate, compute, and/or the like) a similarity score comparing the keypoints k̂_calib^n in the calibration images to those in the new, target or desired pose (k̂).
[0074] In order to simplify the notation, k̂ and k̂_calib^n can be referred to here as the image-space 2D coordinates of the keypoints in homogeneous coordinates. A similarity transformation (rotation, translation, scale) $T_n \in \mathbb{R}^{3 \times 3}$ that can align the two sets can be determined (e.g., calculated, computed, and/or the like). In an example implementation, at least 2 points may be needed to estimate a 4 degrees of freedom (DOF) transformation (e.g., one for rotation, two for translation, and one for scale). Therefore, arm keypoints (elbow, wrist) and leg keypoints (knee, foot) can be grouped together. For example, the transformation for all the keypoints belonging to the left arm group (LA) can be calculated as:
$$T_n^{LA} = \underset{T}{\arg\min}\ \big\lVert \hat{k}^{LA} - T\,\hat{k}_{\text{calib}}^{n,LA} \big\rVert^2 \qquad (6)$$
[0075] where,
[0076] k̂^LA are the detected left arm keypoints for the current view,
[0077] k̂_calib^{n,LA} are the detected left arm keypoints for the calibration images,
[0078] the keypoints are expressed in 2D homogeneous coordinates (3×1 vectors), and
[0079] the transformation T_n^LA is a 3×3 similarity matrix that is applied to rotate, translate and scale each keypoint from the calibration image to the current view.
[0080] The similarity score can be defined as:
$$S^{LA} = \exp\!\big(-\sigma\,\big\lVert \hat{k}^{LA} - T_n^{LA}\,\hat{k}_{\text{calib}}^{n,LA} \big\rVert\big) \qquad (7)$$
[0081] where,
[0082] σ is a scaling factor, and
[0083] all the other quantities are defined above.
[0084] The final S_sim^n can be the sum of the scores for elements of the object (e.g., the four (4) limbs) (indexed by j). The weights ω_j can be adjusted to give more importance to head and torso directions, which can define the desired target viewpoint. The calibration image (I_calib) with the respective pose k̂_calib and with the highest score S^n can be selected as the calibration image (e.g., by selector 315).
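Tying equations (5) through (7) together, the sketch below fits a least-squares 2D similarity transform per limb group (an Umeyama-style closed form, which is one standard way to solve equation (6)), converts the residual into the exponential score of equation (7), and combines the weighted terms of equation (5) to pick the best calibration image. The weight values, σ, and the layout of the input data are placeholders.

```python
import numpy as np

def fit_similarity_2d(src, dst):
    """Least-squares 2D similarity (rotation, translation, scale) mapping
    src (N, 2) onto dst (N, 2), returned as a 3x3 homogeneous matrix."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    s, d = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(d.T @ s)
    sign = np.eye(2)
    if np.linalg.det(U @ Vt) < 0:              # avoid a reflection
        sign[1, 1] = -1.0
    R = U @ sign @ Vt
    scale = np.trace(np.diag(S) @ sign) / (s ** 2).sum()
    T = np.eye(3)
    T[:2, :2] = scale * R
    T[:2, 2] = mu_d - scale * R @ mu_s
    return T

def limb_score(k_cur, k_cal, sigma=0.05):
    """Equation (7): S = exp(-sigma * ||k_cur - T k_cal||) for one group."""
    T = fit_similarity_2d(k_cal, k_cur)
    k_cal_h = np.hstack([k_cal, np.ones((len(k_cal), 1))])
    residual = np.linalg.norm(k_cur - (T @ k_cal_h.T).T[:, :2])
    return float(np.exp(-sigma * residual))

def select_calibration(current, calib_entries, w=(0.4, 0.4, 0.2)):
    """Equation (5): score every calibration entry and return the index of
    the highest-scoring one.

    `current` and each calibration entry are dicts with unit vectors
    'head_dir' and 'torso_dir' plus 'limbs', a list of (N_j, 2) keypoint
    arrays per limb group (placeholder data layout)."""
    w_head, w_torso, w_sim = w
    scores = []
    for cal in calib_entries:
        s_head = float(np.dot(current["head_dir"], cal["head_dir"]))
        s_torso = float(np.dot(current["torso_dir"], cal["torso_dir"]))
        s_sim = sum(limb_score(a, b)
                    for a, b in zip(current["limbs"], cal["limbs"]))
        scores.append(w_head * s_head + w_torso * s_torso + w_sim * s_sim)
    return int(np.argmax(scores))
```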
[0085] The selected calibration image (I_calib) should have a similar viewpoint to the viewpoint 120 associated with the AR and/or VR display (ν). However, the pose k̂_calib could be different from the desired pose (k̂). In other words, the set of calibration images ({I_calib^n}) is unlikely to include an image at the desired pose (k̂). Therefore, the selected calibration image (I_calib) can be warped to generate an image equivalent to a silhouette of the object (I_warp) (e.g., a silhouette image or part-image) and a silhouette mask (or part-mask) of the object in the desired pose (I_warp^mask). According to an example implementation, a convolutional neural network can be used to warp the selected calibration image (I_calib).
[0086] According to an example implementation, a calibration image I_calib can be selected (e.g., from the set of calibration images {I_calib^n}). A neural network W with learnable parameters ω can warp the selected image into the desired pose (k̂) based on an object pose k̂_calib. Substantially simultaneously, a silhouette mask (or part-mask) of the object in the desired pose (k̂), denoted I_warp^mask, can be generated. Equation 3 can represent this technique.
$$I_{\text{warp}},\ I_{\text{warp}}^{\text{mask}} = W_\omega(I_{\text{calib}},\ \hat{k}_{\text{calib}},\ \hat{k}) \qquad (3)$$
[0087] where,
[0088] I_warp is a silhouette of the object in the desired (new or target) pose,
[0089] I_warp^mask is a silhouette mask of the object in the desired (new or target) pose,
[0090] W_ω is a neural network W with learnable parameters ω,
[0091] I_calib is a calibration image,
[0092] k̂_calib is an object pose, and
[0093] k̂ is a desired pose.
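The patent does not specify the architecture of W_ω. Purely as an assumed sketch of the interface, the PyTorch module below concatenates the calibration image with Gaussian heatmaps of the calibration pose and of the target pose, and predicts the warped silhouette image and its mask in one forward pass; every layer size and name is a placeholder.

```python
import torch
import torch.nn as nn

class WarpNet(nn.Module):
    """Assumed sketch of W_omega: (I_calib, k_calib heatmaps, k heatmaps)
    -> (I_warp, I_warp_mask). Not the patent's exact architecture."""

    def __init__(self, num_keypoints=17):
        super().__init__()
        in_ch = 3 + 2 * num_keypoints          # RGB + two pose encodings
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 4, 4, stride=2, padding=1),  # RGB + mask
        )

    def forward(self, i_calib, k_calib_maps, k_target_maps):
        # Inputs share the same spatial resolution (divisible by 8 here).
        x = torch.cat([i_calib, k_calib_maps, k_target_maps], dim=1)
        out = self.decoder(self.encoder(x))
        i_warp = torch.sigmoid(out[:, :3])        # warped silhouette image
        i_warp_mask = torch.sigmoid(out[:, 3:4])  # silhouette (part) mask
        return i_warp, i_warp_mask
```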
[0094] FIG. 4 illustrates still another block diagram of a signal flow according to an example implementation. The signal flow illustrated in FIG. 4 can be of an example implementation of warping a calibration image. As shown in FIG. 4, an image warper 405 block includes a warper 410 block, an image 415 block and an image mask 420 block. The warper 410 block uses the pose 325, the image 320 and the pose 225 as input. In an example implementation, the warper 410 generates the image 415 and the image mask 420 based on the pose 325, the image 320 and the pose 225.
[0095] The warper 410 can use a convolutional neural network to generate the image 415 as the silhouette of the object (e.g., a silhouette image or part-image) in the desired pose (I_warp) based on the image 320 as the calibration image (I_calib), the pose 325 as the pose of an object in the image 320 (k̂_calib), and the pose 225 as the target or desired pose (k̂). Further, the warper 410 can use the convolutional neural network to generate the image mask 420 as the silhouette mask (or part-mask) of the object in the desired pose (I_warp^mask) based on the image 320 as the calibration image (I_calib), the pose 325 as the pose of an object in the image 320 (k̂_calib), and the pose 225 as the target or desired pose (k̂).
……
……
……