Nvidia Patent | Inverse Rendering Of A Scene From A Single Image

Patent: Inverse Rendering Of A Scene From A Single Image

Publication Number: 20200160593

Publication Date: 20200521

Applicants: Nvidia

Abstract

Inverse rendering estimates physical scene attributes (e.g., reflectance, geometry, and lighting) from image(s) and is used for gaming, virtual reality, augmented reality, and robotics. An inverse rendering network (IRN) receives a single input image of a 3D scene and generates the physical scene attributes for the image. The IRN is trained by using the estimated physical scene attributes generated by the IRN to reproduce the input image and updating parameters of the IRN to reduce differences between the reproduced input image and the input image. A direct renderer and a residual appearance renderer (RAR) reproduce the input image. The RAR predicts a residual image representing complex appearance effects of the real (not synthetic) image based on features extracted from the image and the reflectance and geometry properties. The residual image represents near-field illumination, cast shadows, inter-reflections, and realistic shading that are not provided by the direct renderer.

CLAIM OF PRIORITY

[0001] This application claims the benefit of U.S. Provisional Application No. 62/768,591 (Attorney Docket No. 510888) titled “Inverse Rendering, Depth Sensing, and Estimation of 3D Layout and Objects from a Single Image,” filed Nov. 16, 2018, the entire contents of which is incorporated herein by reference.

TECHNICAL FIELD

[0002] The present disclosure relates to training a neural network model to perform inverse rendering. More specifically, inverse rendering is performed using a single input image to generate reflectance and geometry properties for the single input image.

BACKGROUND

[0003] As one of the core problems in computer vision, inverse rendering aims to estimate physical attributes (e.g., geometry, reflectance, and illumination) of a scene from photographs, with wide applications in gaming, augmented reality, virtual reality, and robotics. As a long-standing, highly ill-posed problem, inverse rendering has been studied primarily for single objects or for estimating a single scene attribute. There is a need for addressing these issues and/or other issues associated with the prior art.

SUMMARY

[0004] Inverse rendering estimates physical scene attributes (e.g., reflectance, geometry, and lighting) from image(s) and is used for gaming, virtual reality, augmented reality, and robotics. An inverse rendering network (IRN) receives a single input image of a 3D scene and generates the physical scene attributes for the image. Specifically, the IRN estimates reflectance properties (albedo), geometry properties (surface normal vectors), and an illumination map (for global distant-direct lighting). Generally, the albedo characterizes materials in the image. In an embodiment, the IRN also predicts glossiness segmentation.

[0005] The IRN is trained by using the estimated physical scene attributes generated by the IRN to reproduce the input image and updating parameters of the IRN to reduce differences between the reproduced input image and the input image. A direct renderer and a residual appearance renderer (RAR) reproduce the input image. The RAR predicts a residual image representing complex appearance effects of the real (not synthetic) image based on features extracted from the image and the reflectance and geometry properties. The residual image represents near-field illumination, cast shadows, inter-reflections, and realistic shading that are not provided by the direct renderer.

[0006] A method, computer readable medium, and system are disclosed for training a neural network model to perform inverse rendering. Reflectance properties and geometry properties extracted from an image of a three-dimensional (3D) scene are provided to a first encoder neural network that computes intrinsic features based on the reflectance properties and geometry properties. The image is processed by a second encoder neural network to produce image features. A decoder neural network computes a residual image representing complex appearance effects of the image based on the image features and the intrinsic features.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] FIG. 1A illustrates a block diagram of an inverse rendering training system, in accordance with an embodiment.

[0008] FIG. 1B illustrates a block diagram of the residual appearance renderer of FIG. 1A, in accordance with an embodiment.

[0009] FIG. 1C illustrates a flowchart of a method for computing a residual image, in accordance with an embodiment.

[0010] FIG. 1D illustrates an image, corresponding extracted properties, and a reconstructed image, in accordance with an embodiment.

[0011] FIG. 2A illustrates a flowchart of a method for training an inverse rendering system, in accordance with an embodiment.

[0012] FIG. 2B illustrates a block diagram of the inverse rendering network of FIG. 1A, in accordance with an embodiment.

[0013] FIG. 2C illustrates an image, reconstructed direct rendered image, and the combination of the reconstructed direct rendered image and the residual image, in accordance with an embodiment.

[0014] FIG. 2D illustrates an image I, the albedo A estimated by the IRN trained without the RAR, and the albedo A estimated by the IRN trained with the RAR, in accordance with an embodiment.

[0015] FIG. 2E illustrates an image annotated by humans for weak supervision, in accordance with an embodiment.

[0016] FIG. 3 illustrates a parallel processing unit, in accordance with an embodiment.

[0017] FIG. 4A illustrates a general processing cluster within the parallel processing unit of FIG. 3, in accordance with an embodiment.

[0018] FIG. 4B illustrates a memory partition unit of the parallel processing unit of FIG. 3, in accordance with an embodiment.

[0019] FIG. 5A illustrates the streaming multi-processor of FIG. 4A, in accordance with an embodiment.

[0020] FIG. 5B is a conceptual diagram of a processing system implemented using the PPU of FIG. 3, in accordance with an embodiment.

[0021] FIG. 5C illustrates an exemplary system in which the various architecture and/or functionality of the various previous embodiments may be implemented.

DETAILED DESCRIPTION

[0022] A learning based approach that jointly estimates albedo, surface normal vectors (normals), and lighting of a 3D scene from a single image is described. Albedo is a measure of the amount of light that is reflected from a surface, so that higher values indicate a highly reflective surface and lower values indicate a surface that absorbs most of the light that hits it. A normal vector for a point on a surface is perpendicular to the surface. The lighting is global or environmental lighting (intensity and color represented by an illumination map) indicating levels of brightness present in a center area of the image.

[0023] Inverse rendering has two main challenges. First, it is inherently ill-posed, especially if only a single image is given. In an embodiment, the 3D scene is an indoor scene. Conventional solutions for inverse rendering a single image focus only on a single object in the 3D scene. Second, inverse rendering of an image of a 3D scene is particularly challenging, compared to inverse rendering an image including a single object, due to the complex appearance effects (e.g., inter-reflection, cast shadows, near-field illumination and realistic shading). In contrast with existing techniques that are limited to estimating only one of the scene attributes, the described technique estimates multiple scene attributes from a single image of a 3D scene.

[0024] A major challenge in solving the problem of inverse rendering is the lack of ground-truth labels for real (not synthetic) images that are used to train a neural network model to perform inverse rendering. Although ground-truth labels are available for geometry, collected by depth sensors, reflectance and lighting are extremely difficult to measure at the large scale that is needed for training a neural network. Ground-truth labeled training datasets for inverse rendering are available for synthetic images. However, neural network models trained on synthetic images often fail to generalize well on real images. In other words, a neural network model trained to perform inverse rendering for synthetic images typically does not perform well when used to perform inverse rendering for a different domain, namely real images.

[0025] FIG. 1A illustrates a block diagram of an inverse rendering training system 100, in accordance with an embodiment. The inverse rendering training system 100 includes an inverse rendering network (IRN) 105, a residual appearance renderer (RAR) 110, a direct renderer 112, and a loss function unit 115. Although the inverse rendering training system 100 is described in the context of processing units, one or more of the IRN 105, the RAR 110, the direct renderer 112, and the loss function unit 115 may be implemented by a program, custom circuitry, or by a combination of custom circuitry and a program. For example, operations of the IRN 105 may be performed by a GPU (graphics processing unit), CPU (central processing unit), or any processor capable of extracting features.

[0026] To enable training of the IRN 105 to generalize from inverse rendering synthetic images to inverse rendering real images, the system includes the direct renderer 112 and the RAR 110. The IRN 105 receives an image (I) of a 3D scene and decomposes the image based on trainable parameters (e.g., weights), producing albedo (A), normals (N), and an illumination map (L). The albedo represents the reflectance properties and the normals represent the geometry properties of the 3D scene. In an embodiment, each of the albedo A, normals N, and illumination map L can be generated as a 2D array with a value or vector for each pixel of the input image I. In an embodiment, glossiness segmentation is also predicted by the IRN 105. The components may be used to reconstruct the input image I, producing a reconstructed image I_s. The loss function unit 115 compares the input image I to the reconstructed (resynthesized) image I_s and updates the parameters used by the IRN 105 to decompose the image.

[0027] The direct renderer 112 is a shading function that synthesizes the direct illumination contribution of the reconstructed image from the components predicted by the IRN 105. Specifically, the direct renderer 112 receives the illumination map L, the reflectance properties A, and the geometry properties N and computes a rendered image I_d that approximates the input image I. The direct renderer 112 is differentiable and does not require any trained parameters. In an embodiment, the direct renderer 112 is a closed-form shading function with no learnable (e.g., trained) parameters.

[0028] As shown in the images of FIG. 1A, the direct illumination portion of the reconstructed image synthesized by the direct renderer 112, I_d, is missing the more complex appearance effects (e.g., inter-reflection, cast shadows, near-field illumination, and realistic shading) that are included in the input image I and the reconstructed image I_s. The RAR 110 synthesizes the more complex appearance effects. The RAR 110 receives the image I, the reflectance properties A, and the geometry properties N and computes a residual image I_r that represents the complex appearance effects of the input image I. The rendered image is summed with the residual image to produce the reconstructed image I_s. In an embodiment, the loss function unit 115 computes a photometric reconstruction loss by comparing the input image and the reconstructed image.

[0029] The RAR 110 is a neural network model that, prior to being incorporated into the inverse rendering training system 100, has learned to synthesize the complex appearance effects for labeled synthetic images. In other words, in an embodiment, the RAR 110 is not trained within the inverse rendering training system 100. After being trained, the RAR 110 operates as a fixed-function differentiable function configured to produce a residual image given an image and the reflectance and geometry properties extracted from the image. The trained RAR 110 is then included in the inverse rendering training system 100 to train the IRN 105 in a self-supervised manner using unlabeled real images. In an embodiment, labeled synthetic images are used to pre-train the IRN 105 before the IRN 105 is trained within the inverse rendering training system 100, and then parameters of the IRN 105 are fine-tuned within the inverse rendering training system 100 using unlabeled real images.

[0030] The purpose of the RAR 110 is to enable self-supervised training of the IRN 105 on real images by capturing complex appearance effects that cannot be modeled by the direct renderer 112. The RAR 110, along with the direct renderer 112, reconstructs the image from the components estimated by the IRN 105. The reconstructed image can then be used to train the IRN 105 with a reconstruction loss computed by the loss function unit 115. Performance of the IRN 105 is improved compared with training only with synthetic images. Additionally, labeled real image training datasets are not necessary to train the IRN 105.

[0031] Conventional inverse rendering training systems do not include the RAR 110 and are therefore typically limited to inverse rendering images with direct illumination under distant lighting and a single material. For real images of a scene, however, the simple direct illumination renderer cannot synthesize important, complex appearance effects represented by the residual image, such as inter-reflections, cast shadows, near-field lighting, and realistic shading. The complex appearance effects provided by the RAR 110 may instead be simulated with a rendering equation via physically-based ray-tracing, which is non-differentiable. However, learning-based frameworks, such as the inverse rendering training system 100, require differentiable computations to perform back-propagation of the losses to update parameters of the IRN 105. Therefore, a system that omits the RAR 110 and replaces the direct renderer 112 with, or augments it with, the rendering equation for physically-based ray-tracing cannot be used to train the IRN 105. In contrast, the RAR 110 is a neural network model that is differentiable.

[0032] More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may be implemented, per the desires of the user. It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may be optionally incorporated with or without the exclusion of other features described.

[0033] FIG. 1B illustrates a block diagram of the RAR 110 of FIG. 1A, in accordance with an embodiment. Although the RAR 110 is described in the context of processing units, the RAR 110 may also be implemented by a program, custom circuitry, or by a combination of custom circuitry and a program. The RAR 110 includes an intrinsic feature encoder 120, an image feature encoder 125, and a residual decoder 130. In an embodiment, the combined intrinsic feature encoder 120 and residual decoder 130 is implemented as a U-Net structure. The intrinsic feature encoder 120 receives the normals and albedo and extracts latent intrinsic features. In an embodiment, the intrinsic feature encoder 120 is a convolutional encoder. The image feature encoder 125 receives the image and extracts image features from the input image; the image features are combined with the latent intrinsic features by block 128 before being processed by the residual decoder 130.

[0034] In principle, the combination of the RAR 110 and the direct renderer 112 may be considered to be an auto-encoder. In an embodiment, the RAR 110 learns to encode complex appearance features from the image into a latent subspace (D = 300 dimensions). The bottleneck of the auto-encoder architecture present in the RAR 110 forces the RAR 110 to focus only on the complex appearance features rather than on the entire image. Therefore, the RAR 110 learns to encode the non-directly-rendered part of the image to avoid paying a penalty in the reconstruction loss. In an embodiment, the RAR 110 is simpler compared with a differentiable renderer.

[0035] In an embodiment, the intrinsic feature encoder 120 is implemented as: C64(k3)-C*64(k3)-C*128(k3)-C*256(k3)-C*512(k3), and the residual decoder 130 is implemented as: CU512(k3)-CU256(k3)-CU128(k3)-CU64(k3)-Co3(k1), where CN(kS) denotes convolution layers with N S×S filters with stride 1, followed by Batch Normalization and ReLU (Rectified Linear Unit). C*N(kS) denotes convolution layers with N S×S filters with stride 2, followed by Batch Normalization and ReLU. CUN(kS) represents a bilinear up-sampling layer, followed by convolution layers with N S×S filters with stride 1, followed by Batch Normalization and ReLU. Co3(k1) consists of 3 1×1 convolution filters to produce the output. Skip-connections exist between the C*N(k3) layers of the intrinsic feature encoder 120 and the CUN(k3) layers of the residual decoder 130. The image feature encoder 125, which encodes the image features into a latent D=300 dimensional subspace, is given by: C64(k7)-C*128(k3)-C*256(k3)-C128(k1)-C64(k3)-C*32(k3)-C16(k3)-MLP(300), where CN(kS) and C*N(kS) are as defined above. MLP(300) takes the response of the previous layers and outputs a 300-dimensional feature, which is concatenated with the last layer of the intrinsic feature encoder 120 by the block 128.
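The disclosure does not spell out how block 128 combines the 300-dimensional output of MLP(300) with the spatial feature map from the intrinsic feature encoder 120. A common pattern, assumed in the sketch below, is to tile the global vector over the spatial grid and concatenate along the channel axis:

```python
import numpy as np

def fuse_features(intrinsic_feat, image_feat_vec):
    """Assumed behavior of block 128: broadcast the global image feature
    vector over the spatial grid of the intrinsic feature map and
    concatenate along the channel axis.

    intrinsic_feat: (C, H, W) output of the intrinsic feature encoder 120
    image_feat_vec: (D,)      MLP(300) output of the image feature encoder 125
    """
    d = image_feat_vec.shape[0]
    c, h, w = intrinsic_feat.shape
    # Tile the D-dimensional vector to (D, H, W) without copying data
    tiled = np.broadcast_to(image_feat_vec[:, None, None], (d, h, w))
    # Channel-wise concatenation yields a (C + D, H, W) feature map
    return np.concatenate([intrinsic_feat, tiled], axis=0)
```

With a 512-channel encoder output and the D=300 latent vector, the fused map has 812 channels at the same spatial resolution.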

[0036] FIG. 1C illustrates a flowchart of a method 135 for computing a residual image, in accordance with an embodiment. Although method 135 is described in the context of the RAR 110 and the inverse rendering training system 100, persons of ordinary skill in the art will understand that any system that performs method 135 is within the scope and spirit of embodiments of the present disclosure.

[0037] At step 140, reflectance properties and geometry properties (e.g., intrinsics) extracted from an image of a 3D scene are received. In an embodiment, only a single image of the 3D scene is received. In an embodiment, the 3D scene is an indoor scene. In an embodiment, the reflectance properties are albedo and the geometry properties are normal vectors. In an embodiment, the reflectance properties and the geometry properties are extracted from the image by the IRN 105 or another inverse renderer and are received by the RAR 110. In an embodiment, the 3D scene is a real scene captured by an image sensor and the image is not an image of a synthetic scene rendered by a processor or processors.

[0038] At step 145, intrinsic features are computed based on the reflectance properties and the geometry properties. In an embodiment, the intrinsic features are extracted by the intrinsic feature encoder 120. At step 150, the image is processed to produce image features. In an embodiment, the image features are extracted by the image feature encoder 125. At step 155, a residual image representing complex appearance effects of the image is computed based on the image features and the intrinsic features. In an embodiment, the residual image is computed by the residual decoder 130. In an embodiment, the residual image represents regions of the image where one or more of the following complex appearance effects are present: inter-reflection, cast shadows, near-field illumination, and realistic shading.

[0039] In an embodiment, the reflectance properties and the geometry properties, as well as an illumination map corresponding to the image, are processed by the direct renderer 112 to generate a rendered image corresponding to the image. In an embodiment, the rendered image is combined with the residual image to produce a reconstructed image corresponding to the image. The reconstructed image approximates the image. In an embodiment, values for corresponding pixels in the reconstructed image are compared, by the loss function unit 115, with the image to compute a loss value. In an embodiment, parameters used by the IRN 105 to generate the reflectance properties and the geometry properties are adjusted based on the loss value.

[0040] FIG. 1D illustrates an image, corresponding extracted properties, and a reconstructed image, in accordance with an embodiment. As shown in FIG. 1D, the extracted properties include the albedo, the normal, the lighting, and glossiness. For augmented reality (AR) and re-lighting applications, the illumination should be separated from the geometry. Therefore, the lighting is needed in addition to the normals. In an embodiment, the lighting is an illumination or environment map.

[0041] The glossiness, G, indicates highly reflective surfaces associated with specular highlights, such as the faucet, sink, and window. In an embodiment, the glossiness may be estimated by the IRN 105 and used by the RAR 110 to produce more accurate complex appearance effects. Glossiness also indicates areas that are very diffuse and not reflective, such as carpet or fabric upholstery. In an embodiment, glossiness is a segmentation map identifying pixels associated with different categories (i.e., matte, semi-glossy, and glossy).

[0042] Given an input image I, the IRN 105, denoted as h_d(I; Θ_d), estimates surface normals N, albedo A, and lighting L:

h_d(I; Θ_d) → {Â, N̂, L̂}   (1)

[0043] The IRN 105 may be trained with supervised learning using labeled synthetic data. The ground truth lighting is challenging to obtain, as it is an approximation of the actual surface light field. Environment maps may be used as the exterior lighting for rendering synthetic images of indoor scenes, but the environment maps cannot be directly set as L* when the virtual cameras are placed inside each of the indoor scenes. Due to occlusions, only a small fraction of the exterior lighting (e.g., through windows and open doors) is directly visible. The surface light field of each scene is mainly due to global illumination (i.e., inter-reflection) and some interior lighting. L* can be approximated by minimizing the difference between a ray-traced synthetic image I and the output I_d of the direct renderer 112, denoted by f_d( ), with ground truth albedo A* and normals N*. However, the approximation was found to be inaccurate, since f_d( ) cannot model the residual appearance present in the ray-traced image I.

[0044] Therefore, in an embodiment, a learning-based method is used to approximate the ground truth lighting L*. Specifically, a residual block-based network, h_e( ; Θ_e), is trained to predict L* from the input image I, normals N*, and albedo A*. In an embodiment, the "ground truth" lighting L̂* is approximated by the separate neural network h_e( ; Θ_e) → {L̂*}.

[0045] In an embodiment, h_e( ; Θ_e') is first trained with the images synthesized by f_d( ) with ground truth normals, albedo, and indoor lighting, I_d = f_d(A*, N*, L), where L is randomly sampled from a set of real indoor environment maps. The separate neural network learns a prior over the distribution of indoor lighting; i.e., h_e(I_d; Θ_e') → L. Next, the separate neural network h_e( ; Θ_e') is fine-tuned on the ray-traced images I by minimizing the reconstruction loss ‖I − f_d(A*, N*, L̂)‖. In this manner, the approximated ground truth of the environmental lighting, L̂ = h_e(I; Θ_e), is obtained, which can best reconstruct the ray-traced image I as modelled by f_d( ). The IRN 105 can then be trained using the ground truth training dataset including synthetic images.

[0046] To generalize from synthetic to real images, the self-supervised reconstruction loss is used to train the pre-trained IRN 105 using real images. Specifically, as shown in FIG. 1A, during self-supervised training, the direct renderer 112 and the RAR 110 are used to re-synthesize the input image from the estimations provided by the IRN 105.

[0047] The direct renderer 112, denoted by f_d( ), is a simple closed-form shading function with no learnable parameters, which synthesizes the direct illumination part I_d of the image. The RAR 110, denoted by f_r( ; Θ_r), is a trainable neural network model, which learns to synthesize the complex appearance effects I_r:

Direct Renderer: f_d(Â, N̂, L̂) → I_d   (2)

RAR: f_r(I, Â, N̂; Θ_r) → I_r.   (3)

The self-supervised reconstruction loss computed by the loss function unit 115 may be defined as ‖I − (I_d + I_r)‖_1. When glossiness is estimated by the IRN 105, the glossiness segmentation S is:

IRN-Specular: h_s(I; Θ_s) → S,   (4)

where Θ_s is a set of parameters (e.g., weights) used by the IRN 105 to estimate S and is learned during training.
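The self-supervised reconstruction loss can be sketched numerically as follows. The disclosure specifies an L1 norm but not whether it is reduced by a sum or a mean over pixels; a mean is assumed here:

```python
import numpy as np

def reconstruction_loss(image, i_direct, i_residual):
    """Self-supervised photometric loss: || I - (I_d + I_r) ||_1.

    image, i_direct, i_residual: (H, W, 3) arrays for the input image I,
    the direct renderer output I_d, and the RAR residual I_r.
    The mean (rather than sum) reduction is an assumption.
    """
    i_s = i_direct + i_residual          # reconstructed image I_s
    return np.abs(image - i_s).mean()    # L1 difference against the input
```

During training, gradients of this scalar flow back through both f_d and f_r to the IRN parameters, which is why both renderers must be differentiable.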

[0048] FIG. 2A illustrates a flowchart of a method 200 for training an inverse rendering system, in accordance with an embodiment. Although method 200 is described in the context of the inverse rendering training system 100, persons of ordinary skill in the art will understand that any system that performs method 200 is within the scope and spirit of embodiments of the present disclosure.

[0049] At step 210, the RAR 110 is trained via supervised learning using a labeled synthetic training dataset. In an embodiment, the RAR 110 is trained using an L1 image reconstruction loss. At step 215, the IRN 105 is trained via supervised learning using a labeled synthetic training dataset. The labeled synthetic training dataset for the IRN 105 includes at least ground truth albedos and normals. Ground truth lighting (e.g., an indoor environment map) may be approximated for the labeled synthetic training dataset using the separate neural network. In an embodiment, the supervised loss computed during supervised training of the IRN 105 is

L_s = λ_1‖N̂ − N*‖_1 + λ_2‖Â − A*‖_1 + λ_3‖f_d(A*, N*, L̂) − f_d(A*, N*, L̂*)‖_1,   (4)

where λ_1 = 1, λ_2 = 1, and λ_3 = 0.5.
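A minimal sketch of this supervised loss, assuming the L1 norms are averaged over pixels and that the two direct renderings (under the estimated lighting L̂ and the approximated ground truth L̂*) are precomputed and passed in:

```python
import numpy as np

def supervised_loss(n_hat, n_star, a_hat, a_star, id_hat, id_star,
                    lam1=1.0, lam2=1.0, lam3=0.5):
    """Supervised IRN loss, Eq. (4): weighted L1 terms on normals, albedo,
    and direct renderings f_d(A*, N*, L-hat) vs. f_d(A*, N*, L-hat*).

    All arguments are (H, W, 3) arrays; the mean reduction for the L1
    norm is an assumption, as is passing the renderings precomputed.
    """
    def l1(x, y):
        return np.abs(x - y).mean()
    return lam1 * l1(n_hat, n_star) + lam2 * l1(a_hat, a_star) \
         + lam3 * l1(id_hat, id_star)
```

With the stated weights, a unit error in the lighting term contributes half as much as a unit error in the normals or albedo terms.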

[0050] Learning from synthetic data alone is not sufficient for the IRN 105 to perform well on real images. Obtaining ground truth labels for inverse rendering is almost impossible for real images (especially for reflectance and illumination). Therefore, the IRN 105 may be trained using self-supervised reconstruction loss and weak supervision from sparse labels. The sparse labels for real images, when available, may be associated with either the reflectance properties or the geometry properties. At step 220, the image is processed by the IRN 105 to produce an illumination map, the reflectance properties, and the geometry properties (e.g., A, L, and N).

[0051] Previous works on faces and objects have shown success in using a self-supervised reconstruction loss for learning from unlabeled real images. Typically, scenes including a face or single object do not require estimations of complex appearance effects resulting from localized lighting and/or a variety of materials. As previously described, the reconstruction for faces and single objects is typically limited to the direct renderer f.sub.d( ), which is a simple closed-form shading function (under distant lighting) with no learnable parameters.

[0052] At step 240, the illumination map, the reflectance properties, and the geometry properties are processed by the closed-form direct renderer 112 to produce a rendered image corresponding to the image. In an embodiment, the direct renderer 112, f_d( ), may be implemented as:

I_d = f_d(Â, N̂, L̂) = Â Σ_i max(0, N̂·L̂_i),   (5)

where L̂_i corresponds to the pixels of the environment map L̂.
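Eq. (5) can be sketched as follows, under the assumption that each environment-map pixel i contributes a unit light direction and an RGB intensity; how L̂_i encodes direction and color is not detailed in this disclosure:

```python
import numpy as np

def direct_render(albedo, normals, env_dirs, env_colors):
    """Closed-form Lambertian shading, Eq. (5): I_d = A * sum_i max(0, N . L_i).

    albedo:     (H, W, 3) per-pixel reflectance A-hat
    normals:    (H, W, 3) unit surface normals N-hat
    env_dirs:   (M, 3)    unit direction of each environment-map pixel (assumed)
    env_colors: (M, 3)    RGB intensity of each environment-map pixel (assumed)
    """
    # Cosine term for every pixel/light pair: (H, W, M)
    cos = np.einsum("hwc,mc->hwm", normals, env_dirs)
    cos = np.maximum(cos, 0.0)                           # clamp back-facing lights
    shading = np.einsum("hwm,mc->hwc", cos, env_colors)  # sum over lights
    return albedo * shading                              # modulate by albedo
```

Because the expression is a clamped sum of products, it is differentiable almost everywhere with respect to Â, N̂, and L̂, which is what allows the reconstruction loss to be back-propagated through it.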

[0053] While using f_d( ) to compute the reconstruction loss may work well for images of faces or single objects with homogeneous material, using f_d( ) fails for inverse rendering of an image of a 3D scene, particularly an indoor scene or a scene with multiple objects and complex appearance effects. Therefore, the RAR 110 is included in the inverse rendering training system 100 to estimate the residual image representing the complex appearance effects. Steps 140, 145, 150, and 155 are performed by the RAR 110, as previously described in conjunction with FIG. 1C, to compute a residual image representing the complex appearance effects of the image.

[0054] At step 245, the rendered image is combined with the residual image to produce a reconstructed image corresponding to the image. At step 250, the loss function unit 115 compares values for corresponding pixels in the reconstructed image to the image to compute a loss value. At step 255, parameters of the IRN 105 are adjusted based on the loss value.

[0055] FIG. 2B illustrates a block diagram of the IRN 105 of FIG. 1A, in accordance with an embodiment. Although the IRN 105 is described in the context of processing units, the IRN 105 may also be implemented by a program, custom circuitry, or by a combination of custom circuitry and a program. The IRN 105 includes an image feature encoder 225, a residual decoder 230, and a residual decoder 235. In an embodiment, the image feature encoder 225 is a convolutional encoder. In an embodiment, the residual decoders 230 and 235 are convolutional decoders.

[0056] In an embodiment, the input to the IRN 105 is an image of spatial resolution 240×320, and the output is an albedo and normal map of the same spatial resolution along with an 18×36 resolution environment map. In an embodiment, the image feature encoder 225 architecture is: C64(k7)-C*128(k3)-C*256(k3), where CN(kS) denotes convolution layers with N S×S filters with stride 1, followed by Batch Normalization and ReLU, and C*N(kS) denotes convolution layers with N S×S filters with stride 2, followed by Batch Normalization and ReLU. The output of the image feature encoder 225 is a blob (e.g., feature map) of spatial resolution 256×60×80.
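As a sanity check on the stated resolutions, each stride-2 (C*) layer halves the spatial resolution, assuming resolution-preserving padding on the stride-1 layers; with the two stride-2 layers above, a 240×320 input indeed yields the 60×80 feature map:

```python
def downsampled_size(h, w, stride2_layers):
    """Spatial resolution after a stack of convolutions, assuming
    stride-1 layers preserve H x W ('same' padding) and each
    stride-2 layer halves both dimensions."""
    for _ in range(stride2_layers):
        h, w = h // 2, w // 2
    return h, w
```

The same arithmetic applies to the RAR's image feature encoder 125, whose three stride-2 layers reduce the input by a factor of 8 in each spatial dimension before MLP(300).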

[0057] In an embodiment, blocks 222 and 227 each include 9 Residual Blocks (ResBLKs), which operate at a spatial resolution of 256×60×80. Each ResBLK consists of Conv256(k3)-BN-ReLU-Conv256(k3)-BN, where ConvN(kS) and BN denote convolution layers with N S×S filters of stride 1 and Batch Normalization, respectively. Note that the weights used by blocks 222 and 227 are not shared because the block 222 is trained to estimate normals and the block 227 is trained to estimate albedo.

[0058] In an embodiment, the residual decoder 230 estimates the normals using the following architecture: CD*128(k3)-CD*64(k3)-Co3(k7), where CD*N(kS) denotes Transposed Convolution layers with N S×S filters with stride 2, followed by Batch Normalization and ReLU, and CN(kS) denotes convolution layers with N S×S filters with stride 1, followed by Batch Normalization and ReLU. The last layer, Co3(k7), consists of only convolution layers of 3 7×7 filters, followed by a Tanh layer. In an embodiment, the residual decoder 235 estimates the albedo using the same architecture as the residual decoder 230 with separate weights.

[0059] The outputs of the image feature encoder 225, the block 222, and the block 227 are concatenated along the channel dimension to produce a blob of spatial resolution 768.times.60.times.80 that is input to block 228. In an embodiment, the block 228 estimates the illumination (environment) map using the following architecture: C256(k1)-C*256(k3)-C*128(k3)-C*3(k3)-BU(18,36), where CN(kS) denotes convolution layers with N S.times.S filters with stride 1, followed by Batch Normalization and ReLU, C*N(kS) denotes convolution layers with N S.times.S filters with stride 2, followed by Batch Normalization and ReLU, and BU(18,36) up-samples the response to produce an 18.times.36.times.3 resolution environment map.
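The illumination head can be sketched as below. The bilinear up-sampling mode and the padding are assumptions; the channel-wise concatenation of three 256-channel blobs accounts for the stated 768 channels.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k, stride):
    """Convolution -> Batch Normalization -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# C256(k1)-C*256(k3)-C*128(k3)-C*3(k3)-BU(18,36)
illumination_head = nn.Sequential(
    conv_bn_relu(768, 256, 1, 1),
    conv_bn_relu(256, 256, 3, 2),
    conv_bn_relu(256, 128, 3, 2),
    conv_bn_relu(128, 3, 3, 2),
    nn.Upsample(size=(18, 36), mode='bilinear', align_corners=False),
)

# Channel-wise concatenation of the encoder, normal-branch, and albedo-branch features.
enc = torch.randn(1, 256, 60, 80)
nrm = torch.randn(1, 256, 60, 80)
alb = torch.randn(1, 256, 60, 80)
blob = torch.cat([enc, nrm, alb], dim=1)   # 768 channels at 60x80
env_map = illumination_head(blob)
print(env_map.shape)   # torch.Size([1, 3, 18, 36])
```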

[0060] Intrinsic image decomposition is a sub-problem of inverse rendering, where a single image is decomposed into albedo and shading. In contrast with the inverse rendering performed by the IRN 105, conventional intrinsic image decomposition methods do not explicitly recover geometry or illumination but rather combine them together as shading. Applications such as AR and virtual reality (VR) require geometry data, and the shading data that is produced by intrinsic image decomposition does not provide the geometric information needed by the AR and VR applications. The separate normals and albedo data estimated by the IRN 105 are suitable for a wide range of applications in AR/VR. Example applications include image editing, such as inserting an object into a scene, and using the estimates of intrinsic attributes for navigation (vehicles) or grasping (robotics) to improve accuracy.

[0061] FIG. 2C illustrates an image I, reconstructed direct rendered image I.sub.d, and the combination of the reconstructed direct rendered image and the residual image I.sub.d+I.sub.r=I.sub.s, in accordance with an embodiment. The reconstructed direct rendered image does not include the complex appearance effect of the localized lighting. The complex visual effect of the localized lighting producing brightness in the center area of the image I.sub.s is provided by the residual image I.sub.r.

[0062] FIG. 2D illustrates an image I, the albedo A estimated by the IRN 105 trained without the RAR 110, and the albedo A estimated by the IRN 105 trained with the RAR 110, in accordance with an embodiment. For the intrinsic attribute of albedo, A, to be accurate, which is necessary for using the albedo for image editing, navigation, and grasping applications, A should represent the material content of the scene and should not include complex appearance effects, such as specular highlights and shadows. As previously described, the goal of the RAR 110 is to provide the complex appearance effects.

[0063] As shown in FIG. 2D, a specular highlight from exterior light appears on the floor in the image I. When the IRN 105 is trained without the RAR 110, the IRN 105 learns to include specular highlights in the albedo because the direct renderer 112 cannot produce the specular highlights and the loss computation unit 115 adjusts the parameters of the IRN 105 to reduce differences between I and I.sub.d. Therefore, when I.sub.d is missing the specular highlights, the parameters of the IRN 105 are adjusted to insert the specular highlights. In contrast, when the IRN 105 is trained with the RAR 110, the specular highlights are not included in the albedo because the RAR 110 produces the specular highlights in I.sub.r and the loss computation unit 115 does not adjust the parameters of the IRN 105 to cause the IRN 105 to include the specular highlights in the albedo.

[0064] To ensure that the RAR 110 is trained to capture only the residual appearances and not to correct the artifacts of the direct rendered image due to faulty normals, albedo, and/or lighting estimation of the IRN 105, the RAR 110 is fixed when used in the inverse rendering training system 100 to train the IRN 105. In an embodiment, the RAR 110 is trained on only synthetic data with ground-truth normals and albedo, before being used in the inverse rendering training system 100, so that the RAR 110 learns to correctly predict the residual appearances when the direct renderer reconstruction is accurate. Training the RAR 110 separately enables the RAR 110 to learn to accurately estimate the complex appearance effects (e.g., inter-reflection, cast shadows, near-field illumination, and realistic shading) based on I, A, and N.

[0065] In addition to training the IRN 105 in a self-supervised manner using real images, the IRN 105 may be trained in a pseudo-supervised manner. In an embodiment, sparse relative reflectance judgements from humans are used as a weak form of supervision to disambiguate reflectance from shading. FIG. 2E illustrates an image annotated by humans for weak supervision, in accordance with an embodiment. The image annotated with pair-wise judgements 260 may be used for pseudo-supervised training of the IRN 105. In an embodiment, pair-wise relative reflectance judgments may be used as a form of supervision over albedo. Using such weak supervision can substantially improve performance on real images.

[0066] For any two points R.sub.1 and R.sub.2 on an image, a weighted confidence score classifies R.sub.1 to be the same as, brighter than, or darker than R.sub.2. The labels are used to construct a hinge loss for sparse supervision. Specifically, if R.sub.1 is predicted to be darker than R.sub.2 with confidence w.sub.t, a loss function w.sub.t max(1+.delta.-R.sub.2/R.sub.1,0) is used. If R.sub.1 and R.sub.2 are predicted to have similar reflectance, a loss function w.sub.t[max(R.sub.2/R.sub.1-1-.delta.,0)+max(R.sub.1/R.sub.2-1-.delta.,0)- ] is used.
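The pair-wise hinge loss above can be sketched in plain Python as follows. The margin value delta=0.12 and the "same"-case symmetric hinge are illustrative assumptions; the text specifies the form of the loss but not the margin value.

```python
def pairwise_reflectance_loss(r1, r2, label, w, delta=0.12):
    """Hinge loss on the ratio of two predicted albedo intensities R1, R2.

    label:  'darker' means R1 is judged darker than R2;
            'same' means similar reflectance.
    w:      annotator confidence weight w_t.
    delta:  hinge margin (0.12 is an illustrative assumption).
    """
    ratio = r2 / r1
    if label == 'darker':
        # Penalize unless R2 exceeds R1 by at least the margin.
        return w * max(1.0 + delta - ratio, 0.0)
    if label == 'same':
        # Symmetric hinge: penalize if the ratio leaves [1/(1+delta), 1+delta].
        return w * (max(ratio - 1.0 - delta, 0.0)
                    + max(1.0 / ratio - 1.0 - delta, 0.0))
    raise ValueError(label)

# R2 is clearly brighter than R1, so a 'darker' judgement incurs no loss.
loss = pairwise_reflectance_loss(0.4, 0.8, 'darker', w=1.0)
print(loss)   # 0.0
```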

[0067] The IRN 105 may be trained on real data with the following losses: (i) Pseudo-supervision loss over albedo (L.sub.a), normal (L.sub.n) and lighting (L.sub.e), (ii) Photometric reconstruction loss with the RAR 110 (L.sub.u), and (iii) Pair-wise weak supervision (L.sub.w). The net loss function is defined as:

L=0.5*L.sub.a+0.5*L.sub.n+0.1*L.sub.e+L.sub.u+30*L.sub.w (6)

[0068] The IRN 105 may also be trained using a dataset with weak supervision over normals using the following losses: (i) Pseudo-supervision loss over albedo (L.sub.a) and lighting (L.sub.e), (ii) Photometric reconstruction loss with the RAR 110 (L.sub.u), and (iii) Supervision (L.sub.w) over normals. The net loss function is then defined as:

L=0.2*L.sub.a+0.05*L.sub.e+L.sub.u+20*L.sub.w (7)
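Equations (6) and (7) are simple weighted sums of the individual loss terms, sketched below; the function names are illustrative, not from the text.

```python
def net_loss_real(L_a, L_n, L_e, L_u, L_w):
    """Equation (6): net loss for training on real data with
    pair-wise weak supervision over albedo."""
    return 0.5 * L_a + 0.5 * L_n + 0.1 * L_e + L_u + 30.0 * L_w

def net_loss_weak_normals(L_a, L_e, L_u, L_w):
    """Equation (7): net loss when weak supervision over normals is available."""
    return 0.2 * L_a + 0.05 * L_e + L_u + 20.0 * L_w

# With unit losses, the weights sum directly.
print(net_loss_real(1.0, 1.0, 1.0, 1.0, 1.0))        # 32.1
print(net_loss_weak_normals(1.0, 1.0, 1.0, 1.0))     # 21.25
```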

[0069] The disclosed technique for training the IRN 105 for inverse rendering generalizes across different datasets. The IRN 105 and the RAR 110 may be trained with synthetic data from one domain, and the IRN 105 may then be trained via self-supervision on real data from a different domain. For example, the IRN 105 may be trained to inverse render images of a scene within an office building and then trained to inverse render images of a scene within a cabin.

[0070] Generalization results from jointly reasoning about all components of the scene. Jointly predicting the intrinsic attributes by combining supervised training on synthetic data and self-supervised training on real data using the RAR 110 improves the albedo and normal estimates across different datasets.

[0071] The predicted normals are significantly improved when the RAR 110 is used to train the IRN 105. As for the albedo, using relative reflectance judgments without the RAR 110 produces very low contrast albedo. Conversely, training the IRN 105 with only the RAR 110, without any weak supervision, often fails to produce consistent albedo across large objects like walls, floor, ceilings, etc. Thus, the predicted albedos are improved when the RAR 110 is used to train the IRN 105 and the predicted albedos are further improved when relative reflectance judgements are used during the training.

[0072] Furthermore, training without RAR 110 and weak supervision produces poor albedo estimations which contain the complex appearance effects like cast shadows, inter-reflections, highlights, etc., as the reconstruction loss with direct renderer 112 alone cannot model the complex appearance effects. When the albedo is polluted with the complex appearance effects, the albedo is not suitable for use in image editing, guidance, and grasping applications.

[0073] The RAR 110 can synthesize complex appearance effects such as inter-reflection, cast shadows, near-field illumination, and realistic shading. In the absence of the RAR 110, the reconstruction loss used for self-supervised training cannot capture complex appearance effects, and the estimates of scene attributes are less accurate. The RAR 110 is important for employing the self-supervised reconstruction loss to learn inverse rendering on real images.

[0074] The inverse rendering training system 100 performs inverse rendering for an entire 3D scene rather than single objects in an image. The training technique offers several key benefits: increases the application scenarios for AR and image-based graphics, improves the quality and realism in the estimation of the intrinsics, and effectively removes artifacts from the estimations.
