Intel Patent | Point cloud based 3d semantic segmentation
Patent: Point cloud based 3d semantic segmentation
Publication Number: 20210303912
Publication Date: 20210930
Applicant: Intel
Assignee: Intel Corporation
Abstract
System and techniques are provided for three-dimension (3D) semantic segmentation. A device for 3D semantic segmentation includes: an interface, to obtain a point cloud data set for a time-ordered sequence of 3D frames, the 3D frames including a current 3D frame and one or more historical 3D frames previous to the current 3D frame; and processing circuitry, to: invoke a first artificial neural network (ANN) to estimate a 3D scene flow field for each of the one or more historical 3D frames by taking the current 3D frame as a reference frame; and invoke a second ANN to: produce an aggregated feature map, based on the reference frame and the estimated 3D scene flow field for each of the one or more historical 3D frames; and perform the 3D semantic segmentation based on the aggregated feature map.
Claims
-
A device for three-dimension (3D) semantic segmentation, comprising: an interface, to obtain a point cloud data set for a time-ordered sequence of 3D frames, the 3D frames including a current 3D frame and one or more historical 3D frames previous to the current 3D frame; and processing circuitry, to: invoke a first artificial neural network (ANN) to estimate a 3D scene flow field for each of the one or more historical 3D frames by taking the current 3D frame as a reference frame; and invoke a second ANN to: produce an aggregated feature map, based on the reference frame and the estimated 3D scene flow field for each of the one or more historical 3D frames; and perform the 3D semantic segmentation based on the aggregated feature map.
-
The device of claim 1, wherein the first ANN includes a scene flow estimation sub-network for each of the one or more historical 3D frames.
-
The device of claim 1, wherein the second ANN includes a feature extract sub-network for each of the one or more historical 3D frames and the current 3D frame, to generate an origin feature map for each of the one or more historical 3D frames and the current 3D frame.
-
The device of claim 3, wherein the second ANN includes an alignment layer for each of the one or more historical 3D frames, to align the origin feature map with the 3D scene flow field for each of the one or more historical 3D frames.
-
The device of claim 3, wherein the first ANN includes an alignment layer for each of the one or more historical 3D frames, to align the 3D scene flow field with the origin feature map for each of the one or more historical 3D frames.
-
The device of claim 3, wherein the second ANN includes a feature warping layer for each of the one or more historical 3D frames, to obtain a warped feature map for each of the one or more historical 3D frames, by warping the original feature map for each of the one or more historical 3D frames based on the 3D scene flow field for each of the one or more historical 3D frames.
-
The device of claim 6, wherein the second ANN includes an alignment layer for the current 3D frame, to align the origin feature map of the current 3D frame with the warped feature map for each of the one or more historical 3D frames.
-
The device of claim 6, wherein the second ANN includes a feature aggregation layer, to aggregate the warped feature map for each of the one or more historical 3D frames with the original feature map of the current 3D frame to produce the aggregated feature map.
-
The device of claim 8, wherein the feature warping layer is to produce an adaptive weight along with the warped feature map for each of the one or more historical 3D frames; and the feature aggregation layer is to aggregate a result of the warped feature map multiplying by the adaptive weight for each of the one or more historical 3D frames, with the original feature map of the current 3D frame to produce the aggregated feature map.
-
The device of claim 9, wherein the adaptive weight for the warped feature map for each of the one or more historical 3D frames is determined by a combination of a degree of proximity of the corresponding historical 3D frame to the reference frame and a degree of occlusion of an object of interest in the corresponding historical 3D frame.
-
The device of claim 1, wherein the second ANN is configured to produce the aggregated feature map by: predicting a displacement of each point in point cloud data for the one or more historical 3D frames, based on the estimated 3D scene flow field for each of the one or more historical 3D frames; obtaining a warped 3D frame for each of the one or more historical 3D frames based on the predicted displacement of each point in the point cloud data for the one or more historical 3D frames and an initial position of the point in the corresponding historical 3D frame; obtaining a warped feature map for each of the one or more historical 3D frames from the warped 3D frame for the historical 3D frame; and aggregating the warped feature map for each of the one or more historical 3D frames to an original feature map of the current 3D frame.
-
The device of claim 1, wherein the second ANN includes a 3D semantic segmentation sub-network to perform the 3D semantic segmentation based on the aggregated feature map.
-
The device of claim 1, wherein the first ANN and the second ANN are integrated into a single ANN.
-
An apparatus for three-dimension (3D) semantic segmentation, comprising: means for obtaining a point cloud data set for a time-ordered sequence of 3D frames, the 3D frames including a current 3D frame and one or more historical 3D frames previous to the current 3D frame; and means for invoking a first artificial neural network (ANN) to estimate a 3D scene flow field for each of the one or more historical 3D frames by taking the current 3D frame as a reference frame; and means for invoking a second ANN to: produce an aggregated feature map, based on the reference frame and the estimated 3D scene flow field for each of the one or more historical 3D frames, and perform the 3D semantic segmentation based on the aggregated feature map.
-
The apparatus of claim 14, wherein the first ANN includes a scene flow estimation sub-network for each of the one or more historical 3D frames.
-
The apparatus of claim 14, wherein the second ANN includes a feature extract sub-network for each of the one or more historical 3D frames and the current 3D frame, to generate an origin feature map for each of the one or more historical 3D frames and the current 3D frame.
-
The apparatus of claim 16, wherein the second ANN includes an alignment layer for each of the one or more historical 3D frames, to align the origin feature map with the 3D scene flow field for each of the one or more historical 3D frames.
-
The apparatus of claim 16, wherein the first ANN includes an alignment layer for each of the one or more historical 3D frames, to align the 3D scene flow field with the origin feature map for each of the one or more historical 3D frames.
-
The apparatus of claim 16, wherein the second ANN includes a feature warping layer for each of the one or more historical 3D frames, to obtain a warped feature map for each of the one or more historical 3D frames, by warping the original feature map for each of the one or more historical 3D frames based on the 3D scene flow field for each of the one or more historical 3D frames.
-
The apparatus of claim 19, wherein the second ANN includes an alignment layer for the current 3D frame, to align the origin feature map of the current 3D frame with the warped feature map for each of the one or more historical 3D frames.
-
The apparatus of claim 19, wherein the second ANN includes a feature aggregation layer, to aggregate the warped feature map for each of the one or more historical 3D frames with the original feature map of the current 3D frame to produce the aggregated feature map.
-
The apparatus of claim 21, wherein the feature warping layer is to produce an adaptive weight along with the warped feature map for each of the one or more historical 3D frames; and the feature aggregation layer is to aggregate a result of the warped feature map multiplying by the adaptive weight for each of the one or more historical 3D frames, with the original feature map of the current 3D frame to produce the aggregated feature map.
-
The apparatus of claim 21, wherein the adaptive weight for the warped feature map for each of the one or more historical 3D frames is determined by a combination of a degree of proximity of the corresponding historical 3D frame to the reference frame and a degree of occlusion of an object of interest in the corresponding historical 3D frame.
-
A machine-readable storage medium having instructions stored thereon, which when executed by a processor, cause the processor to perform operations for three-dimension (3D) semantic segmentation, the operations comprising: obtaining a point cloud data set for a time-ordered sequence of 3D frames, the 3D frames including a current 3D frame and one or more historical 3D frames previous to the current 3D frame; and invoking a first artificial neural network (ANN) to estimate a 3D scene flow field for each of the one or more historical 3D frames by taking the current 3D frame as a reference frame; and invoking a second ANN to: produce an aggregated feature map, based on the reference frame and the estimated 3D scene flow field for each of the one or more historical 3D frames, and perform the 3D semantic segmentation based on the aggregated feature map.
-
A machine-readable storage medium having instructions stored thereon, which when executed by a processor, cause the processor to perform operations for training a neural network for three-dimension (3D) semantic segmentation, the operations comprising: obtaining a point cloud data set for a time-ordered sequence of 3D frames, the 3D frames including a current 3D frame and one or more historical 3D frames previous to the current 3D frame; randomly selecting a historical 3D frame from the one or more historical 3D frames; producing a test result based on forward-propagating processing of the selected historical 3D frame through a training neural network; applying a loss function to evaluate the test result to produce a loss value; reducing the loss value by refining trainable parameters of the training neural network, based on backpropagation of the loss function through the training neural network; and supplying the refined trainable parameters to configure the neural network for 3D semantic segmentation.
-
The machine-readable storage medium of claim 25, wherein the test result includes an outcome of 3D semantic segmentation based on an aggregated feature map.
Description
TECHNICAL FIELD
[0001] Embodiments described herein generally relate to computer vision techniques, and more specifically to a point cloud based three-dimension (3D) semantic segmentation.
BACKGROUND
[0002] Autonomous or semi-autonomous automotive technologies, often referred to as “self-driving” or “assisted-driving” operation in automobiles, are undergoing rapid development and deployment in commercial- and consumer-grade vehicles. These systems use an array of sensors to continuously observe the vehicle’s motion and surroundings. One common sensor technology is Light Detection and Ranging (LiDAR). LiDAR is a system that combines laser, global positioning system (GPS), and inertial navigation system (INS) technologies to obtain point clouds and generate an accurate ground Digital Elevation Model (DEM).
[0003] In the autonomous or semi-autonomous automotive technologies, semantic segmentation may be used to provide information about other vehicles, pedestrians and other objects on a road, as well as information about lane markers, curbs, and other relevant items. Accurate semantic segmentation plays a significant role in safety of autonomous driving.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. Like numerals having different letter suffixes may represent different instances of similar components. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed in the present document.
[0005] FIG. 1 shows an example situation of part occlusion obstacles in point cloud data, according to an embodiment of the disclosure.
[0006] FIG. 2 shows an example situation illustrating a relationship between a two-dimension (2D) optical flow and corresponding 3D scene flow, according to an embodiment of the disclosure.
[0007] FIG. 3 shows an example system for point cloud based 3D semantic segmentation, along with an illustrative process flow, according to an embodiment of the disclosure.
[0008] FIG. 4 shows a vehicle with a LiDAR mounted thereon, according to an embodiment of the disclosure.
[0009] FIG. 5 shows an illustrative process flow of using the example system of FIG. 3 to perform 3D semantic segmentation based on the point cloud data of the example situation of FIG. 1.
[0010] FIG. 6 shows a schematic diagram of a neural network for point cloud based 3D semantic segmentation, according to an embodiment of the disclosure.
[0011] FIG. 7 shows an example workflow of the neural network of FIG. 6.
[0012] FIG. 8 shows a schematic diagram of a training neural network for point cloud based 3D semantic segmentation, according to an embodiment of the disclosure.
[0013] FIG. 9 is a flow diagram illustrating an example of a method for point cloud based 3D semantic segmentation, according to an embodiment of the disclosure.
[0014] FIG. 10 is a flow diagram illustrating an example of a method for training a neural network for point cloud based 3D semantic segmentation, according to an embodiment of the disclosure.
[0015] FIG. 11 is a flow diagram illustrating an example of a method for point cloud based 3D semantic segmentation, according to an embodiment of the disclosure.
[0016] FIG. 12 is a block diagram illustrating an example of a machine upon which one or more embodiments may be implemented.
[0017] FIG. 13 is a diagram illustrating an exemplary hardware and software architecture of a computing device, according to an embodiment of the disclosure.
[0018] FIG. 14 is a block diagram illustrating processing devices that may be used, according to an embodiment of the disclosure.
[0019] FIG. 15 is a block diagram illustrating example components of a central processing unit (CPU), according to an embodiment of the disclosure.
DETAILED DESCRIPTION
[0020] Various aspects of the illustrative embodiments will be described using terms commonly employed by those skilled in the art to convey the substance of the disclosure to others skilled in the art. However, it will be apparent to those skilled in the art that many alternate embodiments may be practiced using portions of the described aspects. For purposes of explanation, specific numbers, materials, and configurations are set forth in order to provide a thorough understanding of the illustrative embodiments. However, it will be apparent to those skilled in the art that alternate embodiments may be practiced without the specific details. In other instances, well-known features may have been omitted or simplified in order to avoid obscuring the illustrative embodiments.
[0021] Further, various operations will be described as multiple discrete operations, in turn, in a manner that is most helpful in understanding the illustrative embodiments; however, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations need not be performed in the order of presentation.
[0022] The phrases “in an embodiment,” “in one embodiment,” and “in some embodiments” are used repeatedly herein. These phrases generally do not refer to the same embodiment; however, they may. The terms “comprising,” “having,” and “including” are synonymous, unless the context dictates otherwise. The phrases “A or B” and “A/B” mean “(A), (B), or (A and B).”
[0023] A variety of semantic segmentation techniques may be based on data provided by a variety of sensors. When LiDAR is used to observe the surroundings of a vehicle, the data are provided as point cloud data, which may also be referred to as a LiDAR point cloud. Semantic segmentation based on point cloud data is one of the most important functions in the perception module of an autonomous driving system. One common approach to semantic segmentation is to reduce the dimensionality of the point cloud data to 2D and then perform 2D semantic segmentation. Another approach performs semantic segmentation based on the point cloud data of the current frame only. However, these approaches focus on single-frame segmentation and do not take the point cloud data of historical frames into consideration, which makes them susceptible to LiDAR data noise. These approaches are especially ineffective when dealing with partly occluded obstacles, which are common in point cloud data. FIG. 1 shows an example situation 100 of part occlusion obstacles in point cloud data, according to an embodiment of the disclosure. As can be seen in FIG. 1, a vehicle indicated by an arrow 110 is obscured by another vehicle indicated by an arrow 120. The situation of FIG. 1 is an example and is not meant to limit the present disclosure. There may be other situations; for example, a pedestrian may be obscured by a vehicle, a tree, or other objects.
[0024] Embodiments of the present application provide architectures to perform 3D semantic segmentation based on a point cloud data set for a time-ordered sequence of 3D frames, including a current 3D frame and one or more historical 3D frames previous to the current 3D frame. The point cloud data set for the time-ordered sequence of 3D frames may be captured by a LiDAR mounted on a vehicle.
[0025] As used herein, the phrase “current 3D frame” means a 3D frame that is currently of interest or for which the 3D semantic segmentation is to be performed; and the phrase “historical 3D frame” means a 3D frame that occurred before the current 3D frame.
[0026] As used herein, the term “3D scene flow” means a 3D motion field of points in the scene. A “3D scene flow” used herein may be interchangeable with a “3D optical flow”, “3D flow”, “range flow”, “scene flow”, etc.
[0027] As used herein, the term “2D optical flow” means a perspective projection of corresponding 3D scene flow. FIG. 2 shows an example situation illustrating a relationship between a 2D optical flow and corresponding 3D scene flow, according to an embodiment of the disclosure. As shown in FIG. 2, {right arrow over (V)} is a 3D velocity of a 3D point {right arrow over (P)}(t)=(X, Y, Z), and {right arrow over (v)}=(u, v) is a 2D image of {right arrow over (V)}, i.e., {right arrow over (v)} is a perspective projection of {right arrow over (V)}. When {right arrow over (P)}(t) moves with a displacement {right arrow over (V)}.delta.t to {right arrow over (P)}(t’) from time t to time t’, its image {right arrow over (Y)}(t)=(x, y, f) moves to {right arrow over (Y)}(t’)=(x’, y’, f) with a displacement of {right arrow over (v)}.delta.t, where .delta.t=t’-t and f is a focal length of a sensor for imaging. In the situation, {right arrow over (v)} is known as an image velocity or a 2D optical flow
[0028] As used herein, the term “FlowNet3D” refers to an end-to-end deep learning architecture for 3D scene flow estimation.
[0029] As used herein, the term “EPE loss function” refers to an end point error (EPE) loss. The end point error measures the average Euclidean distance (i.e., L2 distance) between an estimated flow vector (in either its 2D or 3D version) and the ground truth flow vector. The EPE loss function is used to train an artificial neural network (ANN), such as the well-known FlowNet/FlowNet2.0 and FlowNet3D.
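As a brief illustration (not part of the disclosure), the end point error can be computed as the mean L2 distance between predicted and ground-truth flow vectors; the sketch below assumes 3D flow vectors stored as NumPy arrays.

```python
import numpy as np

# Minimal sketch of the end point error (EPE): the mean L2 distance between
# estimated and ground-truth flow vectors (3D case shown).
def epe(flow_pred, flow_gt):
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

pred = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
gt   = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(epe(pred, gt))   # 0.5
```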
[0030] Next, for simplicity, a 2D RGB image is used to explain feature warping. For example, a frame denoted f_0 includes a pixel p_0(x_0, y_0). The pixel p_0(x_0, y_0) has a new position p_1(x_1, y_1) in the frame immediately subsequent to the frame f_0, which is denoted f_1. A flow estimation network (for example, the FlowNet or FlowNet2.0) may be used to estimate a velocity (u, v) of p_0 in the frame f_0. The new position p_1 in the frame f_1 may then be estimated by (x_0, y_0) + (u, v)·δt = (x_1, y_1), where δt is the time difference between the two frames. The above operation may be performed on all pixels in the frame f_0 so as to obtain a predicted frame, which is denoted f_1′. The process from frame f_0 to f_1′ is known as raw image warping. For a feature map produced by a deep learning algorithm, the process is similar. It is assumed that a flow field M_{i→j} = F(I_i, I_j) is produced by a flow network F (e.g., the FlowNet) based on a reference frame I_i and a frame I_j previous to the reference frame. Feature maps associated with the frame I_j may be warped to the reference frame I_i by a warping function, according to the flow field M_{i→j}. The warping function is defined as f_{j→i} = W(f_j, M_{i→j}) = W(f_j, F(I_i, I_j)), where W(·) is a bilinear warping function applied on all locations for each channel in the feature maps, f_{j→i} denotes a feature map warped from the frame I_j to the frame I_i, and f_j denotes a feature map of the frame I_j without any feature warping operation.
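The following Python sketch illustrates the bilinear feature warping just described for the 2D case. It assumes a backward-warping convention in which the flow field gives, for each location of the reference frame I_i, the displacement to the corresponding location in frame I_j; the array names and the use of SciPy's map_coordinates are illustrative choices, not the patent's implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_feature_map(f_j, flow_ij):
    """Bilinearly warp feature map f_j (H, W, C) toward the reference frame.

    flow_ij has shape (H, W, 2), holding a (dy, dx) displacement per
    reference-frame location (an assumed convention for this sketch).
    """
    H, W, C = f_j.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sample_y = ys + flow_ij[..., 0]          # where to read in frame I_j
    sample_x = xs + flow_ij[..., 1]
    warped = np.stack(
        [map_coordinates(f_j[..., c], [sample_y, sample_x], order=1, mode="nearest")
         for c in range(C)],
        axis=-1,
    )
    return warped                             # f_{j->i}, shape (H, W, C)

f_j = np.random.rand(4, 4, 8)                 # feature map of frame I_j
flow = np.zeros((4, 4, 2))                    # zero flow -> identity warp
assert np.allclose(warp_feature_map(f_j, flow), f_j)
```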
[0031] As used herein, the term “feature aggregation” refers to combining feature maps of a reference frame and warped feature maps of one or more frames neighboring the reference frame (including historical frames or future frames) into a smaller set of feature maps. Generally speaking, the one or more frames neighboring the reference frame may be aggregated with the reference frame to obtain an aggregated frame.
[0032] As used herein, the term “semantic segmentation” may include “2D semantic segmentation” and “3D semantic segmentation”. 2D semantic segmentation refers to a process of linking each pixel in an image to a class label, such as a person, a vehicle, a bike, a tree, a curb, or a road surface, for example. 3D semantic segmentation is similar to 2D semantic segmentation, except that the operation object is a red-green-blue-depth (RGBD) image or a point cloud set instead of a 2D image.
[0033] As used herein, the term “artificial neural network (ANN)” is a collective term for neural networks and is used interchangeably with neural network, deep neural network (DNN), deep learning network, and the like.
[0034] As used herein, the term “fully convolutional network (FCN)” refers to a well-known end-to-end 2D semantic segmentation deep learning architecture, and the term “U-net” refers to another well-known end-to-end 2D semantic segmentation deep learning architecture.
[0035] As used herein, the term “PointNet” refers to an end-to-end 3D semantic segmentation deep learning architecture.
[0036] As used herein, the term “Softmax” refers to a loss function that takes a vector of K real numbers as an input, and normalizes the vector into a probability distribution, which consists of K probabilities proportional to exponentials of the K real numbers. The Softmax loss function is used to train a 2D or 3D semantic segmentation network herein.
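For illustration only, the following minimal NumPy sketch shows the Softmax normalization described above; the numerically stable max-subtraction is a common implementation detail and is not specified by the disclosure.

```python
import numpy as np

def softmax(scores):
    """Map a vector of K real numbers to a probability distribution."""
    e = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())                 # probabilities sum to 1.0
```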
[0037] FIG. 3 shows an example system 300 for point cloud based 3D semantic segmentation, along with an illustrative process flow, according to an embodiment of the disclosure. As illustrated, the system is composed of a number of subsystems, components, circuits, modules, or engines, which for the sake of brevity and consistency are termed engines, although it will be understood that these terms may be used interchangeably. Engines are realized in hardware, or in hardware controlled by software or firmware. As such, engines are tangible entities specially-purposed for performing specified operations and are structured in a certain manner.
[0038] In an example, circuitry may be arranged (e.g., internally or with respect to external entities such as other circuits) in a specified manner as an engine. In an example, the whole or part of one or more hardware processors may be configured by firmware or software (e.g., instructions, an application portion, or an application) as an engine that operates to perform specified operations. In an example, the software may reside on a tangible machine-readable storage medium. In an example, the software, when executed by the underlying hardware of the engine, causes the hardware to perform the specified operations. Accordingly, an engine is physically constructed, or specifically configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a specified manner or to perform part or all of any operation described herein.
[0039] Considering examples in which engines are temporarily configured, each of the engines need not be instantiated at any one moment in time. For example, where the engines comprise a general-purpose hardware processor core configured using software; the general-purpose hardware processor core may be configured as respective different engines at different times. Software may accordingly configure a hardware processor core, for example, to constitute a particular engine at one instance of time and to constitute a different engine at a different instance of time.
[0040] In an embodiment, the system 300 may be mounted on a vehicle having a LiDAR, as shown in FIG. 4. The system 300 may be used to provide 3D semantic segmentation results of surroundings along a route of the vehicle, for use with an autonomous vehicle control system. In another embodiment, the system 300 may be implemented on a remote server communicatively connected with the vehicle.
[0041] As depicted, the system 300 may include an input interface 310 to receive a point cloud data set from a LiDAR, a network, or a local memory. In an embodiment, the point cloud data set includes point cloud data for a time-ordered sequence of 3D frames. As shown on the right side of the dividing line, the time-ordered sequence of 3D frames may include frame_i, frame_{i-1}, … , frame_{i-k}, where i and k are positive integers and k < i.
[0042] The system 300 may include a scene flow estimation engine 320 to perform 3D scene flow estimation for frame_{i-1}, … , frame_{i-k}, taking frame_i as a reference frame. A velocity of each point in frame_{i-1}, … , frame_{i-k} may be predicted based on the 3D scene flow estimation. As shown on the right side of the dividing line, the arrows (→) on frame_{i-1}, … , frame_{i-k} represent the velocity of each point.
[0043] The system 300 may include a feature warping engine 330 to obtain a warped feature map corresponding to each of frame_{i-1}, … , frame_{i-k}, based on the 3D scene flow estimation for that frame. For example, a displacement of each point in frame_{i-1}, … , frame_{i-k} may be predicted according to the predicted velocity and the time difference between the frame containing the point and the reference frame. A warped 3D frame corresponding to each of frame_{i-1}, … , frame_{i-k} may then be obtained based on the predicted displacement of each point and the initial position (e.g., coordinates) of the point in the historical 3D frame, and the warped feature map corresponding to each of frame_{i-1}, … , frame_{i-k} may be obtained from the corresponding warped 3D frame. As another example, the warped feature map corresponding to each of frame_{i-1}, … , frame_{i-k} may be obtained by warping an original feature map of each of frame_{i-1}, … , frame_{i-k} based on the estimated 3D scene flow field for that frame.
[0044] The system 300 may include a feature aggregation engine 340 to aggregate the warped feature maps corresponding to frame_{i-1}, … , frame_{i-k} with an original feature map of the reference frame (i.e., frame_i) to produce an aggregated feature map.
[0045] The system 300 may further include a semantic segmentation engine 350 to perform the 3D semantic segmentation for the points illustrated in frame_i, based on the aggregated feature map. As shown by the illustrative process flow, the semantic segmentation engine 350 correctly identifies that the points in frame_i belong to a car, according to history information provided by frame_{i-1}, … , frame_{i-k}.
[0046] The above process is performed similarly for all points in the current frame. The semantic segmentation engine 350 may then obtain and output an outcome of the 3D semantic segmentation. As shown, the outcome of the 3D semantic segmentation may be rendered as a 3D map with different labels identifying different objects.
[0047] The system 300 may also include an output interface 360 to output the outcome of the 3D semantic segmentation. In an embodiment, the output interface 360 may be connected to a screen to display the outcome of the 3D semantic segmentation. In another embodiment, the output interface 360 may be connected to a transceiver for transmitting the outcome of the 3D semantic segmentation to a device communicatively connected with the system 300.
The outcome of the 3D semantic segmentation may be used by an autonomous vehicle control system to make a decision on a driving strategy.
[0048] In an embodiment, the scene flow estimation engine 320, feature warping engine 330, feature aggregation engine 340, and semantic segmentation engine 350 may be implemented by ANNs and processing circuitry supporting the ANNs. For example, the scene flow estimation engine 320 may be implemented by the FlowNet3D mentioned above, and the feature warping engine 330, feature aggregation engine 340, and semantic segmentation engine 350 may be implemented by the PointNet mentioned above.
[0049] FIG. 4 shows a vehicle 400 with a LiDAR 410 mounted thereon, according to an embodiment of the disclosure. The vehicle 400 may be an autonomous vehicle, for example. The LiDAR 410 may be used to continuously capture point cloud data for the surroundings of the vehicle while the vehicle is driving along a road. The LiDAR 410 may then provide the captured point cloud data to the system 300 of FIG. 3 for 3D semantic segmentation of the surroundings. In an example, more than one LiDAR 410 may be mounted on the vehicle 400. For example, the vehicle 400 may have multiple LiDARs 410 pointing in different directions. The vehicle 400 may also have multiple LiDARs 410 pointing in the same or similar directions with respect to the vehicle, but mounted at different locations. Although single-LiDAR vehicles are discussed herein, multiple-LiDAR vehicles may also be used, where some or all of the point cloud data may be captured by different LiDARs, or may be created from a composite of point cloud data captured from multiple LiDARs. Real-time operation, in the present context, operates with imperceptible or nominal processing delay, such that 3D semantic segmentation of the surroundings is obtained at a rate consistent with the rate at which 3D point cloud data for the surroundings are captured.
[0050] FIG. 5 shows an illustrative process flow of using the example system 300 of FIG. 3 to perform 3D semantic segmentation based on the point cloud data of the example situation 100 of FIG. 1. As mentioned above, the vehicle indicated by the arrow 110 is obscured by the vehicle indicated by the arrow 120 in FIG. 1.
[0051] At 510, a point cloud data set for a time-ordered sequence of original 3D frames, including frame_i, frame_{i-1}, … , frame_{i-k} (where i and k are positive integers and k < i), is obtained.
[0052] For the 3D scene flow field estimation at 520, a 3D scene flow field for frame_j (j = i-1, … , i-k) may be defined as M_{i→j} = 3DF(frame_i, frame_j).
[0053] At 530, the original feature maps of the historical 3D frames are warped to the reference frame, according to the 3D scene flow field for each of the historical 3D frames. The original feature maps may be outputs of N_feat, which represents a 3D feature extract sub-network for extracting an original feature map for each frame. A warping function may be defined as: f_{j→i} = W(f_j, M_{i→j}) = W(f_j, 3DF(frame_i, frame_j)), j = i-1, … , i-k, (1) where W(·) is a trilinear warping function applied on all locations for each channel in the feature maps, f_{j→i} denotes the feature map warped from the historical 3D frame (frame_j) to the reference frame (frame_i), and f_j denotes a feature map of frame_j without any feature warping operation.
[0054] For the feature warping at 530, in an embodiment, a warped 3D feature map corresponding to each of frame_{i-1}, … , frame_{i-k} may be obtained by: predicting a displacement of each point in each of frame_{i-1}, … , frame_{i-k} based on the estimated 3D scene flow field of the corresponding frame and the time difference between that frame and frame_i; obtaining the warped 3D frame corresponding to each of frame_{i-1}, … , frame_{i-k} based on the predicted displacement of each point and the initial coordinates of the point in the historical 3D frame; and obtaining a feature map of the warped 3D frame corresponding to each of frame_{i-1}, … , frame_{i-k}. In this embodiment, the feature map of the warped 3D frame corresponding to each of frame_{i-1}, … , frame_{i-k} may be aggregated with the original feature map of frame_i at 540, to produce an aggregated feature map.
[0055] For the feature warping at 530, in another embodiment, the warped feature map corresponding to each of frame_{i-1}, … , frame_{i-k} may be obtained by warping an original feature map of each of frame_{i-1}, … , frame_{i-k} based on the estimated 3D scene flow field for that frame. In this embodiment, at 540, the warped feature map for each of frame_{i-1}, … , frame_{i-k} may be aggregated with the original feature map of frame_i to produce an aggregated feature map. This approach tends to achieve a better result, since the selection of the feature maps is involved in the end-to-end training process of the ANN.
[0056] That is to say, during the feature aggregation at 540, the original feature map of the reference frame, i.e., frame_i, accumulates multiple feature maps from the historical 3D frames, i.e., frame_{i-1}, … , frame_{i-k}. These feature maps provide rich and diverse information for the 3D semantic segmentation, especially for the part occlusion obstacles situation illustrated in FIG. 1.
[0057] In an embodiment, during the feature aggregation process, different weights may be applied to different historical 3D frames, i.e., frame_{i-1}, … , frame_{i-k}. For example, different spatial locations may be assigned different weights, and all feature channels at the same spatial location may share the same weight. As a result, a weight for each of frame_{i-1}, … , frame_{i-k} may be based on a spatial location of the frame.
Particularly, a weight for each of frame_{i-1}, … , frame_{i-k} may be based on a degree of proximity in time of the frame to frame_i. In this context, feature warping from each of frame_{i-1}, … , frame_{i-k} to frame_i may be denoted as f_{j→i}, j = i-1, … , i-k, and a corresponding weight to be applied to the warped feature maps may be denoted as w_{j→i}. The aggregated feature map (f_i) at the reference frame (frame_i) may then be expressed as: f_i = Σ_{j=i-k}^{i} w_{j→i} f_{j→i} (2) As can be seen, k defines the range of historical frames for aggregation.
[0058] As another example, an adaptive weight may be applied to each of frame_{i-1}, … , frame_{i-k}. The adaptive weight indicates the importance of the corresponding historical 3D frame to the reference frame (frame_i). On one hand, if the warped feature map f_{j→i}(p) at location p is close to the original feature map of frame_i at that location, i.e., f_i(p), it is assigned a larger weight; otherwise, it is assigned a smaller weight. A cosine similarity metric is used herein to measure the similarity between the warped feature map and the original feature map of the reference frame. A tiny network Γ(·) is applied to the feature maps f_i and f_{j→i}, to project the feature maps to a new embedding for similarity measurement, similarly as described by Xizhou Zhu et al. in their article “Flow-Guided Feature Aggregation for Video Object Detection” (arXiv preprint arXiv:1703.10025, 2017), which is incorporated herein by reference in its entirety. As a result, the input to the layer for calculating the weight is Γ(N_feat) instead of N_feat itself. The corresponding weight to be applied to the warped feature map f_{j→i}(p) may be denoted as w_{j→i}(p), which may be estimated by the following equation: w_{j→i}(p) = exp( (3Df^e_{j→i}(p) · 3Df^e_i(p)) / (‖3Df^e_{j→i}(p)‖ ‖3Df^e_i(p)‖) ), (3) where 3Df^e = Γ(N_feat) denotes the 3D embedding feature maps for similarity measurement. The weight w_{j→i} may be obtained by normalizing w_{j→i}(p) for every spatial location p over the historical 3D frames. On the other hand, the importance of the corresponding historical 3D frame to the reference frame may be determined by a combination of a degree of proximity (e.g., in time) of the historical 3D frame to the reference frame and a degree of occlusion of an object of interest in the historical 3D frame.
[0059] At 550, the aggregated feature map f_i may be fed into the semantic segmentation engine 350 to obtain an outcome: y_i = N_seg(f_i), (4) where N_seg denotes a 3D semantic segmentation sub-network.
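For illustration only, the following Python sketch mirrors the adaptive-weight aggregation of equations (2) and (3) on per-point features (N points × C channels). The embedding network Γ is stubbed out as the identity, and all function and variable names are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def cosine_weights(f_ref_emb, f_warp_emb, eps=1e-8):
    """Per-point exp(cosine similarity) between embeddings, as in eq. (3)."""
    num = np.sum(f_ref_emb * f_warp_emb, axis=-1)
    den = np.linalg.norm(f_ref_emb, axis=-1) * np.linalg.norm(f_warp_emb, axis=-1) + eps
    return np.exp(num / den)                                  # shape (N,)

def aggregate(f_i, warped_feats, gamma=lambda f: f):
    """Weighted sum over reference and warped historical features, as in eq. (2)."""
    feats = [f_i] + list(warped_feats)                        # j = i, i-1, ..., i-k
    w = np.stack([cosine_weights(gamma(f_i), gamma(f)) for f in feats])   # (k+1, N)
    w = w / w.sum(axis=0, keepdims=True)                      # normalize per point
    return np.sum(w[..., None] * np.stack(feats), axis=0)     # aggregated map, (N, C)

N, C, k = 1024, 64, 3
f_i = np.random.rand(N, C)                                    # original feature map of frame_i
warped = [np.random.rand(N, C) for _ in range(k)]             # f_{j->i} for k historical frames
f_agg = aggregate(f_i, warped)
print(f_agg.shape)                                            # (1024, 64)
```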
[0060] FIG. 6 shows a schematic diagram of a neural network 600 for point cloud based 3D semantic segmentation, according to an embodiment of the disclosure. The neural network 600 has been trained to produce a 3D semantic segmentation outcome directly from the point cloud data for a time-ordered sequence of 3D frames. Therefore, the neural network 600 is able to respond quickly and without appreciable delay. The 3D semantic segmentation outcome can be used to assist an autonomous vehicle control system in determining a driving strategy.
[0061] A point cloud data set for a time-ordered sequence of 3D frames is provided to the neural network 600 as an input. The 3D frames may include a current 3D frame (indicated as frame_i) and one or more historical 3D frames (indicated as frame_{i-k}, frame_{i-(k-1)}, … , frame_{i-1}, where i and k are positive integers and k < i).
[0062] The neural network 600 may include architectures to process each of frame_{i-k}, frame_{i-(k-1)}, … , frame_{i-1} and frame_i. The number of historical 3D frames that can be processed by the neural network 600 may be limited by the performance of the hardware supporting the neural network. The architectures for the historical 3D frames, i.e., frame_{i-k}, frame_{i-(k-1)}, … , frame_{i-1}, are similar and are referred to herein as Arch-H. The architecture for the current 3D frame (frame_i) is referred to herein as Arch-R, since frame_i is taken as the reference frame during the whole process.
[0063] Taking the Arch-H for frame_{i-k} as an example, as illustrated, the Arch-H may include a scene flow estimation sub-network 610 to estimate a 3D scene flow field for frame_{i-k}. The 3D scene flow field provides a basis for the feature warping later. For example, the scene flow estimation sub-network 610 may be the FlowNet3D or any other network capable of implementing a similar operation.
[0064] The Arch-H may include a feature extract sub-network 620 to produce an original feature map for frame_{i-k}. The original feature map is the operation object of feature warping. For example, the feature extract sub-network 620 may be a part of the PointNet, which may be referred to as “PointNetFeat”.
[0065] The Arch-H may include a feature warping layer 630 to warp the original feature map produced by the feature extract sub-network 620, based on the 3D scene flow field estimated by the scene flow estimation sub-network 610, and with reference to frame_i. The feature warping layer 630 may produce a warped feature map for frame_{i-k} and obtain an adaptive weight accompanying the warped feature map. Both the warped feature map and the adaptive weight will participate in the feature aggregation later. The feature warping layer 630 may also be a part of the PointNet.
[0066] For the other historical 3D frames, the Arch-H may include similar components to implement similar operations as for frame_{i-k}. After the processing by the Arch-H, warped feature maps and corresponding adaptive weights may be obtained for all historical 3D frames. The warped feature maps and corresponding adaptive weights provide the operation objects of feature aggregation.
[0067] As mentioned above, the Arch-R is used to process the reference frame, frame_i. The Arch-R may include the feature extract sub-network 620 to produce an original feature map for frame_i. The original feature map for frame_i provides another operation object of feature aggregation. For example, the feature extract sub-network 620 may be a part of the PointNet, which may be called “PointNetFeat”.
[0068] Next, the neural network 600 may include a feature aggregation layer 640. The feature aggregation layer 640 may aggregate the warped feature map for each of frame_{i-k}, frame_{i-(k-1)}, … , frame_{i-1}, along with the corresponding adaptive weight, to the original feature map of frame_i, to obtain an aggregated feature map, which accumulates rich and diverse information from the historical 3D frames. For example, the feature aggregation layer 640 may also be a part of the PointNet.
[0069] The neural network 600 may include a 3D semantic segmentation sub-network 650. The 3D semantic segmentation sub-network 650 performs 3D semantic segmentation based on the aggregated feature map from the feature aggregation layer 640, to output the outcome of 3D semantic segmentation. For example, the 3D semantic segmentation sub-network 650 may also be a part of the PointNet, which may be called “PointNetSeg”.
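For illustration only, the following Python sketch outlines the data flow of the neural network 600 described in paragraphs [0062]-[0069]: one Arch-H branch per historical frame (scene flow estimation, feature extraction, feature warping) feeding the feature aggregation layer and the segmentation head applied to the reference frame. The sub-networks are passed in as placeholder callables; their names and signatures are assumptions for illustration and do not reflect an actual FlowNet3D or PointNet API.

```python
import numpy as np

def segment_current_frame(frames, estimate_scene_flow, extract_features,
                          warp_features, aggregate, segment):
    """frames: [frame_{i-k}, ..., frame_{i-1}, frame_i], each an (N, 3) array."""
    *historical, frame_i = frames
    f_i = extract_features(frame_i)                   # Arch-R: original feature map of frame_i

    warped, weights = [], []
    for frame_j in historical:                        # Arch-H, one branch per historical frame
        flow_ij = estimate_scene_flow(frame_i, frame_j)   # 3D scene flow field M_{i->j}
        f_j = extract_features(frame_j)                   # original feature map of frame_j
        f_ji, w_ji = warp_features(f_j, flow_ij, f_i)     # warped map + adaptive weight
        warped.append(f_ji)
        weights.append(w_ji)

    f_agg = aggregate(f_i, warped, weights)           # feature aggregation layer 640
    return segment(f_agg)                             # 3D semantic segmentation sub-network 650

# Toy stand-ins just to exercise the structure (not trained networks):
rng = np.random.default_rng(0)
frames = [rng.random((1024, 3)) for _ in range(4)]    # 3 historical frames + frame_i
labels = segment_current_frame(
    frames,
    estimate_scene_flow=lambda ref, hist: ref - hist,             # placeholder flow
    extract_features=lambda pts: pts,                             # identity "features"
    warp_features=lambda f, flow, f_ref: (f + flow, 1.0),         # move points, unit weight
    aggregate=lambda f_i, warped, w: np.mean([f_i, *warped], axis=0),
    segment=lambda f: (f[:, 0] > 0.5).astype(int),                # dummy per-point labels
)
print(labels.shape)                                               # (1024,)
```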
[0070] It is to be noted that, in order to perform the feature warping, the feature map size of the 3D scene flow field for each of frame_{i-k}, frame_{i-(k-1)}, … , frame_{i-1} and that of the original feature map for the same frame should be aligned. Further, in order to perform the feature aggregation, the feature map size of the warped feature map for each of frame_{i-k}, frame_{i-(k-1)}, … , frame_{i-1} and that of the original feature map of frame_i should be aligned. In an embodiment, the scene flow estimation sub-network 610 for each of frame_{i-k}, frame_{i-(k-1)}, … , frame_{i-1} may include an alignment layer (not shown) to align the feature map size of the 3D scene flow field for each of frame_{i-k}, frame_{i-(k-1)}, … , frame_{i-1} with the feature map size of the original feature map for the same frame. In another embodiment, the feature extract sub-network 620 for each of frame_{i-k}, frame_{i-(k-1)}, … , frame_{i-1} and frame_i may include an alignment layer (not shown) to align the feature map size of the original feature map for each of frame_{i-k}, frame_{i-(k-1)}, … , frame_{i-1} with the feature map size of the 3D scene flow field for the same frame, and to align the feature map size of the original feature map for frame_i with the feature map size of the original feature map for any of frame_{i-k}, frame_{i-(k-1)}, … , frame_{i-1}.
[0071] FIG. 7 shows an example workflow of the neural network 600 of FIG. 6.
[0072] FIG. 8 shows a schematic diagram of a training neural network 800 for point cloud based 3D semantic segmentation, according to an embodiment of the disclosure. The training neural network 800 may be used to train trainable parameters of each layer for the neural network 600 of FIG. 6.
[0073] As illustrated, a point cloud data set for a time-ordered sequence of 3D frames is provided to the training neural network 800 as an input. The 3D frames may include a current 3D frame (indicated as frame_i) and one or more historical 3D frames (indicated as frame_{i-k}, frame_{i-(k-1)}, … , frame_{i-1}, where i and k are positive integers and k < i).
[0074] Differently from the neural network 600, the training neural network 800 may include a dropout layer 810, to randomly select one frame from the k historical 3D frames. As shown in FIG. 8, the selected historical 3D frame is denoted as frame_x, where x = i-k, i-(k-1), … , i-1. The dropout layer 810 can prevent the training neural network 800 from over-fitting.
[0075] The training neural network 800 may include a scene flow estimation sub-network 820 to estimate a 3D scene flow field for frame_x. For example, the scene flow estimation sub-network 820 may be the FlowNet3D or any other network capable of implementing a similar operation.
[0076] The training neural network 800 may include a feature extract sub-network 830 to produce an original feature map for frame_x. For example, the feature extract sub-network 830 may be a part of the PointNet, which may be referred to as “PointNetFeat”.
[0077] The training neural network 800 may include a feature warping layer 840 to warp the original feature map produced by the feature extract sub-network 830, based on the 3D scene flow field estimated by the scene flow estimation sub-network 820, and with reference to frame_i. The feature warping layer 840 may produce a warped feature map for frame_x and obtain an adaptive weight accompanying the warped feature map. Both the warped feature map and the adaptive weight will participate in the feature aggregation later. The feature warping layer 840 may also be a part of the PointNet.
[0078] The training neural network 800 may include a feature extract sub-network 850 to produce an original feature map for frame_i. For example, the feature extract sub-network 850 may be a part of the PointNet, which may be referred to as “PointNetFeat”.
[0079] Next, the training neural network 800 may include a feature aggregation layer 860. The feature aggregation layer 860 may aggregate the warped feature map for frame_x, along with the corresponding adaptive weight, to the original feature map of frame_i, to obtain an aggregated feature map. For example, the feature aggregation layer 860 may also be a part of the PointNet.
[0080] The training neural network 800 may include a 3D semantic segmentation sub-network 870. The 3D semantic segmentation sub-network 870 performs 3D semantic segmentation based on the aggregated feature map from the feature aggregation layer 860. For example, the 3D semantic segmentation sub-network 870 may also be a part of the PointNet.
[0081] As mentioned, the performance of the training neural network 800 may be evaluated based on the loss function 880. When the result of the loss function 880 is small enough (such as below a predefined threshold), an applicable neural network is obtained. The training neural network 800 will run repeatedly to train trainable parameters among the layers of the scene flow estimation sub-network 820, the feature extract sub-network 830, the feature warping layer 840, the feature extract sub-network 850, the feature aggregation layer 860, and the 3D semantic segmentation sub-network 870, which may be collectively referred to as the PointNet. The process of training the trainable parameters among the layers of the PointNet may be called “backpropagation”.
[0082] During training of the neural network, the adaptive weight accompanying the warped feature map for each of the historical 3D frames may also be trained, by running the training neural network 800 repeatedly. The adaptive weights as trained will be applied, along with the trained neural network, in the neural network 600 of FIG. 6.
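For illustration only, the following Python sketch outlines one training iteration of the kind described for the training neural network 800: a historical frame is randomly selected (the dropout layer 810), the forward pass produces per-point logits, a softmax-style loss is evaluated (the loss function 880), and the parameters are refined by backpropagation. The forward pass and the parameter update are injected placeholders; all names and signatures are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-point softmax loss: logits (N, K), labels (N,) integer class ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def training_step(frames, labels, forward, loss_fn, update_params, rng):
    *historical, frame_i = frames
    frame_x = historical[rng.integers(len(historical))]   # dropout layer 810: pick one frame
    logits = forward(frame_x, frame_i)                     # sub-networks 820-870
    loss = loss_fn(logits, labels)                         # loss function 880 (e.g., softmax loss)
    update_params(loss)                                    # backpropagation / parameter refinement
    return loss

# Toy stand-ins just to exercise the structure (not trained networks):
rng = np.random.default_rng(0)
frames = [rng.random((256, 3)) for _ in range(4)]
labels = rng.integers(0, 5, size=256)
loss = training_step(
    frames, labels,
    forward=lambda fx, fi: rng.random((256, 5)),      # stand-in for sub-networks 820-870
    loss_fn=softmax_cross_entropy,
    update_params=lambda loss: None,                  # stand-in for the backpropagation update
    rng=rng,
)
print(float(loss))
```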
[0083] FIG. 9 is a flow diagram illustrating an example of a method 900 for point cloud based 3D semantic segmentation, according to an embodiment of the disclosure. Operations of the method are performed using computational hardware, such as that described above or below (e.g., processing circuitry). In some aspects, the method 900 can be performed by the neural network 600 of FIG. 6. In other aspects, a machine-readable storage medium may store instructions associated with the method 900, which when executed can cause a machine to perform the method 900.
[0084] The method 900 includes, at 910, obtaining a point cloud data set for a time-ordered sequence of 3D frames. The 3D frames include a current 3D frame and one or more historical 3D frames previous to the current 3D frame. The point cloud data set may be obtained from a vehicle-mounted LiDAR, for example. The point cloud data set may also be obtained from a local or remote database storing the point cloud data set.
[0085] At 920, a 3D scene flow field for each of the one or more historical 3D frames is estimated with reference to the current 3D frame.
[0086] At 930, an original feature map for each of the one or more historical 3D frames is produced. It is to be noted that operation 930 may happen synchronously or asynchronously with operation 920, which is not limited in the present disclosure.
[0087] At 940, a warped feature map for each of the one or more historical 3D frames is produced, along with an adaptive weight, by warping the original feature map for the corresponding historical 3D frame based on the 3D scene flow field for that frame. The adaptive weight may be taken into consideration in the feature aggregation operation at 960 later.
[0088] At 950, an original feature map for the current 3D frame is produced. It is to be noted that operation 950 may happen concurrently with any of operations 920 to 940, which is not limited in the present disclosure.
[0089] At 960, an aggregated feature map is produced by aggregating the warped feature map for each of the one or more historical 3D frames with the original feature map for the current 3D frame.
[0090] At 970, 3D semantic segmentation is performed based on the aggregated feature map.
[0091] Though some of the operations are shown in sequence, this is not meant to imply that the operations must be performed in the order shown. For example, some operations may happen concurrently, or the order of some operations may be reversed.
[0092] FIG. 10 is a flow diagram illustrating an example of a method 1000 for training a neural network for point cloud based 3D semantic segmentation, according to an embodiment of the disclosure. Operations of the method are performed using computational hardware, such as that described above or below (e.g., processing circuitry). In some aspects, the method 1000 can be performed by the training neural network 800 of FIG. 8. In other aspects, a machine-readable storage medium may store instructions associated with the method 1000, which when executed can cause a machine to perform the method 1000.
[0093] The method 1000 includes, at 1010, obtaining a point cloud data set for a time-ordered sequence of 3D frames. The 3D frames include a current 3D frame and one or more historical 3D frames previous to the current 3D frame.
The point cloud data set may be obtained from a vehicle-mounted LiDAR, for example. The point cloud data set may also be obtained from a local or remote database storing the point cloud data set.
[0094] At 1020, a historical 3D frame is randomly selected from the one or more historical 3D frames.
[0095] At 1030, a test result is produced based on forward-propagating processing of the selected historical 3D frame through a training neural network, such as the training neural network 800 of FIG. 8.
[0096] At 1040, a loss function is applied to evaluate the test result to produce a loss value. The loss function may be Softmax, for example.
[0097] At 1050, the loss value is reduced by refining trainable parameters of the training neural network based on backpropagation of the loss function through the training neural network.
[0098] At 1060, the refined trainable parameters are supplied to configure a neural network for inference, such as the neural network 600 of FIG. 6.
[0099] The forward-propagating processing of the selected historical 3D frame at 1030 may include: estimating a 3D scene flow field for the selected historical 3D frame at 1031; and producing an original feature map for the selected historical 3D frame at 1032. It is to be noted that operation 1031 may happen synchronously or asynchronously with operation 1032, which is not limited in the present disclosure.
[0100] The forward-propagating processing of the selected historical 3D frame at 1030 may further include: at 1033, producing a warped feature map for the selected historical 3D frame, along with an adaptive weight, by warping the original feature map for the selected historical 3D frame based on the 3D scene flow field for the selected historical 3D frame.
[0101] The forward-propagating processing of the selected historical 3D frame at 1030 may further include: producing an original feature map for the current 3D frame at 1034. It is to be noted that operation 1034 may happen concurrently with any of operations 1031 to 1033, which is not limited in the present disclosure.
[0102] The forward-propagating processing of the selected historical 3D frame at 1030 may further include: at 1035, producing an aggregated feature map by aggregating the warped feature map for the selected historical 3D frame with the original feature map for the current 3D frame.
[0103] The forward-propagating processing of the selected historical 3D frame at 1030 may further include: at 1036, performing 3D semantic segmentation based on the aggregated feature map. The test result includes an outcome of the 3D semantic segmentation.
[0104] Though some of the operations are shown in sequence, this is not meant to imply that the operations must be performed in the order shown. For example, some operations may happen concurrently, or the order of some operations may be reversed.
[0105] FIG. 11 is a flow diagram illustrating an example of a method 1100 for point cloud based 3D semantic segmentation, according to an embodiment of the disclosure. Operations of the method are performed using computational hardware, such as that described above or below (e.g., processing circuitry). In some aspects, the method 1100 can be performed by the system 300 of FIG. 3 or a computing device in the vehicle 400 of FIG. 4. In other aspects, a machine-readable storage medium may store instructions associated with the method 1100, which when executed can cause a machine to perform the method 1100.
[0106] The method 1100 includes, at 1110, obtaining a point cloud data set for a time-ordered sequence of 3D frames. The 3D frames include a current 3D frame and one or more historical 3D frames previous to the current 3D frame. The point cloud data set may be obtained from a vehicle-mounted LiDAR, for example. The point cloud data set may also be obtained from a local or remote database storing the point cloud data set.
[0107] At 1120, a first artificial neural network (ANN) is invoked to estimate a 3D scene flow field for each of the one or more historical 3D frames by taking the current 3D frame as a reference frame.
[0108] At 1130, a second ANN is invoked to produce an aggregated feature map, based on the reference frame and the estimated 3D scene flow field for each of the one or more historical 3D frames, and to perform the 3D semantic segmentation based on the aggregated feature map.
[0109] In some embodiments, the first ANN and the second ANN may be integrated into a single ANN, such as the neural network 600 described in FIG. 6. The first ANN and the second ANN may be trained jointly or separately, which is not limited in the present disclosure.
[0110] FIG. 12 illustrates a block diagram of an example machine 1200 upon which any one or more of the techniques (e.g., methodologies) discussed herein may be performed. Examples, as described herein, may include, or may operate by, logic or a number of components, or mechanisms in the machine 1200. Circuitry (e.g., processing circuitry) is a collection of circuits implemented in tangible entities of the machine 1200 that include hardware (e.g., simple circuits, gates, logic, etc.). Circuitry membership may be flexible over time. Circuitries include members that may, alone or in combination, perform specified operations when operating. In an example, hardware of the circuitry may be immutably designed to carry out a specific operation (e.g., hardwired). In an example, the hardware of the circuitry may include variably connected physical components (e.g., execution units, transistors, simple circuits, etc.) including a machine readable medium physically modified (e.g., magnetically, electrically, moveable placement of invariant massed particles, etc.) to encode instructions of the specific operation. In connecting the physical components, the underlying electrical properties of a hardware constituent are changed, for example, from an insulator to a conductor or vice versa. The instructions enable embedded hardware (e.g., the execution units or a loading mechanism) to create members of the circuitry in hardware via the variable connections to carry out portions of the specific operation when in operation. Accordingly, in an example, the machine readable medium elements are part of the circuitry or are communicatively coupled to the other components of the circuitry when the device is operating. In an example, any of the physical components may be used in more than one member of more than one circuitry. For example, under operation, execution units may be used in a first circuit of a first circuitry at one point in time and reused by a second circuit in the first circuitry, or by a third circuit in a second circuitry, at a different time. Additional examples of these components with respect to the machine 1200 follow. ……
……
……