Nvidia Patent | In-hand object pose tracking
Patent: In-hand object pose tracking
Publication Number: 20210122045
Publication Date: 20210429
Applicant: Nvidia
Abstract
Apparatuses, systems, and techniques are described that estimate the pose of an object while the object is being manipulated by a robotic appendage. In at least one embodiment, a sample-based optimization algorithm is developed that tracks in-hand object poses during manipulation via contact feedback and GPU-accelerated robotic simulation. In at least one embodiment, parallel simulations concurrently model object pose changes that may be caused by complex contact dynamics. In at least one embodiment, the optimization algorithm tunes simulation parameters during object pose tracking to further improve tracking performance. In various embodiments, real-world contact sensing may be improved by utilizing vision in-the-loop.
Claims
-
A computer-implemented method comprising: obtaining tactile sensor information from a robotic appendage that is manipulating an object in the real world; generating a plurality of simulations of the robotic appendage manipulating the object, individual simulations having different poses for the object; determining a plurality of costs, where each cost of the plurality of costs corresponds to a respective simulation of the plurality of simulations, and each cost of the plurality of costs is based at least in part on differences between the tactile sensor information and simulated tactile sensor information generated by the respective simulation of the plurality of simulations; identifying an individual simulation of the plurality of simulations based at least in part on the cost; determining a pose of the object in the real world based at least in part on a pose of the object in the identified individual simulation; and providing the pose of the object to a robotic control system that controls a robot to perform a task based at least in part on the pose of the object.
-
The computer-implemented method of claim 1, further comprising updating one or more physical parameters of the plurality of simulations to reduce a difference between a simulation and an observation in the real world.
-
The computer-implemented method of claim 1, wherein the plurality of simulations are implemented using a GPU-Accelerated physics simulator.
-
The computer-implemented method of claim 1, wherein the individual simulation is identified by identifying an individual simulation that generates simulated tactile sensor information most similar to the tactile sensor information.
-
The computer-implemented method of claim 1, further comprising: obtaining an initial pose estimation of the object; and generating a plurality of possible poses to be applied to the plurality of simulations, the plurality of possible poses generated by modifying the initial pose estimation.
-
The computer-implemented method of claim 5, wherein the initial pose estimation of the object is determined based on an image of the object obtained before the object is grasped by the robotic appendage.
-
The computer-implemented method of claim 1, wherein the individual simulation is identified by identifying an individual simulation with a lowest associated cost.
-
The computer-implemented method of claim 1, wherein the tactile sensor information includes a 2-dimensional array of force values for each digit of the robotic appendage.
-
A system comprising: one or more processors; and computer-readable memory storing executable instructions that, as a result of being executed by the one or more processors, cause the system to: obtain data describing forces on a robotic appendage that is grasping an object; generate simulations of the robotic appendage grasping the object, individual simulations of the simulations having different poses for the object; determine a plurality of values where each value of the plurality of values corresponds to a respective simulation of the plurality of simulations, and each value of the plurality of values is based at least in part on differences between the forces and simulated forces generated by the respective simulation of the plurality of simulations; identify an individual simulation of the simulations based at least in part on the value; and determine a pose of the object in the real world based at least in part on a pose of the object in the identified individual simulation.
-
The system of claim 9, wherein the executable instructions cause the system to further update one or more physical parameters of the simulations to reduce a difference between a state of a simulation and an observed state in the real world.
-
The system of claim 9, wherein the one or more processors include a graphics processing unit.
-
The system of claim 9, wherein the individual simulation is identified by identifying an individual simulation that generates simulated data most closely corresponding to the data.
-
The system of claim 9, wherein the executable instructions cause the system to further: obtain an initial pose of the object; and generate a plurality of poses to be applied to objects in the simulations, the plurality of poses generated by perturbing the initial pose.
-
The system of claim 13, wherein the initial pose of the object is determined using an image of the object obtained before the object is grasped by the robotic appendage.
-
The system of claim 9, wherein: the value is a measure of difference between the forces and the simulated forces; and the individual simulation is identified by identifying an individual simulation with a lowest associated value.
-
The system of claim 9, wherein the data is tactile sensor information generated by a tactile force sensor on each digit of the robotic appendage.
-
Computer-readable media storing instructions that, as a result of being executed by one or more processors of a computer system, cause the computer system to: obtain data describing forces on a robotic appendage that is grasping an object; perform simulations of the robotic appendage grasping the object, individual simulations of the simulations having different poses for the object; determine a plurality of values where each value of the plurality of values corresponds to a respective simulation of the plurality of simulations, and each value of the plurality of values is based at least in part on differences between the forces and simulated forces generated by the respective simulation of the plurality of simulations; identify an individual simulation of the simulations based at least in part on the value; and determine a pose of the object in the real world based at least in part on a pose of the object in the identified individual simulation.
-
The computer-readable media of claim 17, wherein the instructions cause the computer system to further update one or more parameters of the simulations to reduce a difference between a state of a simulation and an observed state in the real world.
-
The computer-readable media of claim 17, wherein the one or more processors include a multi-core graphics processing unit.
-
The computer-readable media of claim 17, wherein the simulations are performed in parallel using a plurality of processors.
-
The computer-readable media of claim 17, wherein the instructions cause the computer system to further: obtain an initial pose of the object; and generate a plurality of poses to be applied to objects in the simulations, the plurality of poses generated by perturbing the initial pose.
-
The computer-readable media of claim 21, wherein the initial pose of the object is determined using an image of the object obtained with a depth camera.
-
The computer-readable media of claim 17, wherein: the value is a measure of difference between the forces and the simulated forces; and the individual simulation is identified by identifying an individual simulation with a lowest associated value.
-
The computer-readable media of claim 17, wherein the data is tactile sensor information generated by a tactile force sensor on the robotic appendage.
-
A robot comprising: an arm that includes one or more articulated members connected via one or more servo motors; a robotic appendage connected to the arm, the robotic appendage having one or more tactile force sensors; one or more processors; and the computer-readable media of claim 17 connected to the one or more processors.
Description
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of U.S. Provisional Patent Application No. 62/925,669, filed Oct. 24, 2019, entitled “IN-HAND OBJECT POSE TRACKING VIA CONTACT FEEDBACK AND GPU-ACCELERATED ROBOTIC SIMULATION,” the disclosure of which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
[0002] At least one embodiment pertains to training and simulating robots to perform and facilitate tasks. For example, at least one embodiment pertains to training and simulating robots using artificial intelligence according to various novel techniques described herein.
BACKGROUND
[0003] Training and simulating robots to accurately perform tasks can use significant memory, time, or computing resources. Training a robotic control system to track the pose of an object held and manipulated by a robot hand is challenging for vision-based object pose tracking systems, because the object is under significant occlusion while the robot hand is holding it. Such occlusion reduces the amount of data that can be used in determining what movements a robot should make, creating risks that whatever task is being performed will be done incorrectly and/or inefficiently, perhaps damaging the object or other objects in the environment in the process. Such tracking is especially complex due to the fact that objects sometimes slip or otherwise move during the process, creating changes in the object’s orientation that can go undetected and, therefore, unaccounted for. The amount of memory, time, or computing resources used to accurately train and simulate robots can be improved.
BRIEF DESCRIPTION OF DRAWINGS
[0004] FIG. 1 illustrates an example of teleoperation across various tasks, according to at least one embodiment;
[0005] FIG. 2 illustrates an example of a robot with tactile sensors, according to at least one embodiment;
[0006] FIG. 3 illustrates an example of a system that tracks objects in real-time, according to at least one embodiment;
[0007] FIG. 4 illustrates an example of estimating hand pose, according to at least one embodiment;
[0008] FIG. 5 illustrates an example of a human hand pose, and a robotic gripper performing a corresponding pose, according to at least one embodiment;
[0009] FIG. 6 illustrates an example of human hand poses, and corresponding robot gripper poses, according to at least one embodiment;
[0010] FIG. 7 illustrates an example of an in-hand object pose tracking framework, according to at least one embodiment;
[0011] FIG. 8 illustrates an example of a comparison of optimizers, according to at least one embodiment;
[0012] FIG. 9 illustrates an example of an algorithm utilized by a system, according to at least one embodiment;
[0013] FIG. 10 illustrates a first example of results of ablation studies, according to at least one embodiment;
[0014] FIG. 11 illustrates a second example of results of ablation studies, according to at least one embodiment;
[0015] FIG. 12 illustrates an example of real-world experiment results, according to at least one embodiment;
[0016] FIG. 13 illustrates an example of a process that, as a result of being performed by a computer system, determines the pose of an object being manipulated by a robotic hand equipped with tactile force sensors;
[0017] FIG. 14A illustrates inference and/or training logic, according to at least one embodiment;
[0018] FIG. 14B illustrates inference and/or training logic, according to at least one embodiment;
[0019] FIG. 15 illustrates training and deployment of a neural network, according to at least one embodiment;
[0020] FIG. 16 illustrates an example data center system, according to at least one embodiment;
[0021] FIG. 17A illustrates an example of an autonomous vehicle, according to at least one embodiment;
[0022] FIG. 17B illustrates an example of camera locations and fields of view for the autonomous vehicle of FIG. 17A, according to at least one embodiment;
[0023] FIG. 17C is a block diagram illustrating an example system architecture for the autonomous vehicle of FIG. 17A, according to at least one embodiment;
[0024] FIG. 17D is a diagram illustrating a system for communication between cloud-based server(s) and the autonomous vehicle of FIG. 17A, according to at least one embodiment;
[0025] FIG. 18 is a block diagram illustrating a computer system, according to at least one embodiment;
[0026] FIG. 19 is a block diagram illustrating a computer system, according to at least one embodiment;
[0027] FIG. 20 illustrates a computer system, according to at least one embodiment;
[0028] FIG. 21 illustrates a computer system, according to at least one embodiment;
[0029] FIG. 22A illustrates a computer system, according to at least one embodiment;
[0030] FIG. 22B illustrates a computer system, according to at least one embodiment;
[0031] FIG. 22C illustrates a computer system, according to at least one embodiment;
[0032] FIG. 22D illustrates a computer system, according to at least one embodiment;
[0033] FIGS. 22E and 22F illustrate a shared programming model, according to at least one embodiment;
[0034] FIG. 23 illustrates exemplary integrated circuits and associated graphics processors, according to at least one embodiment;
[0035] FIGS. 24A and 24B illustrate exemplary integrated circuits and associated graphics processors, according to at least one embodiment;
[0036] FIGS. 25A and 25B illustrate additional exemplary graphics processor logic according to at least one embodiment;
[0037] FIG. 26 illustrates a computer system, according to at least one embodiment;
[0038] FIG. 27A illustrates a parallel processor, according to at least one embodiment;
[0039] FIG. 27B illustrates a partition unit, according to at least one embodiment;
[0040] FIG. 27C illustrates a processing cluster, according to at least one embodiment;
[0041] FIG. 27D illustrates a graphics multiprocessor, according to at least one embodiment;
[0042] FIG. 28 illustrates a multi-graphics processing unit (GPU) system, according to at least one embodiment;
[0043] FIG. 29 illustrates a graphics processor, according to at least one embodiment;
[0044] FIG. 30 is a block diagram illustrating a processor micro-architecture for a processor, according to at least one embodiment;
[0045] FIG. 31 illustrates a deep learning application processor, according to at least one embodiment;
[0046] FIG. 32 is a block diagram illustrating an example neuromorphic processor, according to at least one embodiment;
[0047] FIG. 33 illustrates at least portions of a graphics processor, according to one or more embodiments;
[0048] FIG. 34 illustrates at least portions of a graphics processor, according to one or more embodiments;
[0049] FIG. 35 illustrates at least portions of a graphics processor, according to one or more embodiments;
[0050] FIG. 36 is a block diagram of a graphics processing engine 3610 of a graphics processor in accordance with at least one embodiment;
[0051] FIG. 37 is a block diagram of at least portions of a graphics processor core, according to at least one embodiment;
[0052] FIGS. 38A and 38B illustrate thread execution logic 3800 including an array of processing elements of a graphics processor core according to at least one embodiment;
[0053] FIG. 39 illustrates a parallel processing unit (“PPU”), according to at least one embodiment;
[0054] FIG. 40 illustrates a general processing cluster (“GPC”), according to at least one embodiment;
[0055] FIG. 41 illustrates a memory partition unit of a parallel processing unit (“PPU”), according to at least one embodiment; and
[0056] FIG. 42 illustrates a streaming multi-processor, according to at least one embodiment.
DETAILED DESCRIPTION
[0057] The present document describes a system and method for estimating the pose of an object while the object is being manipulated by a robotic hand, claw, or manipulator. When an object is being held by a robot, image-based pose estimation systems may, in various examples, suffer from inaccurate pose estimation caused by object occlusion. In at least one embodiment, a robotic hand is equipped with tactile sensors, and while an object is manipulated by the robotic hand, sensor signals generated by the tactile sensors are used to improve the estimate of the pose of the object. In some situations, dynamic effects such as slipping may occur during active manipulation that increase the difficulty of pose estimation. In at least one embodiment, a physics model of the object is used to improve the modeling of robot-object interactions.
[0058] In at least one embodiment, a graphics processing unit (“GPU”)-accelerated physics engine with derivative-free, sample-based optimizers tracks in-hand object poses with contact feedback during manipulation. In at least one embodiment, the physics simulation is used as the forward model for robot-object interactions, and the techniques described herein jointly optimize for the state and the parameters of the simulations, such that the simulations more accurately estimate the real world.
[0059] In at least one embodiment, techniques described herein explicitly model the dynamics of robot-object interactions for object pose tracking and optimize for simulation parameters during pose tracking. In various examples, these features allow the system to track the object pose under complex dynamic behaviors, such as translational and torsional slippage due to inertial and external forces, and breaking and re-establishing contact. In addition, by using a GPU-accelerated physics engine, these techniques can often be applied in real time (30 Hz) on a single GPU.
[0060] Various embodiments demonstrate promising applications for GPU-accelerated physics simulation in robotics. For example, in some embodiments, the speed of the physics engine allows expensive, contact-rich simulations and sample-based optimization methods, which rely on data from many concurrent simulations, to run together in real time on the same machine, which is often more difficult with CPU-based simulations. Various embodiments may be used as a tool for estimating in-hand object pose and for relaxing a constraint that many dexterous manipulation researchers face, which is that the object is usually placed in such a way that it is mostly visible from a camera, limiting the range of manipulation tasks that can be studied.
[0061] Teleoperation may imbue lifeless robotic systems with sophisticated reasoning skills, intuition, and creativity. However, teleoperation solutions for high degree-of-actuation (“DoA”), multi-fingered robots may be complex. In at least one embodiment, a system is developed that allows for complete control over a high DoA robotic system by merely observing the bare human hand. The system may enable operators to solve a variety of complex manipulation tasks that go beyond simple pick-and-place operations. In various embodiments, the system may be implemented by one or more systems as described/depicted in FIGS. 13-40.
[0062] Tracking the pose of an object while it is being held and manipulated by a robot hand may be difficult for vision-based methods due to significant occlusions. The techniques described herein utilize GPU-accelerated parallel robot simulations and derivative-free, sample-based optimizers to track in-hand object poses with contact feedback during manipulation. In some examples, a physics simulation is used as the forward model for robot-object interactions, and the algorithm jointly optimizes for the state and the parameters of the simulations, so they better match with those of the real world. At least one embodiment runs in real-time (30 Hz) on a single GPU, and it achieves an average point cloud distance error of 6 mm in simulation experiments and 13 mm in the real-world ones.
[0063] In at least one embodiment, performing dexterous manipulation policies benefits from a robust estimate of the pose of the object held in-hand. However, in many implementations, in-hand object pose tracking still presents a challenge due to significant occlusions. In such implementations, works that require in-hand object poses may be limited to experiments where the object is mostly visible or rely on multiple cameras, or the hand-object transform is fixed or known. In some examples, the issue of visual occlusions is mitigated by studying object pose estimation via contacts or tactile feedback, often by using particle filters and knowledge of the object geometry and contact locations. In at least one embodiment, these techniques may be applied to a static-grasp setting, where an object is stationary and in-grasp. In at least one embodiment, these techniques are extended to tracking object poses during in-hand manipulation, requiring modeling of complex object-hand contact dynamics.
[0064] To provide in-hand object tracking during robot manipulation, at least one embodiment combines a GPU-accelerated, high-fidelity physics simulator as the forward dynamics model with a sample-based optimization framework to track object poses with contact feedback as shown in FIG. 7. In at least one embodiment, a concurrent set of simulations is initialized with the initial states of a real robot and the initial pose of the real object, which may be obtained from a vision-based pose registration algorithm assuming the object is not in occlusion in the beginning. In at least one embodiment, the initial poses of the simulated objects are slightly perturbed and reflect the uncertainty of the vision-based pose registration algorithm. In at least one embodiment, the GPU-accelerated physics simulator runs many concurrent simulations in real-time on a single GPU. In at least one embodiment, as a given policy controls the real robot to approach, grasp, and manipulate the object in-hand, the system runs the same robot control commands on the simulated robots. In at least one embodiment, observations of the real robot and the simulated robots are collected, which include terms like the magnitude and direction of contacts on the robot hand’s contact sensors. In at least one embodiment, a sample-based optimization algorithm periodically updates the states and parameters of the simulations according to a cost function that captures how well the observations of each simulation match those of the real world. In addition, in some embodiments, the algorithm updates simulation parameters, such as mass and friction, to further improve the simulations’ dynamics models of the real world. In at least one embodiment, at any point in time, the object pose estimate is the pose of the robot-object system.
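As a non-authoritative sketch of the cost-and-selection step described above, the following Python fragment compares real and simulated contact observations and returns the object pose of the lowest-cost simulation. The field names, weights, and the particular form of the cost are illustrative assumptions, not definitions taken from the patent.

```python
import numpy as np

def observation_cost(real_obs, sim_obs, w_force=1.0, w_joint=0.1):
    """Score how well one simulation's observations match the real robot's.

    real_obs / sim_obs are dicts with (assumed) keys:
      'contact_forces':  (L, 3) force vectors from the L fingertip contact sensors
      'joint_positions': (D,) joint angles of the hand-arm system
    The weights are illustrative, not values from the patent.
    """
    force_err = np.linalg.norm(
        real_obs["contact_forces"] - sim_obs["contact_forces"], axis=-1).sum()
    joint_err = np.linalg.norm(
        real_obs["joint_positions"] - sim_obs["joint_positions"])
    return w_force * force_err + w_joint * joint_err

def current_pose_estimate(real_obs, sim_observations, sim_object_poses):
    """Return the object pose of the lowest-cost simulation, plus all costs."""
    costs = [observation_cost(real_obs, s) for s in sim_observations]
    best = int(np.argmin(costs))
    return sim_object_poses[best], costs
```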
[0065] In at least one embodiment, to evaluate the proposed algorithm, a total of 24 in-hand manipulation trajectories with three different objects in simulation and in the real world were collected. In at least one embodiment, a Kuka IIWA7 arm with the 4-finger Wonik Robotics Allegro hand as the end-effector was used, with each finger outfitted with a SynTouch BioTac contact sensor. In at least one embodiment, object manipulation trajectories are human demonstrations collected via a hand-tracking teleoperation system. In at least one embodiment, because ground-truth object poses are available in simulation, detailed ablation studies are performed in simulation experiments to study the properties of the proposed algorithm. In at least one embodiment, for real-world experiments, a vision-based algorithm is used to obtain the object pose in the first and last frame of the collected trajectories, where the object is not in occlusion. In at least one embodiment, the pose in the first frame is used to initialize the simulations, and the pose in the last frame is used to evaluate the accuracy of the proposed contact-based algorithm.
[0066] Various examples identify in-hand object poses with vision only, usually by first segmenting out the robot or human hand in an image before performing pose estimation. However, vision-only approaches may degrade in performance for larger occlusions. Some embodiments use tactile feedback to aid object pose estimation. Tactile perception can identify object properties such as materials and pose, as well as provide feedback during object manipulation.
[0067] In at least one embodiment, experiments with dynamics models and particle filter techniques reveal that adding noise to applied forces instead of the underlying dynamics yields more accurate tracking results. At least one embodiment combines tactile feedback with a vision-based object tracker to track object trajectories during planar pushing tasks, and another applies incremental smoothing and mapping (“iSAM”) to combine global visual pose estimations with local contact pose readings.
[0068] In at least one embodiment, a robot hand grasps an object and localizes the object pose without moving. Some examples use point contact locations and some examples use a full tactile map to extract local geometry information around the contacts.
[0069] At least one embodiment uses contact location feedback for pose estimation, and some implementations use a variation of Bayesian or particle filtering. Some embodiments perform filtering jointly over visual features, hand joint positions, force-torque readings, and binary contact modes. Some techniques can be applied to pose estimation when the object is not held by the robot hand as well by using force probes.
[0070] In at least one embodiment, tactile maps are used for pose estimation; some examples use large, low-resolution tactile arrays to sense contacts in a grid, while other examples use high-resolution tactile sensors mounted on robot fingertips. In at least one embodiment, the system searches for similar local patches on an object surface to localize the object with respect to the contact location, and other systems fuse GelSight data with a point cloud perceived by a depth sensor before performing pose estimation.
Some embodiments implement in-hand object pose tracking during object manipulation, which is more challenging than if the object is static. In at least one embodiment, an algorithm that combines contact locations with Dense Articulated Real-time Tracking (“DART”) is used. In at least one embodiment, the algorithm fuses contact locations with color visual features, joint positions, and force-torque readings. In at least one embodiment, the algorithm is sensitive to initialization of the object poses, especially when the object appears small in the depth image. In at least one embodiment, techniques described herein do not assume access to robust visual features during manipulation, but instead utilize a physics simulator to model both the kinematics and the dynamics of the robot-object system.
[0072] In various examples, robotic teleoperation may have applications in search and rescue, space, medicine, and applied machine learning. The motivation for teleoperative capability may be to allow a robot system to solve complex tasks by harnessing the cognition, creativity, and reactivity of humans through a human-machine interface (“HMI”). In an embodiment, this system provides a glove-free solution to drive a multi-fingered, highly actuated robot system to solve a wide variety of grasping and manipulation tasks. In some examples, depth cameras and various graphics processing units (“GPUs”) may be used along with deep learning and optimization to produce a minimal-footprint, dexterous teleoperation system. In some examples, a variety of physical tasks can be performed with visual feedback alone. Therefore, this system may utilize the human ability to plan, move, and predict the consequences of physical actions from vision alone, which may be a sufficient condition for solving a variety of tasks.
[0073] The developed system, in various embodiments, enables such dexterous robot manipulation using multi-camera depth observations of the bare human hand. In some examples, the system may be a glove-free and entirely vision-based teleoperation system that dexterously articulates a highly-actuated robotic hand-arm system through direct imitation. The system may also demonstrate a range of tasks particularly involving fine manipulation and dexterity (e.g., extracting paper money from a wallet and concurrently picking two cubes with four fingers as depicted in FIG. 1).
[0074] FIG. 1 illustrates an example of teleoperation across various tasks, according to at least one embodiment. In one example, a robotic gripper 104 grasps a cylinder using a grasp pose based on a human hand 102. In another example, a robotic gripper 108 grasps a cube using a grasp pose based on a human hand 106. In another example, a robotic gripper 112 grasps a cup using a grasp pose based on a human hand 110. In another example, a robotic gripper 116 grasps a wallet using a grasp pose based on a human hand 114.
[0075] The teleoperation setup may comprise a robot system and an adjacent human pilot arena as shown in FIG. 2. FIG. 2 illustrates an example of a robot with tactile sensors, according to at least one embodiment. In at least one embodiment, a robot 202 has a robotic gripper 204 that is used to grasp objects. In at least one embodiment a set of cameras 206, 208, 210, and 212 are used to observe the workspace of the robot 202. In at least one embodiment, the gripper 204 includes a set of tactile sensors 216, 218, 220, and 222 that provide sensory information to a control computer system. In at least one embodiment, the tactile sensors may be covered with a friction material to enhance and/or improve the robot’s ability to grip an object.
[0076] In some embodiments, as depicted in FIG. 2, the robot system may be a KUKA LBR iiwa7 R800 series arm with a Wonik Robotics Allegro hand retrofitted with four SynTouch BioTac tactile sensors at the fingertips and 3M TB641 grip tape applied to the inner surfaces of the phalanges and palm, in which the rubbery surfaces of both the BioTac sensors and 3M tape may improve friction of the hand while the BioTacs themselves may produce 23 signals that can later be used to learn sensorimotor control from demonstrations. The human arena may be a black-clothed table surrounded by four calibrated and time-synchronized cameras, such as Intel RealSense RGB-D cameras, which may be spatially arranged to cover a workspace of 80 cm×55 cm×38 cm. In some examples, the cameras may be directly adjacent to the robot to improve line-of-sight and visual proximity since teleoperation is entirely based on human vision and spatial reasoning. It should be noted that FIG. 2 is intended to be an illustrative example and, in various embodiments, the system may include any robot system utilizing any robot components (e.g., various types of robot arms, hands, tactile sensors, grip, other sensors, cameras, and/or variations thereof) in any suitable environment.
[0077] To produce a natural-feeling teleoperation system, an imitation-type paradigm may be adopted. The bare human hand motion, pose, and finger configuration may be constantly observed and measured by a visual perception module. The human hand motion may then be relayed to the robot system in such a way that the copied motion is self-evident. This approach may enable a human pilot to curl and arrange their fingers, form grasps, reorient and translate their palms, with the robot system following in a similar manner. In at least one embodiment, the system relies heavily on Dense Articulated Real-Time Tracking (“DART”), which may form the backbone of tracking the pose and joint angles of the human hand. The full system architecture and component connections are depicted in FIG. 3, in an embodiment.
[0078] FIG. 3 illustrates an example of a system that tracks objects in real-time, according to at least one embodiment. In at least one embodiment, the system operates using three threads, which are independent processes running on one or more processors of a computer system. In at least one embodiment, one or more images of a hand are obtained from RGB-Depth (“RGB-D”) cameras 302. The images are processed by a pointnet:stage 1 304, a pointnet:stage 2 306, and a jointnet 308 to produce a hand pose for the hand in the images. In at least one embodiment, an articulated hand model 310 and the hand pose are processed using DART 312 and kinematic retargeting 314 to produce a corresponding hand pose for a robotic gripper. In at least one embodiment, a control thread applies Riemannian motion policies 318 to the gripper hand pose, and the resulting information is used to control the robot 320.
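The three-stage structure described above can be sketched, purely as an assumed skeleton, with standard Python primitives; every callable below is a placeholder standing in for a component of FIG. 3, not an API from the patent.

```python
import queue

# Assumed skeleton of the FIG. 3 pipeline: a perception loop estimates the human hand
# pose, a retargeting loop maps it to gripper joint angles, and a control loop applies
# motion policies. The callables passed in stand for components the patent describes.
hand_pose_q = queue.Queue(maxsize=1)   # latest human hand pose
target_q = queue.Queue(maxsize=1)      # latest retargeted gripper joint angles

def perception_loop(get_rgbd_frames, estimate_hand_pose):
    while True:
        frames = get_rgbd_frames()                     # RGB-D camera images
        hand_pose_q.put(estimate_hand_pose(frames))    # PointNet stages, JointNet, DART

def retargeting_loop(kinematic_retarget):
    while True:
        human_pose = hand_pose_q.get()
        target_q.put(kinematic_retarget(human_pose))   # gripper joint targets

def control_loop(apply_motion_policies, send_to_robot):
    while True:
        targets = target_q.get()
        send_to_robot(apply_motion_policies(targets))  # Riemannian motion policies
```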
[0079] In at least one embodiment, DART is used for continuous pose and joint angle tracking of a human hand. In at least one embodiment, DART uses an articulated model of the hand that is registered against an input point cloud. A human hand model may be obtained and turned into a single mesh model. Utilizing computer-aided design (“CAD”) software, the fingers of the mesh model may be separated into their respective proximal, medial, and distal links, and re-exported as separate meshes along with an associated extensible markup language (“XML”) file that describes their kinematic arrangement. In total, the human hand model may possess 20 revolute joints: four joints per finger with one abduction joint and three flexion joints.
[0080] In at least one embodiment, DART is a model-based tracker that relies on non-linear optimization and initialization (e.g., from the previous frame or an initial guess). In some examples, if this initialization is not within the basin of convergence, the tracker can fail to converge to the correct solution. In various embodiments, when tracking the human hand model with point cloud data, the hand model may often snap to spurious local minima, leading to tracking failures every few minutes. Therefore, to reliably track the human hand over long periods of time as needed for teleoperation, it may be desirable to have reliable hand pose priors, clean hand segmentation, and a multi-view camera studio to prevent the hand model from snapping onto unexpected local minima. In various embodiments, one method for generating hand pose priors is training a neural network on a large dataset of human hand poses given camera images.
[0081] In at least one embodiment, data collection is initiated with DART and no hand pose priors, seeding the training of an initial network to produce hand priors. Subsequently, DART and the latest trained neural network may generate increasing amounts of data. In at least one embodiment, the network is perpetually updated with the latest datasets to generate increasingly better priors for DART, which may ultimately extend the range over which DART can operate without any failures. In some examples, the hand pose neural network may be a PointNet-based architecture which operates directly on fused point cloud data obtained by back-projecting depth images from extrinsically calibrated depth cameras into a single global reference frame, with annotations provided by DART. In various embodiments, since the fused point cloud contains both points on the table as well as points on the human body and arm, it may be imperative to first localize the hand. Table points may be removed by fitting a plane, and the remaining points containing the arm and human body may be fed to PointNet, which may localize the hand as well as provide the hand pose. PointNet may be based on estimating hand pose via a vote-based regression scheme to the 3D positions of specified keypoints on the hand, a technique which may be associated with spatial-softmax often used in 2D keypoint localization. In various embodiments, PointNet may be trained to predict 3D coordinates of 23 keypoints specified on the hand: four joint keypoints for each of the five fingers and three keypoints on the back of the hand for hand pose estimation. The loss function may be the Euclidean distance between the predicted and ground-truth keypoints. Additionally, an auxiliary segmentation loss may be included to obtain hand segmentation. For efficiency reasons, any input point cloud may be sub-sampled uniformly to a fixed 8192×3 size before being fed to PointNet. In at least one embodiment, while reasonable hand pose estimation and segmentation may be achieved, high quality predictions for the 20 joint keypoints on the fingers may not yet be achieved. In at least one embodiment, the uniform sub-sampling used at the input may indicate that points on the fingers are not densely sampled, and therefore a second-stage refinement may be needed, which resamples points on the hand from the original raw point cloud given the pose and segmentation of the first stage. In at least one embodiment, the second stage may be trained on the same loss functions, but may instead use only the points sampled on the hand to accurately predict the 23 keypoints. In at least one embodiment, to enable robustness to any inaccuracies in the hand pose from the first stage, random perturbations may be added to the hand pose for the second stage. FIG. 4 depicts the second-stage refinement within the system, in accordance with at least one embodiment. In at least one embodiment, both stages of PointNet may be trained on 100K point clouds collected over a batch of 30-45 minutes each for 7-8 hours in total by running DART to provide annotations for keypoints, joint angles, and segmentation. In at least one embodiment, to provide joint angle priors for the fingers, a third neural network may be trained that maps keypoint locations predicted by PointNet to corresponding joint angles. This neural network, which may be referred to as JointNet, may be a two-layer fully connected network that takes an input of size 23×3 and predicts a 20-dimensional vector of joint angles for the fingers.
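As a concrete illustration of the JointNet described above, the following PyTorch sketch shows a two-layer fully connected network that maps the 23×3 keypoint predictions to 20 finger joint angles. The hidden width, activation, and example usage are assumptions for illustration; only the input and output dimensions come from the description.

```python
import torch
import torch.nn as nn

class JointNet(nn.Module):
    """Two-layer fully connected network: 23 x 3 keypoints -> 20 finger joint angles.

    The hidden width (128) is an illustrative assumption; the text only specifies the
    input (23 x 3 keypoints) and output (20 joint angles) dimensions.
    """
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),             # (B, 23, 3) -> (B, 69)
            nn.Linear(23 * 3, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 20),    # 20-dimensional vector of finger joint angles
        )

    def forward(self, keypoints):     # keypoints: (B, 23, 3)
        return self.net(keypoints)

# Example: predict joint angles for a batch of keypoint sets produced by PointNet.
model = JointNet()
keypoints = torch.randn(4, 23, 3)
joint_angles = model(keypoints)       # shape (4, 20)
```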
[0082] In at least one embodiment, the neural networks are trained on data collected across multiple human hands, ensuring accurate pose fits for this system and enabling sensible priors for DART. In some embodiments, the hand tracker may work better for hands geometrically close to the DART human hand model.
[0083] In at least one embodiment, teleoperation of a robotic hand that is kinematically disparate from the human hand may require a module that can map the observed human hand joints to the robot hand joints, which can be referred to in some embodiments as the Allegro hand joints. FIG. 5 illustrates an example of a human hand pose 502, and a robotic gripper 504 performing a corresponding pose, according to at least one embodiment. There may be many different approaches to kinematic retargeting. For instance, in at least one embodiment, a module may be used to match the positions from the palm to the fingertips and medial joints, and the directionality of proximal phalanges and thumb distal phalange. In at least one embodiment, the optimized mapping may be used to label human depth images such that a deep network can ingest a depth image and output joint angles. In at least one embodiment, motion retargeting is also utilized. For instance, a deep recurrent neural network may be trained unsupervised to retarget motion between skeletons. In at least one embodiment, the system utilizes fingertip task-space metrics because distal regions may be of the highest priority in grasping and manipulation tasks as measured by their contact prevalence, degree of innervation, and heightened controllability for fine, in-hand manipulation skill. In at least one embodiment, the joint axes and locations between two hands may be different, and therefore, no metrics directly comparing joint angles between the two hands may be used. In at least one embodiment, to capture and optimize for the positioning of fingertips, both distance and direction among fingertips are considered. Specifically, in at least one embodiment, the cost function for kinematic retargeting may be chosen as:
$$ C(q_h, q_a) = \tfrac{1}{2}\sum_{i=0}^{N} s(d_i)\,\lVert r_i(q_a) - f(d_i)\,\hat{r}_i(q_h) \rVert^2 + \gamma\,\lVert q_a \rVert^2 $$
[0084] where $q_h$ and $q_a$ may be the angles of the human hand model and Allegro hand, respectively, and $r_i \in \mathbb{R}^3$ may be the vector pointing from the origin of one coordinate system to another, expressed in the origin coordinate system (see FIG. 5). Furthermore, in at least one embodiment,
$$ d_i = \lVert r_i(q_h) \rVert \quad \text{and} \quad \hat{r}_i(q_h) = \frac{r_i(q_h)}{\lVert r_i(q_h) \rVert}. $$
The switching weight function $s(d_i)$ may be defined as:
$$ s(d_i) = \begin{cases} 1, & d_i > \epsilon \\ 200, & d_i \le \epsilon,\ r_i(q_h) \in S_1 \\ 400, & d_i \le \epsilon,\ r_i(q_h) \in S_2 \end{cases} $$
[0085] where $S_1$ may be vectors that originate from a primary finger (index, middle, ring) and point to the thumb, and $S_2$ may be vectors between two primary fingers when both primary fingers have associated vectors in $S_1$ (e.g., both primary fingers are being projected with the thumb). In at least one embodiment, the distancing function $f(d_i) \in \mathbb{R}$ is defined as:
$$ f(d_i) = \begin{cases} \beta\, d_i, & d_i > \epsilon \\ \eta_1, & d_i \le \epsilon,\ r_i(q_h) \in S_1 \\ \eta_2, & d_i \le \epsilon,\ r_i(q_h) \in S_2 \end{cases} $$
[0086] where $\beta = 1.6$ may be a scaling factor, $\eta_1 = 1 \times 10^{-4}$ m may be a distance between a primary finger and the thumb, and $\eta_2 = 3 \times 10^{-2}$ m may be a minimum separation distance between two primary fingers when both primary fingers are being projected with the thumb. In at least one embodiment, these projections ensure that contacts between primary fingers and the thumb are close without inducing primary finger collisions in a precision grasp. In at least one embodiment, this may be particularly useful in the presence of visual finger-tracking inaccuracies. In some examples, the vectors $r_i$ may not only capture distance and direction from one task space to another, but their expression in local coordinates may further contain information on how the coordinate systems, and thereby the fingertips, are oriented with one another. In at least one embodiment, the coordinate systems of the human hand model may therefore have equivalent coordinate systems on the Allegro model with similarity in orientation and placement. The vectors shown in FIG. 5 may form a minimal set that produces the desired retargeting behavior. In some embodiments, $\gamma = 2.5 \times 10^{-3}$ may be a weight on regularizing the Allegro angles to zero (equivalent to fully opening the hand). In at least one embodiment, this term helps with reducing redundancy in the solution and ensures that the hand never enters strange minima that may be difficult to recover from (e.g., the fingers embedding themselves into the palm). In at least one embodiment, various mappings from a human hand 602-617 to an Allegro robotic hand 618-633, as produced by the kinematic retargeting, are shown in FIG. 6.
[0087] In at least one embodiment, the above cost function is minimized in real-time using the Sequential Least-Squares Quadratic Programming (“SLSQP”) algorithm. In at least one embodiment, the routine is initiated with Allegro joint angles set to zero, and every solution thereafter may be initiated with the preceding solution. In at least one embodiment, the forward kinematic calculations between the various coordinate systems of both the human hand model and Allegro hand are found. In at least one embodiment, a first-order low-pass filter is applied to the raw retargeted joint angles in order to remove high-frequency noise present in tracking the human hand and to smooth discrete events, like the projection algorithm inducing step-response changes in retargeted angles.
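A minimal sketch of this retargeting step, assuming a caller-supplied forward-kinematics function for the Allegro fingertip vectors and using SciPy's SLSQP routine, is shown below. The trivial default weights, the 16-joint dimension, and all function names are assumptions for illustration, not the patent's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def retarget(human_vectors, allegro_vectors_fn, n_joints=16,
             s=None, f=None, gamma=2.5e-3, q_init=None):
    """Minimize C(q_a) = 1/2 * sum_i s_i * ||r_i(q_a) - f_i * r_hat_i(q_h)||^2 + gamma*||q_a||^2.

    human_vectors:      list of r_i(q_h) vectors computed from the tracked human hand model
    allegro_vectors_fn: callable q_a -> list of r_i(q_a) vectors from Allegro forward
                        kinematics (assumed to be supplied by the caller)
    s, f:               per-vector switching weights s(d_i) and projected distances f(d_i);
                        they default to the trivial case s_i = 1, f_i = d_i for illustration.
    """
    d = [np.linalg.norm(r) for r in human_vectors]
    r_hat = [r / max(di, 1e-9) for r, di in zip(human_vectors, d)]
    s = [1.0] * len(d) if s is None else s
    f = d if f is None else f

    def cost(q_a):
        r_a = allegro_vectors_fn(q_a)
        err = sum(si * np.sum((ra - fi * rh) ** 2)
                  for si, ra, fi, rh in zip(s, r_a, f, r_hat))
        return 0.5 * err + gamma * float(np.dot(q_a, q_a))

    q0 = np.zeros(n_joints) if q_init is None else q_init  # warm start with previous solution
    return minimize(cost, q0, method="SLSQP").x
```

In practice, each call could be warm-started with the preceding solution and the output passed through the low-pass filter described above, mirroring the real-time loop in this paragraph.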
[0088] Riemannian Motion Policies (“RMPs”), in an embodiment, are real-time motion generation methods that calculate acceleration fields from potential function gradients and corresponding Riemannian metrics. RMPs may combine the generation of multi-priority Cartesian trajectories and collision avoidance behaviors together in one cohesive framework. In at least one embodiment, they are used to control the Cartesian pose of the Allegro palm given the observed human hand pose while avoiding arm-palm collisions with the table or operator using collision planes. Given these objectives, in at least one embodiment, the RMP-generated target arm joint trajectories are sent to the arm’s torque-level impedance controller at 200 Hz. In at least one embodiment, the kinematically retargeted Allegro angles are sent to the torque-level joint controller at 30 Hz. In at least one embodiment, a teleoperation instance is initialized by registering the studio cameras with the robot base coordinate system via an initial, static robot pose and the initial observation of the human hand. In at least one embodiment, the human hand model axes and robot end-effector axes are approximately aligned such that directions of movement are preserved between human hand motion and robot motion.
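The first-order low-pass filtering of the retargeted angles mentioned in the preceding paragraphs can be written as a simple exponential smoother; the smoothing coefficient and the 16-joint example below are assumed values chosen only for illustration.

```python
import numpy as np

class FirstOrderLowPass:
    """First-order low-pass filter: y_t = alpha * x_t + (1 - alpha) * y_{t-1}.

    Used here to smooth the raw retargeted joint angles before they are sent to the
    joint controller; alpha is an illustrative assumption, not a value from the patent.
    """
    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.state = None

    def __call__(self, x):
        x = np.asarray(x, dtype=float)
        self.state = x if self.state is None else self.alpha * x + (1 - self.alpha) * self.state
        return self.state

# Example: smooth a stream of retargeted Allegro joint angles (16 joints assumed).
lp = FirstOrderLowPass(alpha=0.2)
smoothed = lp(np.zeros(16))
```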
[0089] Overall, the system can be reliably used to solve a variety of tasks spanning a range of difficulty. In some examples, the ability to solve these tasks reveals that the system may have the dexterity to exhibit precision and power grasps, multi-fingered prehensile and non-prehensile manipulation, in-hand finger gaiting, and compound in-hand manipulation (e.g., grasping with two fingers while simultaneously manipulating with the remaining fingers).
[0090] In at least one embodiment, the system may enable a highly-actuated hand-arm system to find a motor solution to a variety of manipulation tasks by translating observed human hand and finger motion to robot arm and finger motion. In at least one embodiment, several tasks, like extracting paper money from a wallet and opening a cardboard box within a plastic container, may be so complex that hand-engineering a robot solution or applying learning methods directly may be intractable. Solving these tasks and others through the embodied robotic system may allow for these solutions to be generated on-demand for many demonstrations. Furthermore, creating these solutions on the system itself may allow for the reading, access, and storage of the various tactile signals in the robot’s fingertips, various commanded and measured joint position and velocity signals through the hand and arm, various torque commands throughout the system, and any camera feeds associated with the system. In at least one embodiment, this rich source of data together with demonstrations of tasks may be used to solve complex, multi-stage, long-horizon tasks.
[0091] In an embodiment, a system is developed to track in-hand objects during robot manipulation. In various embodiments, the system may be implemented by one or more systems as described/depicted in FIGS. 13-41. As depicted in FIG. 7, the system may comprise a GPU-accelerated, high-fidelity physics simulator as the forward dynamics model with a sample-based optimization framework to track object poses with contact feedback. In at least one embodiment, a concurrent set of simulations is initialized with the initial states of a real robot and the initial pose of a real object, which may be obtained from a vision-based pose registration algorithm assuming the object is not in occlusion in the beginning. In at least one embodiment, the initial poses of the simulated objects are slightly perturbed and reflect the uncertainty of the vision-based pose registration algorithm. In at least one embodiment, the GPU-accelerated physics simulator runs many concurrent simulations in real-time on a single GPU. In at least one embodiment, a given policy is utilized that controls the real robot to approach, grasp, and manipulate the object in-hand, and the same robot control commands are run on the simulated robots. In at least one embodiment, observations of the real robot and the simulated robots are collected, which include terms like the magnitude and direction of contacts on the robot hand’s contact sensors. In at least one embodiment, a sample-based optimization algorithm is utilized that periodically updates the states and parameters of the simulations according to a cost function that captures how well the observations of each simulation match with those of the real world. In addition, in at least one embodiment, the algorithm also updates simulation parameters, such as mass and friction, to further improve the simulations’ dynamics models of the real world. At any point in time, the object pose estimate may be the pose of the robot-object system.
[0092] In various embodiments, to evaluate the proposed algorithm, a total of 24 in-hand manipulation trajectories with three different objects in simulation and in the real world may be collected, although any number of trajectories may be collected. At least one embodiment utilizes a robot arm such as the Kuka IIWA7 arm with the 4-finger Wonik Robotics Allegro hand as the end-effector, with each finger outfitted with a SynTouch BioTac contact sensor. In at least one embodiment, object manipulation trajectories are human demonstrations collected via a hand-tracking teleoperation system. In various embodiments, because ground-truth object poses are available in simulation, detailed ablation studies are performed in simulation experiments to evaluate the properties of the proposed algorithm. In at least one embodiment, a vision-based algorithm is utilized to obtain the object pose in the first and last frame of the collected trajectories, where the object is not in occlusion. In at least one embodiment, the pose in the first frame is used to initialize the simulations, and the pose in the last frame is used to evaluate the accuracy of the proposed contact-based algorithm.
[0093] FIG. 7 illustrates an embodiment of an in-hand object pose tracking framework. In at least one embodiment, robotic controls 702 are sent to a GPU-accelerated physics simulator that runs many robot simulations in parallel 708, each with different physics parameters and perturbed object poses. In at least one embodiment, costs based on observations, such as contact feedback from the real world and from the simulations, are passed to a sample-based derivative-free optimizer 704 that periodically updates the states and parameters of all simulations to better match that of the real world. In at least one embodiment, at any point in time, the pose of the simulation with the lowest cost is chosen as the current object pose estimate 706.
[0094] In an embodiment, a system tracks the pose of an object held in-hand by a robot manipulator during object manipulation. In some embodiments, for time, which may be represented by $t$, an object pose may be defined as $p_t \in SE(3)$, and a physics dynamics model may be defined as $s_{t+1} = f(s_t, u_t, \theta)$, where $s_t$ may be the state of the world (position and velocities of rigid bodies and of joint angles in articulated bodies), $u_t \in \mathbb{R}^M$ may be the robot controls (desired joint positions may be utilized as the action space), and $\theta \in \mathbb{R}^N$ may be the fixed parameters of the simulation (e.g., mass and friction).
[0095] In various embodiments, for a simulation model $f$ that exactly matches reality given perfect initializations of $p_0$, $s_0$, and $\theta$, pose estimation may require only playing back the sequence of actions $u_t$ applied to the robot in the simulation. However, as forward models may be imperfect and pose initializations may be noisy, pose estimation can be improved through observation feedback.
[0096] In some embodiments, $D$ may be defined as the number of joints the robot has and $L$ may be defined as the number of its contact sensors. An observation vector $o_t$ may be defined as the concatenation of the joint position configuration of the robot $q_t \in \mathbb{R}^D$, the position and rotation of the robot’s contact sensors $P_t^{(l)} \in \mathbb{R}^3$, $R_t^{(l)} \in SO(3)$ (which may be located on the fingertips), the force vectors of the sensed contacts $c_t^{(l)} \in \mathbb{R}^3$, the unit vector in the direction of the translational slippage on the contact surface $d_t^{(l)} \in \mathbb{R}^2$, and the binary direction of the rotational slippage on the contact surface $r_t^{(l)} \in \{0, 1\}$, where $l$ may index the $l$th contact sensor. In at least one embodiment, for general in-hand pose estimation, given the current and past observations $o_{1:t}$, the robot controls $u_{1:t}$, and the initial pose $p_0$, the most probable current object pose $p_t$ is determined.
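The observation vector defined above maps naturally onto a small container type; the following sketch uses assumed field names that mirror the quantities $q_t$, $P_t^{(l)}$, $R_t^{(l)}$, $c_t^{(l)}$, $d_t^{(l)}$, and $r_t^{(l)}$.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    """Per-timestep observation o_t for a robot with D joints and L contact sensors.

    Field names are assumptions for illustration; the text defines only the quantities.
    """
    joint_positions: np.ndarray     # q_t, shape (D,)
    sensor_positions: np.ndarray    # P_t^(l), shape (L, 3)
    sensor_rotations: np.ndarray    # R_t^(l), shape (L, 3, 3), rotation matrices in SO(3)
    contact_forces: np.ndarray      # c_t^(l), shape (L, 3)
    translational_slip: np.ndarray  # d_t^(l), shape (L, 2), unit slip directions
    rotational_slip: np.ndarray     # r_t^(l), shape (L,), binary rotational slip direction
```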
[0097] In various embodiments, a GPU-accelerated physics simulator may be utilized as a forward dynamics model to concurrently simulate many robot-object environments to track in-hand object pose, and derivative-free, sample-based optimizers may be utilized to jointly tune the state and parameters of the simulations to improve tracking performance. FIG. 9 depicts an example embodiment of an algorithm that may be utilized.
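To make the joint state-and-parameter tuning concrete, the sketch below shows one derivative-free, sample-based update of the kind this paragraph describes: each parallel simulation is scored against the real observation, the lowest-cost simulations are kept, and their states and physics parameters (e.g., mass and friction) are perturbed to form the next population. The keep fraction, noise scales, and array layout are assumptions, not the algorithm of FIG. 9 as claimed.

```python
import numpy as np

def resample_simulations(sim_states, sim_params, costs,
                         keep_frac=0.25, state_noise=0.003, param_noise=0.05, rng=None):
    """One derivative-free update over a population of K parallel simulations.

    sim_states: (K, S) simulation state vectors (e.g., object position and joint values)
    sim_params: (K, P) physics parameters per simulation (e.g., mass, friction)
    costs:      (K,) observation-matching cost of each simulation (lower is better)
    Returns resampled states and parameters centered on the best simulations.
    The keep fraction and noise scales are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(costs)
    n_keep = max(1, int(keep_frac * K))
    elite = np.argsort(costs)[:n_keep]        # indices of the lowest-cost simulations

    src = rng.choice(elite, size=K)           # resample every slot from an elite simulation
    new_states = sim_states[src] + rng.normal(0.0, state_noise, sim_states.shape)
    new_params = sim_params[src] * (1.0 + rng.normal(0.0, param_noise, sim_params.shape))
    return new_states, new_params

# At any time, the pose estimate would be read from the lowest-cost simulation,
# e.g. best = int(np.argmin(costs)).
```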
……
……
……