Microsoft Patent | Methods For Two-Stage Hand Gesture Input

Editor: Yingwei | Category: Microsoft | September 25, 2020

Patent: Methods For Two-Stage Hand Gesture Input

Publication Number: 20200301513

Publication Date: 20200924

Applicants: Microsoft

Abstract

A method for two-stage hand gesture input comprises receiving hand tracking data for a hand of a user. Using a gesture recognition machine, it is determined whether the user has performed a ready-bloom gesture based on one or more parameters derived from the received hand tracking data satisfying ready-bloom gesture criteria. If the user has performed a ready-bloom gesture, the gesture recognition machine next determines whether the user has performed a bloom-out gesture based on one or more parameters derived from the received hand tracking data satisfying bloom-out gesture criteria. The bloom-out gesture criteria are only satisfiable from the performed ready-bloom gesture. A visual affordance is then displayed responsive to determining that the user has performed the bloom-out gesture.

BACKGROUND

[0001] Virtual and augmented reality applications may rely on gesture input provided by a user to evoke specific commands and actions. Depth and visual cameras may enable hand-tracking applications to recognize and stratify various gesture commands.

SUMMARY

[0002] A method for two-stage hand gesture input comprises receiving hand tracking data for a hand of a user. Using a gesture recognition machine, it is determined whether the user has performed a ready-bloom gesture based on one or more parameters derived from the received hand tracking data satisfying ready-bloom gesture criteria. If the user has performed a ready-bloom gesture, the gesture recognition machine next determines whether the user has performed a bloom-out gesture based on one or more parameters derived from the received hand tracking data satisfying bloom-out gesture criteria. The bloom-out gesture criteria are only satisfiable from the performed ready-bloom gesture. A visual affordance is then displayed responsive to determining that the user has performed the bloom-out gesture.

[0003] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

[0004] FIG. 1 shows an example augmented reality use environment for a user wearing a head-mounted display.

[0005] FIG. 2 shows an illustration of a hand of a user performing a one-stage bloom gesture.

[0006] FIG. 3 shows a schematic view of a head-mounted display device according to an example of the present disclosure.

[0007] FIG. 4 shows an example method for two-stage hand gesture input.

[0008] FIG. 5A shows aspects of an example virtual skeleton.

[0009] FIG. 5B shows aspects of a hand portion of an example virtual skeleton.

[0010] FIG. 6 shows a hand portion of a virtual skeleton performing a two-stage bloom gesture.

[0011] FIG. 7 shows illustrations of various ready-affordances.

[0012] FIG. 8 shows an illustration of a user interacting with a visual affordance.

[0013] FIG. 9 shows a schematic view of an example computing device.

DETAILED DESCRIPTION

[0014] Various technologies may allow a user to experience a mix of real and virtual worlds. For example, some display devices, such as various head-mounted display devices, may comprise see-through displays that allow superposition of displayed images over a real-world background environment. The images may appear in front of the real-world background environment when viewed through the see-through display. In particular, the images may be displayed on the see-through display such that they appear intermixed with elements in the real-world background environment in what may be referred to as augmented reality.

[0015] FIG. 1 is a schematic illustration of a user 100 wearing head-mounted display device 105 and standing in the real-world physical environment of room 110. The room 110 includes a number of physical objects and surfaces, such as walls 114, 116 and 118, couch 122, bookcase 130, and lamp 134, all of which are visible to the user via a see-through display of head-mounted display device 105.

[0016] Head-mounted display device 105 may display to user 100 virtual content that appears to be located at different three-dimensional locations within room 110. In the example of FIG. 1, head-mounted display device 105 displays virtual content in the form of a holographic motorcycle 138, holographic panda 140, and holographic wizard 142.

[0017] Head-mounted display device 105 may have a field of view, indicated by dotted lines 150, that defines a volume of space in which the user may view virtual content displayed by the device. In different examples of head-mounted display device 105, the field of view may have different shapes, such as cone-shaped, frustum-shaped, pyramid-shaped, or any other suitable shape. In different examples of head-mounted display device 105, the field of view also may have different sizes that occupy different volumes of space.

[0018] Sensors included in head-mounted display device 105 may enable natural user interface (NUI) controls, such as gesture inputs based on gestures performed by user’s hand 160 when user’s hand 160 is within the field of view of outward facing imaging sensors of head-mounted display 105. In this way, user 100 may interact with virtual content without being required to hold a controller or other input device, thus freeing user 100 to interact with real-world and/or virtual world objects with either hand.

[0019] Virtual and augmented reality devices and applications may rely on recognizing gesture commands to provide an intuitive interface. However, without employing a controller, user 100 does not have access to dedicated inputs for switching between applications, calling a system menu, adjusting parameters, etc. In some examples, a system and/or application may desire to provide an on-demand visual affordance, such as a menu. Recognition of a specific, pre-determined gesture may trigger the visual display of such a visual affordance.

[0020] However, many intuitive hand gestures are difficult to discern from one another given the current accuracy of hand tracking technology. Users may trigger the display of a menu unintentionally when using hand gestures to assist their conversation, presentation, or other actions that may confuse the system. These false activations may force users to exit the current application, stopping them from their current work (e.g., interrupting an important public presentation). It is possible to use a mini menu for further confirmation before exiting the currently-used application and thereby avoid unintentional switching. However, this may be annoying to the user or otherwise undesirable.

[0021] By reserving specific gestures for system functions, user intent may be easier to discern. One gesture for calling an affordance is the “bloom” gesture. As shown at 200 in FIG. 2, the gesture begins with the five fingertips of the hand held close together and pointing upwards. The user then spreads the fingers apart, opening the hand with the palm facing upwards, as shown at 210. The bloom gesture may be recognized as a continuous, single gesture with motion features. The gesture may be recognized when performed with either hand of the user. The gesture may be assigned multiple functions vis a vis calling a visual affordance. For example, performing a first bloom gesture may result in displaying a menu on the head-mounted display. Performing a second bloom gesture may dismiss the menu. Performing the bloom gesture from the core shell of the head-mounted display operating system may result in the display of a system menu, while performing the bloom gesture while inside an application may result in the display of an application-specific menu.

[0022] However, to efficiently recognize the bloom gesture, parameters may be relaxed in order to gain a bigger range of deployment. This may result in a high false positive rate. As a result, if the user is talking with their hands in motion, the bloom gesture may often be mimicked, and the user may unintentionally deploy the visual affordance.

[0023] Herein, examples are provided where the bloom gesture is separated into two segments, rather than a single motion gesture. Separating the continuous gesture into two different gestures allows for a two-step process that reduces false activations and prevents unintentional actions from taking place. This may enable algorithms to provide full control over when the user wants to (or does not want to) trigger a specific visual affordance. The bloom gesture is separated into a “ready-bloom” gesture, where all of the user’s fingertips are close together and facing up, and a “bloom-out” gesture, where the user opens their hand from the ready-bloom gesture. Additional stages of the gesture may be added, such as a “bloom-in” gesture, where the user returns their hand to the ready-bloom gesture conformation from an open hand.

[0024] The two-step process further enables the possibility for the operating system to embed customizable variables between the steps in order to prevent false activations and to provide a higher level of control and certainty to the user. In some examples, additional affordances may be displayed to the user while the two-stage gesture is being performed. For example, completing the ready-bloom gesture may result in the display of a virtual holographic affordance above the fingertips that confirms completion of the ready-bloom gesture, and confirms that subsequently performing the bloom-out gesture will give the desired effect (e.g., display of a visual affordance). This provides direct feedback to the user by showcasing the state of the system, creating an enhanced user experience with a highly polished interaction pattern.

[0025] FIG. 3 schematically illustrates an example head-mounted display device 300. The head-mounted display device 300 includes a frame 302 in the form of a band wearable around a head of the user that supports see-through display componentry positioned near the user’s eyes. Head-mounted display device 300 may use augmented reality technologies to enable simultaneous viewing of virtual display imagery and a real-world background. As such, the head-mounted display device 300 may generate virtual images via see-through display 304, which includes separate right and left eye displays 304R and 304L, and which may be wholly or partially transparent. The see-through display 304 may take any suitable form, such as a waveguide or prism configured to receive a generated image and direct the image towards a wearer’s eye. The see-through display 304 may include a backlight and a microdisplay, such as liquid-crystal display (LCD) or liquid crystal on silicon (LCOS) display, in combination with one or more light-emitting diodes (LEDs), laser diodes, and/or other light sources. In other examples, the see-through display 304 may utilize quantum-dot display technologies, active-matrix organic LED (OLED) technology, and/or any other suitable display technologies. It will be understood that while shown in FIG. 3 as a flat display surface with left and right eye displays, the see-through display 304 may be a single display, may be curved, or may take any other suitable form.

[0026] The head-mounted display device 300 further includes an additional see-through optical component 306, shown in FIG. 3 in the form of a see-through veil positioned between the see-through display 304 and the real-world environment as viewed by a wearer. A controller 308 is operatively coupled to the see-through optical component 306 and to other display componentry. The controller 308 includes one or more logic devices and one or more computer memory devices storing instructions executable by the logic device(s) to enact functionalities of the head-mounted display device 300. The head-mounted display device 300 may further include various other components, for example an outward facing two-dimensional image camera 310 (e.g. a visible light camera and/or infrared camera), an outward facing depth imaging device 312, and an inward-facing gaze-tracking camera 314 (e.g. a visible light camera and/or infrared camera), as well as other components that are not shown, including but not limited to speakers, microphones, accelerometers, gyroscopes, magnetometers, temperature sensors, touch sensors, biometric sensors, other image sensors, eye-gaze detection systems, energy-storage components (e.g. battery), a communication facility, a GPS receiver, etc.

[0027] Depth imaging device 312 may include an infrared light-based depth camera (also referred to as an infrared light camera) configured to acquire video of a scene including one or more human subjects. The video may include a time-resolved sequence of images of spatial resolution and frame rate suitable for the purposes set forth herein. The depth imaging device and/or a cooperating computing system (e.g., controller 308) may be configured to process the acquired video to identify one or more objects within the operating environment, one or more postures and/or gestures of the user wearing head-mounted display device 300, one or more postures and/or gestures of other users within the operating environment, etc.

[0028] The nature and number of cameras may differ in various depth imaging devices consistent with the scope of this disclosure. In general, one or more cameras may be configured to provide video from which a time-resolved sequence of three-dimensional depth maps is obtained via downstream processing. As used herein, the term “depth map” refers to an array of pixels registered to corresponding regions of an imaged scene, with a depth value of each pixel indicating the distance between the camera and the surface imaged by that pixel.

[0029] In some implementations, depth imaging device 312 may include right and left stereoscopic cameras. Time-resolved images from both cameras may be registered to each other and combined to yield depth-resolved video.
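As one non-limiting illustration of how registered stereo images may yield a depth value, the following Python sketch applies the standard pinhole relation for a rectified stereo pair, in which depth equals focal length times baseline divided by disparity. The disclosure does not specify this computation; the function name, parameter names, and units are assumptions chosen for illustration.

```python
def depth_from_disparity(disparity_px: float,
                         focal_length_px: float,
                         baseline_m: float) -> float:
    """Depth in meters of a point seen by a rectified stereo pair.

    disparity_px    -- horizontal pixel offset of the point between the
                       registered left and right images (assumed input)
    focal_length_px -- camera focal length expressed in pixels
    baseline_m      -- distance between the two camera centers in meters
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical values: 500 px focal length, 10 cm baseline, 50 px disparity.
depth_m = depth_from_disparity(50.0, 500.0, 0.1)
```

Applying the function to every matched pixel pair of a frame would produce one depth map of the depth-resolved video.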

[0030] In some implementations, a “structured light” depth camera may be configured to project a structured infrared illumination having numerous, discrete features (e.g., lines or dots). A camera may be configured to image the structured illumination reflected from the scene. Based on the spacings between adjacent features in the various regions of the imaged scene, a depth map of the scene may be constructed.

[0031] In some implementations, a “time-of-flight” (TOF) depth camera may include a light source configured to project a modulated infrared illumination onto a scene. The camera may include an electronic shutter synchronized to the modulated illumination, thereby allowing a pixel-resolved phase-delay between illumination times and capture times to be observed. A time-of-flight of the modulated illumination may be calculated.
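As an illustrative sketch only, a per-pixel phase delay may be converted into a distance using the relation commonly applied to continuous-wave time-of-flight cameras, d = c * phase / (4 * pi * f), where f is the modulation frequency and the factor of two in the round trip is folded into the denominator. The disclosure does not state this formula; the names and example values below are hypothetical.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0


def tof_depth_m(phase_delay_rad: float, modulation_hz: float) -> float:
    """Distance implied by a phase delay of modulated illumination.

    The light travels to the surface and back, so the one-way distance
    is half of the distance covered during the measured delay.
    """
    if modulation_hz <= 0:
        raise ValueError("modulation frequency must be positive")
    return SPEED_OF_LIGHT_M_S * phase_delay_rad / (4.0 * math.pi * modulation_hz)


def unambiguous_range_m(modulation_hz: float) -> float:
    """Largest distance measurable before the phase wraps past 2*pi."""
    return SPEED_OF_LIGHT_M_S / (2.0 * modulation_hz)
```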

[0032] The above cameras are provided as examples, and any sensor capable of detecting hand gestures may be used.

[0033] Head-mounted display 300 further includes a gesture-recognition machine 316, and an eye-tracking machine 318. Gesture-recognition machine 316 is configured to process at least the depth video (i.e., a time-resolved sequence of depth maps and/or raw sensor data) from depth imaging device 312 and/or image data from outward facing two-dimensional image camera 310, to identify one or more human subjects in the depth video, to compute various geometric (e.g., skeletal) features of the subjects identified, and to gather from the geometric features various postural or gestural information to be used as NUI.

[0034] In one non-limiting embodiment, gesture-recognition machine 316 identifies at least a portion of one or more human subjects in the depth video. Through appropriate depth-image processing, a given locus of a depth map may be recognized as belonging to a human subject. In a more particular embodiment, pixels that belong to a human subject may be identified (e.g., by sectioning off a portion of a depth map that exhibits above-threshold motion over a suitable time scale) and a generalized geometric model of a human being may be derived from those pixels.

[0035] In one embodiment, each pixel of a depth map may be assigned a person index that identifies the pixel as belonging to a particular human subject or non-human element. As an example, pixels corresponding to a first human subject can be assigned a person index equal to one, pixels corresponding to a second human subject can be assigned a person index equal to two, and pixels that do not correspond to a human subject can be assigned a person index equal to zero. Further indices may be used to label pixels corresponding to different body parts. For example, pixels imaging a left hand may be labeled with a different index than pixels imaging a right hand; or pixels imaging a pointer finger may be labeled with a different index than pixels imaging a thumb.
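By way of a minimal, hypothetical example, a person index map may be held as a two-dimensional array of small integers registered to the depth map, using the index convention described above (zero for non-human pixels, one and two for the first and second human subjects). The tiny 3x4 map and the helper name are assumptions for illustration.

```python
from collections import Counter
from typing import List

# Hypothetical 3x4 person index map registered to a depth map.
# 0 = not a human subject, 1 = first subject, 2 = second subject.
person_index_map: List[List[int]] = [
    [0, 0, 1, 1],
    [0, 1, 1, 2],
    [0, 0, 2, 2],
]


def pixels_per_subject(index_map: List[List[int]]) -> Counter:
    """Count how many pixels carry each person index."""
    return Counter(value for row in index_map for value in row)


counts = pixels_per_subject(person_index_map)
```

A second map of the same shape could carry body-part indices in the same manner.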

[0036] Gesture-recognition machine 316 also may label pixels in any suitable manner. As one example, an artificial neural network may be trained to classify each pixel with appropriate indices/labels. In this way, different features of a hand or other body part may be computationally identified.

[0037] Gesture recognition machine 316 may track different body parts from frame to frame, thereby allowing different gestures to be discerned. For example, the three-dimensional position of fingers may be tracked from frame to frame, thus allowing parameters such as finger position, finger angle, finger velocity, finger acceleration, finger-to-finger proximity, etc. to be discerned.
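As one possible sketch of deriving such parameters, finger velocity and speed may be estimated by finite differences between feature positions in consecutive frames, and finger-to-finger proximity by a Euclidean distance. The disclosure does not prescribe this method; the function names and the use of meters and seconds are assumptions.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def velocity(p_prev: Vec3, p_curr: Vec3, dt_s: float) -> Vec3:
    """Finite-difference velocity of one feature between two frames."""
    if dt_s <= 0:
        raise ValueError("frame interval must be positive")
    return tuple((c - p) / dt_s for p, c in zip(p_prev, p_curr))


def speed(v: Vec3) -> float:
    """Magnitude of a velocity vector."""
    return math.sqrt(sum(c * c for c in v))


def proximity(a: Vec3, b: Vec3) -> float:
    """Distance between two tracked features, e.g. two fingertips."""
    return math.dist(a, b)
```

Acceleration could be estimated in the same way by differencing two successive velocity estimates.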

[0038] The position of the user’s eye(s) may be determined by eye-tracking machine 318 and/or gesture recognition machine 316. Eye-tracking machine 318 may receive image data from inward-facing gaze-tracking camera 314. In some examples, inward-facing gaze-tracking camera 314 includes two or more cameras, including at least one camera trained on the right eye of the user and at least one camera trained on the left eye of the user. As an example, eye-tracking machine 318 may determine the position of the user’s eye based on the center point of the user’s eye and/or the center point of the user’s pupil. Additionally or alternatively, gesture recognition machine 316 may estimate the location of the eye based on the location of the head-joint of the virtual skeleton.

[0039] FIG. 4 shows a method 400 for two-stage hand gesture input. Method 400 may be executed by a computing device, such as a head-mounted display device (e.g., head-mounted display devices 105 and 300 and/or computing system 900 described herein with regard to FIG. 9). Method 400 will primarily be described with regard to augmented reality applications, but may also be applied to virtual reality applications, mixed reality applications, non-immersive applications, and any other applications having a natural user interface configured to receive gesture input.

[0040] At 410, method 400 includes receiving hand tracking data for a hand of a user. Hand tracking data may be derived from received depth information, received RGB image data, received flat IR image data, etc. Data may be received in the form of a plurality of different, sequential frames. The received hand tracking data may include a feature position for each of a plurality of different hand features at each of a plurality of different frames. The received hand tracking data may include data for one or both hands of a user.

[0041] In some embodiments, a gesture recognition machine, such as gesture recognition machine 316, may be configured to analyze the pixels of a depth map that correspond to the user, in order to determine what part of the user’s body each pixel corresponds to. A variety of different body-part assignment techniques can be used to this end. In one example, each pixel of the depth map with an appropriate person index (vide supra) may be assigned a body-part index. The body-part index may include a discrete identifier, confidence value, and/or body-part probability distribution indicating the body part or parts to which that pixel is likely to correspond.

[0042] In some embodiments, machine-learning may be used to assign each pixel a body-part index and/or body-part probability distribution. The machine-learning approach analyzes a user with reference to information learned from a previously trained collection of known poses. During a supervised training phase, for example, a variety of human subjects may be observed in a variety of poses. These poses may include the ready-bloom gesture, the bloom-out gesture, the bloom-in gesture, etc. Trainers provide ground truth annotations labeling various machine-learning classifiers in the observed data. The observed data and annotations are then used to generate one or more machine-learned algorithms that map inputs (e.g., depth video) to desired outputs (e.g., body-part indices for relevant pixels).

[0043] In some implementations, a virtual skeleton or other data structure for tracking feature positions (e.g., joints) may be fit to the pixels of depth and/or color video that correspond to the user. FIG. 5A shows an example virtual skeleton 500. The virtual skeleton includes a plurality of skeletal segments 505 pivotally coupled at a plurality of joints 510. In some embodiments, a body-part designation may be assigned to each skeletal segment and/or each joint. In FIG. 5A, the body-part designation of each skeletal segment 505 is represented by an appended letter: A for the head, B for the clavicle, C for the upper arm, D for the forearm, E for the hand, F for the torso, G for the pelvis, H for the thigh, J for the lower leg, and K for the foot. Likewise, a body-part designation of each joint 510 is represented by an appended letter: A for the neck, B for the shoulder, C for the elbow, D for the wrist, E for the lower back, F for the hip, G for the knee, and H for the ankle. Naturally, the arrangement of skeletal segments and joints shown in FIG. 5A is in no way limiting. A virtual skeleton consistent with this disclosure may include virtually any type and number of skeletal segments, joints, and/or other features.

[0044] In a more particular embodiment, point clouds (portions of a depth map) corresponding to the user’s hands may be further processed to reveal the skeletal substructure of the hands. FIG. 5B shows an example hand portion 515 of a user’s virtual skeleton 500. The hand portion includes wrist joints 520, finger joints 525, adjoining finger segments 530, and adjoining finger tips 535. Joints and segments may be grouped together to form a portion of the user’s hand, such as palm portion 540.

[0045] Via any suitable minimization approach, the lengths of the skeletal segments and the positions and rotational angles of the joints may be adjusted for agreement with the various contours of a depth map. In this way, each joint is assigned various parameters–e.g., Cartesian coordinates specifying joint position, angles specifying joint rotation, and additional parameters specifying a conformation of the corresponding body part (hand open, hand closed, etc.). The virtual skeleton may take the form of a data structure including any, some, or all of these parameters for each joint. This process may define the location and posture of the imaged human subject. Some skeletal-fitting algorithms may use the depth data in combination with other information, such as color-image data and/or kinetic data indicating how one locus of pixels moves with respect to another. In the manner described above, a virtual skeleton may be fit to each of a sequence of frames of depth video. By analyzing positional change in the various skeletal joints and/or segments, the corresponding movements–e.g., gestures or actions of the imaged user–may be determined.
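As a non-limiting sketch of such a data structure, each joint may carry a position, rotation angles, and an optional conformation label, with one skeleton instance per frame. The class names, field names, and joint name below are hypothetical; the disclosure does not define a concrete layout.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Joint:
    position: Vec3                           # Cartesian coordinates
    rotation_deg: Vec3 = (0.0, 0.0, 0.0)     # angles specifying joint rotation
    conformation: Optional[str] = None       # e.g. "hand open", "hand closed"


@dataclass
class SkeletonFrame:
    timestamp_s: float
    joints: Dict[str, Joint] = field(default_factory=dict)


def joint_displacement(prev: SkeletonFrame, curr: SkeletonFrame, name: str) -> Vec3:
    """Positional change of one named joint between two fitted frames."""
    a = prev.joints[name].position
    b = curr.joints[name].position
    return tuple(bi - ai for ai, bi in zip(a, b))
```

Comparing displacements of this kind across a sequence of fitted frames is one way the positional change mentioned above could be analyzed.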

[0046] The foregoing description should not be construed to limit the range of approaches usable to construct a virtual skeleton 500 or otherwise identify various hand features, for hand features may be derived from a depth map and/or other sensor data in any suitable manner without departing from the scope of this disclosure.

[0047] Regardless of the method used to extract features, once identified, each feature may be tracked across frames of the depth and/or image data. The plurality of different hand features may include a plurality of finger features, a plurality of fingertip features, a plurality of knuckle features, a plurality of wrist features, a plurality of palm features, a plurality of dorsum features, etc.

[0048] In some examples, receiving hand tracking data for the first hand of the user includes receiving depth data for an environment, fitting a virtual skeleton to point clouds of the received depth data, assigning hand joints to the virtual skeleton, and tracking positions of the assigned hand joints across sequential depth images.

[0049] Returning to FIG. 4, at 420, method 400 includes, at a gesture recognition machine, determining that the user has performed a ready-bloom gesture based on one or more parameters derived from the received hand tracking data satisfying ready-bloom gesture criteria. For each hand feature, the position, speed, rotational velocity, etc. may be calculated to determine a set of parameters, or pseudo-gesture, and the determined parameters may then be evaluated based on criteria specific to the ready-bloom gesture.

[0050] The ready-bloom gesture, as shown at 200 of FIG. 2, may be identified via a number of specific gesture criteria. For example, the ready-bloom gesture criteria may include a verticality of all finger features being within a threshold of absolute vertical. The verticality of the finger features may be determined based on the verticality of each individual feature and/or the verticality of the fingers as a group. For example, fingertip features may be used to determine a pentagon, and a line normal to the pentagon may then be determined. The angle between this determined normal and absolute vertical may then be determined, and compared to a predetermined threshold (e.g., 10 degrees).
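The pentagon-normal test may be sketched as follows, as one possible realization only. The sketch assumes that +Y is absolute vertical, that the five fingertips are supplied in order around the hand, and that Newell's method (a general technique for the normal of a possibly non-planar polygon, not named in the disclosure) is an acceptable way to obtain the normal. All function names are hypothetical; the 10 degree default mirrors the example threshold above.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
UP: Vec3 = (0.0, 1.0, 0.0)  # assumption: +Y is absolute vertical


def polygon_normal(points: Sequence[Vec3]) -> Vec3:
    """Unit normal of a polygon via Newell's method."""
    nx = ny = nz = 0.0
    n = len(points)
    for i in range(n):
        x1, y1, z1 = points[i]
        x2, y2, z2 = points[(i + 1) % n]
        nx += (y1 - y2) * (z1 + z2)
        ny += (z1 - z2) * (x1 + x2)
        nz += (x1 - x2) * (y1 + y2)
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    if length == 0.0:
        raise ValueError("degenerate polygon: fingertips are collinear or coincident")
    return (nx / length, ny / length, nz / length)


def tilt_from_vertical_deg(fingertips: Sequence[Vec3]) -> float:
    """Angle between the fingertip-pentagon normal and absolute vertical."""
    normal = polygon_normal(fingertips)
    # abs() makes the result independent of the winding order of the tips.
    cos_angle = abs(sum(a * b for a, b in zip(normal, UP)))
    return math.degrees(math.acos(min(1.0, cos_angle)))


def meets_verticality(fingertips: Sequence[Vec3], threshold_deg: float = 10.0) -> bool:
    return tilt_from_vertical_deg(fingertips) <= threshold_deg
```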

[0051] As another example, the ready-bloom gesture criteria may include a position of the plurality of hand features within a field of view of the user. This may additionally or alternatively include a gaze direction of the user. For example, the ready-bloom gesture criteria may include a gaze direction of the user being within a threshold distance of the plurality of hand features. In other words, if the user is looking at the hand while performing a gesture, it may be more likely that the user is deliberately performing a specific gesture. Thresholds and criteria for recognizing the ready-bloom gesture may be adjusted accordingly.
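One hedged way to evaluate a gaze-based criterion is to measure how far the centroid of the hand features lies from the gaze ray. The disclosure does not say how the threshold distance is computed; the point-to-ray distance, the centroid, the function names, and the 10 cm default below are all assumptions for illustration.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def distance_to_gaze_ray(point: Vec3, eye: Vec3, gaze_dir: Vec3) -> float:
    """Shortest distance from a point to the ray starting at the eye."""
    norm = math.sqrt(sum(c * c for c in gaze_dir))
    if norm == 0.0:
        raise ValueError("gaze direction must be non-zero")
    d = tuple(c / norm for c in gaze_dir)
    v = tuple(p - e for p, e in zip(point, eye))
    # Clamp at zero so points behind the eye measure distance to the eye itself.
    t = max(0.0, sum(a * b for a, b in zip(v, d)))
    closest = tuple(e + t * c for e, c in zip(eye, d))
    return math.dist(point, closest)


def centroid(features: Sequence[Vec3]) -> Vec3:
    n = len(features)
    return tuple(sum(f[i] for f in features) / n for i in range(3))


def user_is_looking_at_hand(features: Sequence[Vec3], eye: Vec3,
                            gaze_dir: Vec3, max_offset_m: float = 0.10) -> bool:
    return distance_to_gaze_ray(centroid(features), eye, gaze_dir) <= max_offset_m
```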

[0052] As another example, the ready-bloom gesture criteria may include a closeness of fingertip features together. For example, fingertip features may be used to determine a pentagon, and an area and/or circumference of the determined pentagon derived and compared to a threshold. In some examples, the closeness of finger joint features and/or finger segment features may be determined in addition to or as an alternative to fingertip features.
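For illustration only, the closeness criterion may be sketched by summing the edge lengths of the fingertip pentagon and comparing the result to a threshold. The sketch assumes the fingertips are supplied in order around the hand (for example thumb through little finger); the function names and the 12 cm default are hypothetical values, not values from the disclosure.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def pentagon_perimeter(fingertips: Sequence[Vec3]) -> float:
    """Sum of the five edge lengths joining consecutive fingertips."""
    if len(fingertips) != 5:
        raise ValueError("expected exactly five fingertip positions")
    return sum(math.dist(fingertips[i], fingertips[(i + 1) % 5]) for i in range(5))


def fingertips_are_close(fingertips: Sequence[Vec3],
                         max_perimeter_m: float = 0.12) -> bool:
    return pentagon_perimeter(fingertips) <= max_perimeter_m
```

An area-based variant would follow the same pattern with a different measurement function.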

[0053] For example, FIG. 6 shows a hand portion 600 of a virtual skeleton. Hand portion 600 includes wrist joints 605, finger joints 610, adjoining finger segments 615, adjoining finger tips 620, and palm portion 625. At 630, hand portion 600 is shown performing a ready-bloom gesture. Dashed line 635 indicates a closeness of fingertips 620. A verticality of finger joints 610 is indicated by arrow 640, generating an angle 645 with absolute vertical ray 650.

[0054] As another example, the ready-bloom gesture criteria may include one or more of a speed of the plurality of hand features being below a threshold and a steadiness of the plurality of hand features being above a threshold. By setting speed and steadiness criteria, the user is unable to trigger the menu while the hand remains in motion.

[0055] If the user’s hand arrives at a steady state, a number of data frames may be collected (e.g., 1.5 sec of data points in time of motion). The extracted data points may then be collected in an array, a curve may be fit to the extracted data points and smoothed, and it may then be determined whether the gesture was performed. The threshold for the steadiness of the plurality of hand features may be based at least in part on the speed of the plurality of hand features. For example, if the hand arrives at the gesture at a speed above a threshold, the user may need to wait longer to meet the steadiness criteria. These criteria may be adjusted in real time, providing a significant benefit in determining the user’s intent.
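A minimal sketch of a speed-dependent steadiness check is given below, assuming steadiness is judged by requiring a run of recent low-speed frames and that a fast arrival lengthens the required run. The curve fitting and smoothing steps are omitted, and every name and numeric default is a hypothetical placeholder rather than a value from the disclosure.

```python
from typing import Sequence


def required_steady_frames(arrival_speed_m_s: float,
                           base_frames: int = 5,
                           fast_arrival_m_s: float = 0.5,
                           extra_frames: int = 10) -> int:
    """More steady frames are required after a fast arrival."""
    if arrival_speed_m_s > fast_arrival_m_s:
        return base_frames + extra_frames
    return base_frames


def is_steady(recent_speeds_m_s: Sequence[float],
              arrival_speed_m_s: float,
              max_speed_m_s: float = 0.05) -> bool:
    """True when the most recent frames all stay under the speed limit."""
    needed = required_steady_frames(arrival_speed_m_s)
    window = recent_speeds_m_s[-needed:]
    return len(window) >= needed and all(s < max_speed_m_s for s in window)
```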

[0056] The ready-bloom gesture criteria may not necessarily include a time-based criterion for holding the ready-bloom gesture for a pre-determined time period before it is recognized. In this way, the user can speed up performance of the gesture with practice.

[0057] In some examples, the ready-bloom criteria may be evaluated by simple thresholding of each parameter. In other examples, fuzzy logic may be employed where certain parameters are weighted more than others. In other examples, an artificial neural network may be trained to assess gesture confidence based on one or more frames of feature data input.
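The difference between simple thresholding and a weighted evaluation may be illustrated with the sketch below. The weighted variant uses a linear membership function and a weighted average, which is only one simplified stand-in for fuzzy logic; the parameter names, limits, and weights are hypothetical.

```python
from typing import Dict


def passes_all_thresholds(params: Dict[str, float], limits: Dict[str, float]) -> bool:
    """Simple thresholding: every parameter must be within its limit."""
    return all(params[name] <= limit for name, limit in limits.items())


def membership(value: float, limit: float) -> float:
    """1.0 within the limit, falling linearly to 0.0 at twice the limit."""
    if value <= limit:
        return 1.0
    if value >= 2.0 * limit:
        return 0.0
    return 1.0 - (value - limit) / limit


def weighted_confidence(params: Dict[str, float], limits: Dict[str, float],
                        weights: Dict[str, float]) -> float:
    """Weighted average of per-parameter memberships, in [0, 1]."""
    total = sum(weights[name] for name in limits)
    score = sum(weights[name] * membership(params[name], limits[name]) for name in limits)
    return score / total


# Hypothetical frame: tilt slightly over its limit, other parameters within limits.
params = {"tilt_deg": 12.0, "perimeter_m": 0.08, "speed_m_s": 0.02}
limits = {"tilt_deg": 10.0, "perimeter_m": 0.12, "speed_m_s": 0.05}
weights = {"tilt_deg": 1.0, "perimeter_m": 2.0, "speed_m_s": 1.0}
```

With these values the hard thresholds reject the frame because of the tilt alone, whereas the weighted score remains high, so a confidence cutoff could still accept it.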

[0058] At 430, method 400 optionally includes providing feedback to the user indicating that the ready-bloom gesture has been performed. For example, feedback may be provided in the form of an audio cue, a haptic cue, a visual cue, etc.

[0059] In some examples, a ready-state affordance may be provided following the ready-bloom gesture in order to provide feedback to the user, indicating that the first stage of the gesture has been completed. For example, responsive to determining that the user has performed the ready-bloom gesture, a ready-state affordance may be displayed indicating that performing the bloom-out gesture will result in displaying the visual affordance. This may be used as feedback particularly when the user is learning the two-stage gesture. Based on user skill and/or preferences, the ready-state affordance may be reduced or eliminated over time. The gesture recognition engine may thus assess one or more parameters derived from the received hand tracking data and corresponding to hand gestures made before a ready-state affordance is displayed.

[0060] The ready-state affordance may be merely a visual cue or may provide another means for deploying the visual affordance. For example, the user may “touch” or manipulate the ready-state affordance with an off hand. The gesture recognition machine may receive hand tracking data and determine whether the user has interacted with the ready-state affordance with another hand (e.g., the non-gesture hand) of the user, and may display the visual affordance responsive to determining that the user has interacted with the ready-state affordance with another hand of the user. This interactive pathway may be in addition to or as an alternative to performing the bloom-out gesture.

[0061] In some examples, the ready-state affordance may be displayed progressively. For example, as the user approaches completing the ready-bloom gesture, an initial ready-state affordance may be displayed, indicating to the user that they are nearing the end of the first-stage of the two-stage gesture. At 700 of FIG. 7, hand 705 of a user is shown approaching the ready-bloom gesture, with fingers closed but not in a vertical state. An initial ready-state affordance 710 is displayed. At 720, hand 705 of the user has completed the ready-bloom gesture, and the full ready-state affordance 725 is displayed. This progression affords the user real-time feedback as to their gesture and what subsequent gestures may accomplish.

[0062] In examples, the ready-state affordance may include one or more user interface elements. For example, ready-state affordance 725 displays the current time 730 and the current battery state 735 of the head-mounted display. In this way, the user may check the time or other information and then dismiss the ready-state affordance without deploying the full visual affordance. This may be considered a preview mode, and the displayed user interface elements may be predetermined and/or selected based on user preferences.

[0063] In some examples, the recognition of the ready-bloom gesture may result in the display of multiple ready-state affordances. As shown at 740, hand 705 of the user has completed the ready-bloom gesture, and three different ready-state affordances (745, 750, 755) are displayed. Each affordance may be indicative of a different pathway and/or visual affordance that may be called. In some examples, the user may direct their gaze to one of the ready-state affordances while performing the bloom-out gesture to evoke that pathway. Additionally or alternatively, the user may point at one of the ready-state affordances with their off hand. Additional parameters may allow the user to nod, blink, speak, etc. while gazing at a ready-state affordance rather than performing the bloom-out gesture.
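
As a non-limiting illustration, gaze-based selection among several ready-state affordances might be sketched as below. The function name `select_by_gaze`, the affordance names, the coordinate convention, and the 10-degree tolerance are assumptions made for this sketch and are not taken from the disclosure.

```python
import math


def _unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / n for c in v)


def select_by_gaze(eye, gaze_dir, affordances, max_angle_deg=10.0):
    """Return the name of the affordance closest to the gaze ray, or None.

    eye          -- (x, y, z) eye position (assumed to come from an eye tracker)
    gaze_dir     -- (x, y, z) gaze direction, any non-zero length
    affordances  -- dict mapping a hypothetical name to an (x, y, z) position
    """
    g = _unit(gaze_dir)
    best_name, best_angle = None, max_angle_deg
    for name, position in affordances.items():
        to_target = _unit(tuple(p - e for p, e in zip(position, eye)))
        cos_a = max(-1.0, min(1.0, sum(a * b for a, b in zip(g, to_target))))
        angle = math.degrees(math.acos(cos_a))
        if angle <= best_angle:
            best_name, best_angle = name, angle
    return best_name
```

In such a sketch, the selected name would be read at the moment the bloom-out gesture (or a nod, blink, or spoken command) is recognized, and a result of `None` would indicate that no pathway is being gazed at.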

[0064] At 430, method 400 includes, at the gesture recognition machine, determining that the user has performed a bloom-out gesture based on one or more parameters derived from the received hand tracking data satisfying bloom-out gesture criteria, the bloom-out gesture criteria being satisfiable only from the performed ready-bloom gesture. In other words, if the gesture-recognition machine recognizes that the ready-bloom gesture is performed, the performance of the bloom-out gesture can be evaluated. If not, then the second stage of the two-stage gesture (e.g., the bloom-out gesture) will not be determined to be performed. The same gesture recognition machine may be used as for the ready-bloom gesture. However, if the ready-bloom gesture is not determined to be performed, the gesture recognition machine may not even evaluate hand movement parameters against the bloom-out gesture criteria. In examples wherein a ready-state affordance is provided, the gesture recognition machine may assess one or more parameters derived from the received hand tracking data and corresponding to hand gestures made while the ready-state affordance is displayed.
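
As a non-limiting illustration, the gating described above might be sketched as a small state machine in which the bloom-out criteria are consulted only after the ready-bloom gesture has been recognized. The class, stage names, and predicate arguments are assumptions made for this sketch. The criteria themselves are passed in as callables.

```python
from enum import Enum, auto


class Stage(Enum):
    IDLE = auto()      # no part of the two-stage gesture recognized yet
    READY = auto()     # ready-bloom recognized; bloom-out may now be evaluated
    DEPLOYED = auto()  # bloom-out recognized; the visual affordance is displayed


class TwoStageRecognizer:
    """Hypothetical gate: bloom-out criteria are evaluated only in READY."""

    def __init__(self, is_ready_bloom, is_bloom_out):
        self.is_ready_bloom = is_ready_bloom  # callable(params) -> bool
        self.is_bloom_out = is_bloom_out      # callable(params) -> bool
        self.stage = Stage.IDLE
        self.bloom_out_checks = 0  # counts evaluations, for illustration only

    def update(self, params):
        """Consume parameters derived from one frame and return the stage."""
        if self.stage is Stage.IDLE:
            if self.is_ready_bloom(params):
                self.stage = Stage.READY
        elif self.stage is Stage.READY:
            self.bloom_out_checks += 1
            if self.is_bloom_out(params):
                self.stage = Stage.DEPLOYED
        return self.stage
```

With this structure, an open hand presented without a preceding ready-bloom gesture leaves the recognizer in `IDLE`, and the bloom-out predicate is never called. The sketch omits leaving the `READY` stage when the hand drops out of the ready-bloom conformation.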

[0065] The bloom-out gesture criteria may include a distance between all fingertip features being greater than a threshold. Further, the bloom-out gesture criteria may include the plurality of palm features facing upwards within a threshold of absolute vertical. As an example, at 660, FIG. 6 shows hand portion 600 performing a bloom-out gesture. Dashed line 665 indicates a distance between fingertips 620. A verticality of palm portion 625 is indicated by arrow 670, generating an angle 675 with absolute vertical ray 680.
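
As a non-limiting illustration, the two bloom-out criteria described above might be evaluated as follows. The y-up world axis, the use of a single palm normal vector in place of the plurality of palm features, and both threshold values are assumptions made for this sketch.

```python
import math
from itertools import combinations

UP = (0.0, 1.0, 0.0)  # assumed "absolute vertical" axis (y-up world frame)


def angle_from_vertical(vector):
    """Angle in degrees between a vector and the assumed vertical axis."""
    norm = math.sqrt(sum(c * c for c in vector))
    if norm == 0.0:
        raise ValueError("zero-length vector")
    cos_angle = sum(v * u for v, u in zip(vector, UP)) / norm
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def satisfies_bloom_out(fingertips, palm_normal,
                        min_spread_m=0.04, max_tilt_deg=30.0):
    """Check fingertip spread and palm orientation.

    fingertips   -- list of (x, y, z) fingertip positions in meters
    palm_normal  -- vector pointing out of the palm
    Thresholds are illustrative placeholders, not values from the disclosure.
    """
    # Every pair of fingertips must be farther apart than the threshold.
    spread_ok = all(math.dist(a, b) > min_spread_m
                    for a, b in combinations(fingertips, 2))
    # The palm must face upward within the tilt threshold.
    palm_ok = angle_from_vertical(palm_normal) <= max_tilt_deg
    return spread_ok and palm_ok
```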

[0066] The speed and/or steadiness of the hand features may also be used as criteria. For example, one criterion may specify that the plurality of hand features must be within a threshold distance from where the plurality of hand features were when the ready-bloom gesture was recognized. In some examples, a criterion may specify that the duration of the transition between the ready-bloom state and the bloom-out state must be below a threshold.
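
As a non-limiting illustration, the drift and duration criteria described above might be checked as below. Summarizing the plurality of hand features by their centroid is a simplification chosen for this sketch, and the function names and threshold values are likewise assumptions.

```python
import math


def centroid(points):
    """Mean position of a non-empty list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def transition_ok(ready_features, ready_time_s, out_features, out_time_s,
                  max_drift_m=0.10, max_duration_s=1.0):
    """Check that the hand stayed near where ready-bloom was recognized
    and that the ready-bloom to bloom-out transition was fast enough."""
    drift = math.dist(centroid(ready_features), centroid(out_features))
    duration = out_time_s - ready_time_s
    return drift <= max_drift_m and 0.0 <= duration < max_duration_s
```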

[0067] The ready-bloom gesture criteria and the bloom-out gesture criteria optionally may be user-specific. In this way, the criteria may be built for a specific user, rather than a fixed set of criteria, thereby acknowledging that different users perform the gestures slightly differently. User specificity may be trained in a calibration phase where the user performs various gestures and this test data is used to train an artificial neural network, for example. Further, physical differences between the hands of different users can be accounted for. For example, a user missing a finger or having a syndactyly would necessitate different criteria than for a user with five independent fingers. User-specific criteria and parameters may be stored in preferences for the user. When the user signs in, the preferences may be retrieved.
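
As a non-limiting illustration, user-specific criteria might be stored in and retrieved from user preferences as sketched below. The JSON layout, the key `gesture_criteria`, the field names, and the default values are all assumptions made for this sketch.

```python
import json
from dataclasses import dataclass, asdict, fields


@dataclass(frozen=True)
class GestureCriteria:
    """Illustrative per-user thresholds; defaults are placeholders."""
    ready_max_fingertip_spread_m: float = 0.03
    bloom_out_min_fingertip_spread_m: float = 0.04
    max_palm_tilt_deg: float = 30.0
    tracked_fingers: int = 5  # e.g., fewer for a user missing a finger


def load_criteria(preferences_json):
    """Overlay stored user-specific values on the defaults.

    Unknown keys are ignored so that older or newer preference files
    do not cause a failure at sign-in.
    """
    stored = json.loads(preferences_json).get("gesture_criteria", {})
    known = {f.name for f in fields(GestureCriteria)}
    return GestureCriteria(**{k: v for k, v in stored.items() if k in known})


def save_criteria(criteria):
    """Serialize criteria into the assumed preferences layout."""
    return json.dumps({"gesture_criteria": asdict(criteria)})
```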

[0068] At 440, method 400 includes displaying a visual affordance responsive to determining that the user has performed the bloom-out gesture. As an example, the visual affordance may include a menu or other holographic user interface with which the user may interact. Method 400 enables the computing machine to recognize the first part of a gesture (e.g., ready-bloom) if compliant with determined parameters, then if the second part of the gesture is recognized (e.g., bloom-out) a visual affordance may be deployed. The visual affordance may be positioned based on the position of the hand of the user. In this way, the user controls the placement of the visual affordance before deploying the visual affordance, and can maintain the visual affordance within the user’s FOV. Once deployed, the user may reposition or rescale the visual affordance using one or more specified gestures.

[0069] For example, at 800, FIG. 8 shows a hand 805 of a user in a bloom-out gesture conformation, evoking visual affordance 810. In this example visual affordance 810 is depicted as a menu. However, in other examples, the visual affordance may be a visual keyboard, number pad, dial, switch, virtual mouse, joystick, or any other visual input mechanism that allows the user to input commands. While the two-stage bloom gesture sequence may be performed with one hand, the visual affordance may be manipulated with either the gesture hand or the off-hand of the user. At 820, FIG. 8 depicts a user manipulating visual affordance 810 with both gesture (right) hand 805 and off (left) hand 825.

[0070] Optionally, at 450, method 400 further includes closing the visual affordance responsive to determining the hand of the user has returned to the ready-bloom gesture conformation from the bloom-out gesture conformation based on one or more parameters derived from the received hand tracking data satisfying bloom-in gesture criteria. The criteria for performing this “bloom-in” gesture may include one or more of the criteria for performing the ready-bloom gesture, in addition to the criteria that the gesture must begin from the bloom-out gesture conformation. This allows the user to preview the visual affordance without fully deploying the visual affordance.

[0071] For example, at 830, FIG. 8 shows the user performing the bloom-in gesture, thereby minimizing the appearance of visual affordance 810. By completing the bloom-in gesture, the visual affordance may be removed from the display entirely, as shown at 840. Other user input such as voice input or other gestures may additionally or alternatively be used to remove the visual affordance.

[0072] In some examples, the gesture recognition machine will only recognize each gesture if the user’s hand is within the user’s FOV, be it the user’s actual hand or a representation of the user’s hand (e.g., VR). However, in some examples, the gesture recognition machine will recognize gestures whenever the user’s hand is within the FOV of the imaging devices used for input. This may allow for blind users to provide input to the NUI system. Rather than visual affordances, the user may be cued through the use of haptic and/or audio feedback. Further, rather than evoking a visual menu or other visual affordance, the system may enter a state where the user is enabled to issue specific voice or gesture commands, or where specific voice or gesture commands are assigned to particular responses, such as when a particular gesture is used for a different purpose within an application.

[0073] In some examples, two or more sets of hand tracking parameters may be invoked responsive to entering the ready state. For example, a user performing the ready-bloom gesture may be able to invoke multiple different UI responses using different secondary gestures. As one example, the user flipping their hand sideways in the ready-bloom conformation may trigger a different pathway than if the user performs the bloom-out gesture. Additionally or alternatively, the user interface may exit the “ready” state responsive to identifying a gesture that is not the second stage of the two-stage gesture. In other words, the gesture recognition machine may no longer invoke the bloom-out gesture criteria if it is determined that the user performed another gesture from the ready-bloom conformation.
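
As a non-limiting illustration, dispatching from the ready state to one of several secondary gestures, or exiting the ready state, might be sketched as follows. The function name, the returned tuples, and the example parameter keys are assumptions made for this sketch.

```python
def resolve_ready_state(params, pathways, exit_predicates=()):
    """Decide what to do with one frame of parameters while in the ready state.

    pathways         -- ordered list of (name, predicate) secondary gestures
    exit_predicates  -- predicates for gestures that leave the ready state
    Returns ("invoke", name), ("exit", None), or ("wait", None).
    """
    for name, matches in pathways:
        if matches(params):
            return ("invoke", name)
    if any(matches(params) for matches in exit_predicates):
        # After an exit, the bloom-out criteria would no longer be invoked.
        return ("exit", None)
    return ("wait", None)


# Hypothetical usage: bloom-out and a sideways flip select different pathways.
example_pathways = [
    ("bloom_out", lambda p: p["spread"] > 0.10),
    ("side_flip", lambda p: abs(p["roll_deg"]) > 60.0),
]
example_exits = [lambda p: p["fist"]]
```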

[0074] The methods and processes described herein may be tied to a computing system of one or more computing devices. In particular, such methods and processes may be implemented as an executable computer-application program, a network-accessible computing service, an application-programming interface (API), a library, or a combination of the above and/or other compute resources.

[0075] FIG. 9 schematically shows a simplified representation of a computing system 900 configured to provide any to all of the compute functionality described herein. Computing system 900 may take the form of one or more virtual/augmented/mixed reality computing devices, personal computers, network-accessible server computers, tablet computers, home-entertainment computers, gaming devices, mobile computing devices, mobile communication devices (e.g., smart phone), wearable computing devices, Internet of Things (IoT) devices, embedded computing devices, and/or other computing devices.

[0076] Computing system 900 includes a logic subsystem 902 and a storage subsystem 904. Computing system 900 may optionally include a display subsystem 906, input subsystem 908, communication subsystem 910, and/or other subsystems not shown in FIG. 9.

[0077] Logic subsystem 902 includes one or more physical devices configured to execute instructions. For example, the logic subsystem may be configured to execute instructions that are part of one or more applications, services, or other logical constructs. The logic subsystem may include one or more hardware processors configured to execute software instructions. Additionally or alternatively, the logic subsystem may include one or more hardware or firmware devices configured to execute hardware or firmware instructions. Processors of the logic subsystem may be single-core or multi-core, and the instructions executed thereon may be configured for sequential, parallel, and/or distributed processing. Individual components of the logic subsystem optionally may be distributed among two or more separate devices, which may be remotely located and/or configured for coordinated processing. Aspects of the logic subsystem may be virtualized and executed by remotely-accessible, networked computing devices configured in a cloud-computing configuration.

[0078] Storage subsystem 904 includes one or more physical devices configured to temporarily and/or permanently hold computer information such as data and instructions executable by the logic subsystem. When the storage subsystem includes two or more devices, the devices may be collocated and/or remotely located. Storage subsystem 904 may include volatile, nonvolatile, dynamic, static, read/write, read-only, random-access, sequential-access, location-addressable, file-addressable, and/or content-addressable devices. Storage subsystem 904 may include removable and/or built-in devices. When the logic subsystem executes instructions, the state of storage subsystem 904 may be transformed–e.g., to hold different data.

[0079] Aspects of logic subsystem 902 and storage subsystem 904 may be integrated together into one or more hardware-logic components. Such hardware-logic components may include program- and application-specific integrated circuits (PASIC/ASICs), program- and application-specific standard products (PSSP/ASSPs), system-on-a-chip (SOC), and complex programmable logic devices (CPLDs), for example.

[0080] The logic subsystem and the storage subsystem may cooperate to instantiate one or more logic machines. As used herein, the term “machine” is used to collectively refer to the combination of hardware, firmware, software, instructions, and/or any other components cooperating to provide computer functionality. In other words, “machines” are never abstract ideas and always have a tangible form. A machine may be instantiated by a single computing device, or a machine may include two or more sub-components instantiated by two or more different computing devices. In some implementations a machine includes a local component (e.g., software application executed by a computer processor) cooperating with a remote component (e.g., cloud computing service provided by a network of server computers). The software and/or other instructions that give a particular machine its functionality may optionally be saved as one or more unexecuted modules on one or more suitable storage devices.

[0081] Machines may be implemented using any suitable combination of state-of-the-art and/or future machine learning (ML), artificial intelligence (AI), and/or natural language processing (NLP) techniques. Non-limiting examples of techniques that may be incorporated in an implementation of one or more machines include support vector machines, multi-layer neural networks, convolutional neural networks (e.g., including spatial convolutional networks for processing images and/or videos, temporal convolutional neural networks for processing audio signals and/or natural language sentences, and/or any other suitable convolutional neural networks configured to convolve and pool features across one or more temporal and/or spatial dimensions), recurrent neural networks (e.g., long short-term memory networks), associative memories (e.g., lookup tables, hash tables, Bloom Filters, Neural Turing Machine and/or Neural Random Access Memory), word embedding models (e.g., GloVe or Word2Vec), unsupervised spatial and/or clustering methods (e.g., nearest neighbor algorithms, topological data analysis, and/or k-means clustering), graphical models (e.g., (hidden) Markov models, Markov random fields, (hidden) conditional random fields, and/or AI knowledge bases), and/or natural language processing techniques (e.g., tokenization, stemming, constituency and/or dependency parsing, and/or intent recognition, segmental models, and/or super-segmental models (e.g., hidden dynamic models)).

[0082] In some examples, the methods and processes described herein may be implemented using one or more differentiable functions, wherein a gradient of the differentiable functions may be calculated and/or estimated with regard to inputs and/or outputs of the differentiable functions (e.g., with regard to training data, and/or with regard to an objective function). Such methods and processes may be at least partially determined by a set of trainable parameters. Accordingly, the trainable parameters for a particular method or process may be adjusted through any suitable training procedure, in order to continually improve functioning of the method or process.

[0083] Non-limiting examples of training procedures for adjusting trainable parameters include supervised training (e.g., using gradient descent or any other suitable optimization method), zero-shot, few-shot, unsupervised learning methods (e.g., classification based on classes derived from unsupervised clustering methods), reinforcement learning (e.g., deep Q learning based on feedback) and/or generative adversarial neural network training methods, belief propagation, RANSAC (random sample consensus), contextual bandit methods, maximum likelihood methods, and/or expectation maximization. In some examples, a plurality of methods, processes, and/or components of systems described herein may be trained simultaneously with regard to an objective function measuring performance of collective functioning of the plurality of components (e.g., with regard to reinforcement feedback and/or with regard to labelled training data). Simultaneously training the plurality of methods, processes, and/or components may improve such collective functioning. In some examples, one or more methods, processes, and/or components may be trained independently of other components (e.g., offline training on historical data).

[0084] When included, display subsystem 906 may be used to present a visual representation of data held by storage subsystem 904. This visual representation may take the form of a graphical user interface (GUI) including holographic virtual objects. Display subsystem 906 may include one or more display devices utilizing virtually any type of technology. In some implementations, display subsystem 906 may include one or more virtual-, augmented-, or mixed reality displays.

[0085] When included, input subsystem 908 may comprise or interface with one or more input devices. An input device may include a sensor device or a user input device. Examples of user input devices include a keyboard, mouse, touch screen, or game controller. In some embodiments, the input subsystem may comprise or interface with selected natural user input (NUI) componentry. Such componentry may be integrated or peripheral, and the transduction and/or processing of input actions may be handled on- or off-board. Example NUI componentry may include a microphone for speech and/or voice recognition; an infrared, color, stereoscopic, and/or depth camera for machine vision and/or gesture recognition; a head tracker, eye tracker, accelerometer, and/or gyroscope for motion detection and/or intent recognition.

[0086] When included, communication subsystem 910 may be configured to communicatively couple computing system 900 with one or more other computing devices. Communication subsystem 910 may include wired and/or wireless communication devices compatible with one or more different communication protocols. The communication subsystem may be configured for communication via personal-, local- and/or wide-area networks.

[0087] The methods and processes disclosed herein may be configured to give users and/or any other humans control over any private and/or potentially sensitive data. Whenever data is stored, accessed, and/or processed, the data may be handled in accordance with privacy and/or security standards. When user data is collected, users or other stakeholders may designate how the data is to be used and/or stored. Whenever user data is collected for any purpose, the user data should only be collected with the utmost respect for user privacy (e.g., user data may be collected only when the user owning the data provides affirmative consent, and/or the user owning the data may be notified whenever the user data is collected). If the data is to be released for access by anyone other than the user or used for any decision-making process, the user’s consent may be collected before using and/or releasing the data. Users may opt-in and/or opt-out of data collection at any time. After data has been collected, users may issue a command to delete the data, and/or restrict access to the data. All potentially sensitive data optionally may be encrypted and/or, when feasible anonymized, to further protect user privacy. Users may designate portions of data, metadata, or statistics/results of processing data for release to other parties, e.g., for further processing. Data that is private and/or confidential may be kept completely private, e.g., only decrypted temporarily for processing, or only decrypted for processing on a user device and otherwise stored in encrypted form. Users may hold and control encryption keys for the encrypted data. Alternately or additionally, users may designate a trusted third party to hold and control encryption keys for the encrypted data, e.g., so as to provide access to the data to the user according to a suitable authentication protocol.

[0088] When the methods and processes described herein incorporate ML and/or AI components, the ML and/or AI components may make decisions based at least partially on training of the components with regard to training data. Accordingly, the ML and/or AI components can and should be trained on diverse, representative datasets that include sufficient relevant data for diverse users and/or populations of users. In particular, training data sets should be inclusive with regard to different human individuals and groups, so that as ML and/or AI components are trained, their performance is improved with regard to the user experience of the users and/or populations of users.

[0089] ML and/or AI components may additionally be trained to make decisions so as to minimize potential bias towards human individuals and/or groups. For example, when AI systems are used to assess any qualitative and/or quantitative information about human individuals or groups, they may be trained so as to be invariant to differences between the individuals or groups that are not intended to be measured by the qualitative and/or quantitative assessment, e.g., so that any decisions are not influenced in an unintended fashion by differences among individuals and groups.

[0090] ML and/or AI components may be designed to provide context as to how they operate, so that implementers of ML and/or AI systems can be accountable for decisions/assessments made by the systems. For example, ML and/or AI systems may be configured for replicable behavior, e.g., when they make pseudo-random decisions, random seeds may be used and recorded to enable replicating the decisions later. As another example, data used for training and/or testing ML and/or AI systems may be curated and maintained to facilitate future investigation of the behavior of the ML and/or AI systems with regard to the data. Furthermore, ML and/or AI systems may be continually monitored to identify potential bias, errors, and/or unintended outcomes.
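
As a non-limiting illustration, recording a random seed so that a pseudo-random decision can be replicated later might be sketched as below. The function names and the record layout are assumptions made for this sketch.

```python
import random


def seeded_choice(options, seed):
    """Make a pseudo-random choice and return it together with its seed."""
    rng = random.Random(seed)  # private generator; global state is untouched
    return {"seed": seed, "choice": rng.choice(options)}


def replay(options, record):
    """Re-run the decision from the recorded seed and confirm it matches."""
    return random.Random(record["seed"]).choice(options) == record["choice"]
```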

[0091] This disclosure is presented by way of example and with reference to the associated drawing figures. Components, process steps, and other elements that may be substantially the same in one or more of the figures are identified coordinately and are described with minimal repetition. It will be noted, however, that elements identified coordinately may also differ to some degree. It will be further noted that some figures may be schematic and not drawn to scale. The various drawing scales, aspect ratios, and numbers of components shown in the figures may be purposely distorted to make certain features or relationships easier to see.

[0092] In one example, a method for two-stage hand gesture input, comprises receiving hand tracking data for a hand of a user; at a gesture recognition machine, determining that the user has performed a ready-bloom gesture based on one or more parameters derived from the received hand tracking data satisfying ready-bloom gesture criteria; at the gesture recognition machine, determining that the user has performed a bloom-out gesture based on one or more parameters derived from the received hand tracking data satisfying bloom-out gesture criteria, the bloom-out gesture criteria being satisfiable only from the performed ready-bloom gesture; and displaying a visual affordance responsive to determining that the user has performed the bloom-out gesture. In such an example, or any other example, the received hand tracking data may additionally or alternatively include a feature position for each of a plurality of different hand features at each of a plurality of different frames. In any of the preceding examples, or any other example, the plurality of different hand features may additionally or alternatively include a plurality of finger features, and wherein the ready-bloom gesture criteria include a verticality of all finger features being within a threshold of absolute vertical. In any of the preceding examples, or any other example, the plurality of different hand features may additionally or alternatively include a plurality of fingertip features, and wherein the ready-bloom gesture criteria include a distance between all fingertip features being below a threshold. In any of the preceding examples, or any other example, the bloom-out gesture criteria may additionally or alternatively include a distance between all fingertip features being greater than a threshold. 
In any of the preceding examples, or any other example, the ready-bloom gesture criteria may additionally or alternatively include a position of the plurality of different hand features within a field of view of the user. In any of the preceding examples, or any other example, the ready-bloom gesture criteria may additionally or alternatively include one or more of a speed of the plurality of different hand features being below a threshold and a steadiness of the plurality of different hand features being above a threshold. In any of the preceding examples, or any other example, the threshold for the steadiness of the plurality of different hand features may additionally or alternatively be based at least in part on the speed of the plurality of different hand features. In any of the preceding examples, or any other example, the ready-bloom gesture criteria may additionally or alternatively include a gaze direction of the user being within a threshold distance of the plurality of different hand features. In any of the preceding examples, or any other example, the plurality of different hand features may additionally or alternatively include a plurality of palm features, and wherein the bloom-out gesture criteria include the plurality of palm features facing upwards within a threshold of absolute vertical. In any of the preceding examples, or any other example, the method may additionally or alternatively comprise closing the visual affordance responsive to determining the hand of the user has returned to a ready-bloom gesture conformation from a bloom-out gesture conformation based on one or more parameters derived from the received hand tracking data satisfying bloom-in gesture criteria. In any of the preceding examples, or any other example, the method may additionally or alternatively comprise responsive to determining that the user has performed the ready-bloom gesture, providing feedback to the user indicating that the ready-bloom gesture has been performed.
In any of the preceding examples, or any other example, providing feedback to the user may additionally or alternatively include displaying a ready-state affordance, and the method may additionally or alternatively comprise: determining, based on the received hand tracking data, whether the user has interacted with the ready-state affordance with another hand of the user; and displaying the visual affordance responsive to determining that the user has interacted with the ready-state affordance with another hand of the user. In any of the preceding examples, or any other example, the ready-state affordance may additionally or alternatively include one or more user interface elements. In any of the preceding examples, or any other example, the gesture recognition machine may additionally or alternatively include an artificial neural network previously trained to recognize the plurality of different hand features. In any of the preceding examples, or any other example, receiving hand tracking data for the first hand of the user may additionally or alternatively include: receiving depth data for an environment; fitting a virtual skeleton to point clouds of the received depth data; assigning hand joints to the virtual skeleton based at least in part on image data of the user performing the ready-bloom gesture and the bloom-out gesture; and tracking positions of the assigned hand joints across sequential depth images. In any of the preceding examples, or any other example, the ready-bloom gesture criteria and the bloom-out gesture criteria may additionally or alternatively be user-specific and may additionally or alternatively be stored in preferences for the user.
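
As a non-limiting illustration, the ready-bloom criteria recited above (finger verticality within a threshold of absolute vertical, and the distance between all fingertip features being below a threshold) might be evaluated as follows. Representing each finger by a base point and a tip point, the y-up axis, and both threshold values are assumptions made for this sketch.

```python
import math
from itertools import combinations


def finger_tilt_deg(base, tip):
    """Angle in degrees between the base-to-tip direction and vertical (y-up)."""
    direction = tuple(t - b for t, b in zip(tip, base))
    length = math.sqrt(sum(c * c for c in direction))
    if length == 0.0:
        raise ValueError("finger base and tip coincide")
    cos_angle = max(-1.0, min(1.0, direction[1] / length))
    return math.degrees(math.acos(cos_angle))


def satisfies_ready_bloom(fingers, max_tilt_deg=25.0, max_spread_m=0.03):
    """fingers is a list of (base, tip) pairs of (x, y, z) points in meters.

    Thresholds are illustrative placeholders, not values from the disclosure.
    """
    # All fingers must point upward within the tilt threshold.
    tilt_ok = all(finger_tilt_deg(b, t) <= max_tilt_deg for b, t in fingers)
    # All pairs of fingertips must be closer together than the threshold.
    tips = [t for _, t in fingers]
    spread_ok = all(math.dist(a, b) < max_spread_m
                    for a, b in combinations(tips, 2))
    return tilt_ok and spread_ok
```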

[0093] In another example, a system for a head-mounted display comprises one or more outward-facing image sensors; a gesture recognition machine configured to: receive hand tracking data for a hand of a user via the one or more outward-facing image sensors; determine that the user has performed a ready-bloom gesture based on one or more parameters derived from the received hand tracking data satisfying ready-bloom gesture criteria; and determine that the user has performed a bloom-out gesture based on one or more parameters derived from the received hand tracking data satisfying bloom-out gesture criteria, the bloom-out gesture criteria being satisfiable only from the performed ready-bloom gesture; and a display device configured to display a visual affordance responsive to determining that the user has performed the bloom-out gesture. In such an example, or any other example, the display device may be additionally or alternatively configured to, responsive to determining that the user has performed the ready-bloom gesture, display a ready-state affordance indicating that performing the bloom-out gesture will result in displaying the visual affordance.

[0094] In yet another example, a method for two-stage hand gesture input comprises: receiving hand tracking data for a hand of a user; at a gesture recognition machine, assessing one or more parameters derived from the received hand tracking data and corresponding to hand gestures made before a ready-state affordance is displayed; displaying the ready-state affordance responsive to the one or more parameters satisfying ready-bloom gesture criteria; at the gesture recognition machine, assessing one or more parameters derived from the received hand tracking data and corresponding to hand gestures made while the ready-state affordance is displayed; and displaying a visual input mechanism responsive to the one or more parameters satisfying bloom-out gesture criteria while the ready-state affordance is displayed.

[0095] It will be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Likewise, the order of the above-described processes may be changed.

[0096] The subject matter of the present disclosure includes all novel and non-obvious combinations and sub-combinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.

Article link: https://patent.nweon.com/13105
