Patent: Intent detection with a computing device
Publication Number: 20200409469
Publication Date: 2020-12-31
Applicant: Google
Abstract
A method can include capturing an image, determining an environment in which a user is operating a computing device, detecting a hand gesture based on an object in the image, determining, using a machine learned model, an intent of the user based on the hand gesture and the environment, and executing a task based at least on the determined intent.
Claims
-
A method, comprising: capturing an image; determining an environment in which a user is operating a computing device; detecting a hand gesture based on an object in the image; determining, using a machine learned model, an intent of the user based on the hand gesture and the environment; and executing a task based at least on the determined intent.
-
The method of claim 1, wherein determining the intent of the user further includes: translating an interaction of the user with a real-world environment, and using the interaction and the hand gesture to determine the intent of the user.
-
The method of claim 1, wherein the machine learned model is based on a computer vision model.
-
The method of claim 1, wherein a first machine learned model and a second machine learned model are used to determine the intent of the user, the method further comprising: continuous tracking of a hand associated with the hand gesture using the second machine learned model.
-
The method of claim 1, wherein the image is captured using a single non-depth sensing camera of a computing device.
-
The method of claim 1, wherein the task is based on use of a computer assistant.
-
The method of claim 1, wherein the task includes at least one of a visual and audible output.
-
The method of claim 1, wherein the machine learned model is trained using a plurality of images including at least one hand gesture, the machine learned model is trained using a plurality of ground-truth images of hand gestures, a loss function is used to confirm a match between a hand gesture and a ground-truth image of a hand gesture, and the detecting of the hand gesture based on the object in the image includes matching the object to the hand gesture matched to the ground-truth image of the hand gesture.
-
The method of claim 1, wherein the machine learned model is trained using a plurality of images each including at least one object, and the at least one object has an associated ground-truth box.
-
The method of claim 1, wherein the machine learned model generates a plurality of bounding boxes, the machine learned model determines a plurality of features based on at least a portion of an object within a bounding box, the machine learned model identifies the object based on the plurality of features, and the intent of the user is determined based on the identified object.
-
A system comprising: a memory storing a set of instructions; and a processor configured to execute the set of instructions to cause the system to: capture an image; determine an environment in which a user is operating a computing device; detect a hand gesture based on an object in the image; determine, using a machine learned model, an intent of the user based on the hand gesture and the environment; and execute a task based at least on the determined intent.
-
The system of claim 11, wherein determining the intent of the user further includes: translating an interaction of the user with a real-world environment, and using the interaction and the hand gesture to determine the intent of the user.
-
The system of claim 11, wherein the machine learned model is based on a computer vision model.
-
The system of claim 11, wherein a first machine learned model and a second machine learned model are used to determine the intent of the user, and the set of instructions is executed by the processor to further cause the system to: continuously track the hand using the second machine learned model.
-
The system of claim 11, wherein the image is captured using a single non-depth sensing camera of a computing device.
-
The system of claim 11, wherein the task is executed using a computer assistant.
-
The system of claim 11, wherein the task includes at least one of a visual and audible output.
-
The system of claim 11, wherein the machine learned model generates a plurality of bounding boxes, the machine learned model determines a plurality of features based on at least a portion of an object within a bounding box, the machine learned model identifies the object based on the plurality of features, and the intent of the user is determined based on the identified object.
-
The system of claim 11, wherein the machine learned model is trained using a plurality of images including at least one hand gesture, the machine learned model is trained using a plurality of ground-truth images of hand gestures, a loss function is used to confirm a match between a hand gesture and a ground-truth image of a hand gesture, and the detecting of the hand gesture based on the object in the image includes matching the object to the hand gesture matched to the ground-truth image of the hand gesture.
-
A non-transitory computer readable storage medium containing instructions that when executed by a processor of a computer system cause the processor to perform steps comprising: capturing an image; determining an environment in which a user is operating a computing device; detecting a hand gesture based on an object in the image; determining, using a machine learned model, an intent of the user based on the hand gesture and the environment; and executing a task based at least on the determined intent.
Description
RELATED APPLICATION
[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 62/867,389, filed on Jun. 27, 2019, entitled “AUGMENTED REALITY MOUSE TO DETECT INTENT”, the contents of which are incorporated in their entirety herein by reference.
FIELD
[0002] Embodiments relate to detecting an intention of a user of a computing device based on a presentation of an object (e.g., a hand, a book, an item for sale, and/or the like) as captured by a camera of the computing device.
BACKGROUND
[0003] Pointing devices in computing are used to control or activate certain elements in a user interface. On a computer, this can be achieved by using a separate controller, for example, a mouse, which can be moved on a flat surface; the movement of the mouse is translated to movement of a pointer/cursor on the computer’s screen. In addition, the mouse may have buttons for clicking and scrolling, which can enable various types of tasks, e.g., opening an application, selecting an application, scrolling down, etc. However, with the evolution of smartphones, tablets, etc., touchscreens are generally used and a finger, for example, can replace the physical controller. User actions such as tap, scroll, swipe, pinch and long press have become common patterns of interaction with smartphones, tablets, etc.
SUMMARY
[0004] In a general aspect, a device, a system, a non-transitory computer-readable medium (having stored thereon computer executable program code which can be executed on a computer system), and/or a method can perform a process including capturing an image, determining an environment in which a user is operating a computing device, detecting a hand gesture based on an object in the image, determining, using a machine learned model, an intent of the user based on the hand gesture and the environment, and executing a task based at least on the determined intent.
[0005] The system can include a memory storing a set of instructions, and a processor configured to execute the set of instructions to cause the system to capture an image, determine an environment in which a user is operating a computing device, detect a hand gesture based on an object in the image, determine, using a machine learned model, an intent of the user based on the hand gesture and the environment, and execute a task based at least on the determined intent.
[0006] Implementations can include one or more of the following features. For example, determining the intent of the user can further include translating an interaction of the user with a real-world environment, and using the interaction and the hand gesture to determine the intent of the user. The machine learned model can be based on a computer vision model. A first machine learned model and a second machine learned model can be used to determine the intent of the user. The method can further include continuous tracking of a hand associated with the hand gesture using the second machine learned model. The image can be captured using a single non-depth sensing camera of a computing device. The task can be based on use of a computer assistant. The task can include at least one of a visual and audible output. The machine learned model can be trained using a plurality of images including at least one hand gesture and a plurality of ground-truth images of hand gestures, a loss function can be used to confirm a match between a hand gesture and a ground-truth image of a hand gesture, and the detecting of the hand gesture based on the object in the image can include matching the object to the hand gesture matched to the ground-truth image of the hand gesture. The machine learned model can be trained using a plurality of images each including at least one object, and the at least one object can have an associated ground-truth box. The machine learned model can generate a plurality of bounding boxes, the machine learned model can determine a plurality of features based on at least a portion of an object within a bounding box, the machine learned model can identify the object based on the plurality of features, and the intent of the user can be determined based on the identified object.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] Example embodiments will become more fully understood from the detailed description given herein below and the accompanying drawings, wherein like elements are represented by like reference numerals, which are given by way of illustration only and thus are not limiting of the example embodiments and wherein:
[0008] FIG. 1 illustrates a flowchart of a method for detecting a user’s intent and executing a task based on the intent according to at least one example implementation.
[0009] FIG. 2 illustrates a trigger for detecting hand gestures according to at least one example implementation.
[0010] FIG. 3 illustrates determining intent based on disambiguation according to at least one example implementation.
[0011] FIG. 4 illustrates pointing gestures according to at least one example implementation.
[0012] FIG. 5 illustrates a block diagram of a signal flow according to at least one example implementation.
[0013] FIG. 6 illustrates a flowchart of a method according to at least one example implementation.
[0014] FIG. 7 illustrates a block diagram of a gesture processing system according to at least one example embodiment.
[0015] FIG. 8A illustrates layers in a convolutional neural network with no sparsity constraints.
[0016] FIG. 8B illustrates layers in a convolutional neural network with sparsity constraints.
[0017] FIG. 9 illustrates a block diagram of a model according to an example embodiment.
[0018] FIG. 10 illustrates a block diagram of a signal flow for a machine learning process according to an example embodiment.
[0019] FIGS. 11A and 11B illustrate a head-mounted display device according to at least one example embodiment.
[0020] FIG. 12 illustrates a wearable computing device according to at least one example embodiment.
[0021] FIGS. 13A, 13B, 13C, 14A and 14B illustrate reading assistant tasks, according to example embodiments.
[0022] FIG. 15 shows an example of a computer device and a mobile computer device according to at least one example embodiment.
[0023] It should be noted that these Figures are intended to illustrate the general characteristics of methods, structure and/or materials utilized in certain example embodiments and to supplement the written description provided below. These drawings are not, however, to scale and may not precisely reflect the structural or performance characteristics of any given embodiment, and should not be interpreted as defining or limiting the range of values or properties encompassed by example embodiments. For example, the relative thicknesses and positioning of molecules, layers, regions and/or structural elements may be reduced or exaggerated for clarity. The use of similar or identical reference numbers in the various drawings is intended to indicate the presence of a similar or identical element or feature.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0024] Some computing devices lack a screen (e.g., a display screen) and may rely on real-world interactions through the use of natural gestures (or gesture patterns) with fingers. A computing device that does not include a controller to interface with typical input devices (e.g., a mouse, a keyboard, and/or the like) can include a user interface configured to detect a user’s intent via computing device component(s) (e.g., a camera) that are atypical as user intent input devices. In some implementations, the user interface may detect the user’s intent based on natural gestures for perceptive computing devices and trigger a task (by the computing device) based on the detected user intent.
[0025] In an augmented reality (AR) application, objects to be identified can be limited to objects generated by the AR application. For example, if 200 objects are used by the AR application, only 200 detectors are needed to identify an object. By contrast, example implementations use trained ML models to identify any possible real-world object (e.g., hand pose, product, business card, and/or the like) to determine a user’s intent. Therefore, example detectors can be configured to detect and/or identify any real-world object and/or variations (e.g., pose) of the real-world object.
[0026] In some implementations, a computer vision model and/or a machine learned (ML) model can be used to determine the intent of a user (e.g., the user’s intent) from the user’s hand gestures (e.g., as captured by a camera of the device). Examples of such hand gestures may include pointing, clicking, scrolling, circling, pinch zooming, tapping, swiping, and/or the like. In some implementations, user intents that are expressed naturally through pointing gestures, e.g., capturing a full paragraph within a text document by circling the full paragraph, can be supported when used, for instance, on a one-person view device (e.g., a wearable smart device, a head-mount display, and/or the like).
[0027] The user interface may translate (e.g., transform, convert, etc.) the user’s interactions with the physical or digital world into a digital semantic understanding of the user’s intent. The user’s intent can be used to trigger tasks that apply to the physical or digital world. In some implementations, the user interface can support a procedure or mechanism for detecting hand gestures (e.g., a user holding a business card), determining the intent of the user (e.g., an intent to save the business card) based on the hand gesture and/or a verbal command (e.g., the user holding the business card while saying “save this business card”), and triggering a task based on the determined intent (e.g., saving the business card). In some example implementations, hand gestures can be used to query (e.g., instruct, inquire, and/or the like) a digital assistant for the definition of a word or the ingredients in a product, or to purchase an item being held in the user’s hands.
[0028] In some implementations, the user interface and/or mechanism(s) described above can be integrated into the operating system and/or system architecture of the computing device and can be used by other application (e.g., app) developers as a cursor or gesture input medium without the need for any physical input hardware (e.g., a mouse, a keyboard, and/or the like). In addition, the user interface and/or mechanism can be extended to interact with and manipulate the VR/AR world (e.g., using interactions that are not included as functions within the AR/VR application) via the semantic understanding of what the user may intend to achieve with pointing and other gestures, for example, highlighting text. In some implementations, the user interface can detect (or help detect) hands from a first-person view perspective in a pointing position together with a location of a pointer (e.g., the tip of a visible part of the index finger). In an example implementation, the user interface can be a deep neural network built on, for example, a convolutional neural network (CNN) architecture.
[0029] The methods described with regard to FIG. 1 can be performed due to the execution of software code stored in a memory (e.g., a non-transitory computer readable storage medium) associated with an apparatus and executed by at least one processor associated with the apparatus. However, alternative embodiments are contemplated such as a system embodied as a special purpose processor. The special purpose processor can be a graphics processing unit (GPU). In other words, the user interface can be implemented in a GPU of a one-person view device (e.g., a wearable smart device, a head-mount display, and/or the like).
[0030] A GPU can be a component of a graphics card. The graphics card can also include video memory, a random access memory digital-to-analogue converter (RAMDAC) and driver software. The video memory can be a frame buffer that stores digital data representing an image, a frame of a video, an object of an image, or a scene of a frame. A RAMDAC can be configured to read the contents of the video memory, convert the contents into an analogue RGB signal, and send the analogue signal to a display or monitor.
[0031] The driver software can be the software code stored in the memory referred to above. The software code can be configured to implement the method described herein. Although the methods described below are described as being executed by a processor and/or a special purpose processor, the methods are not necessarily executed by a same processor. In other words, at least one processor and/or at least one special purpose processor may execute the method described below with regard to FIG. 1.
[0032] FIG. 1 illustrates a flowchart of a method for detecting a user’s intent and triggering the execution of a task based on the intent according to at least one example implementation. As shown in FIG. 1, in step S110, a hand gesture is detected. For example, the computing device, including the user interface, can detect a user’s hand gesture using a camera of the computing device. The camera can be a non-depth sensing camera (e.g., a two-dimensional (2D) camera) and the user interface can detect hand gestures with just one camera (in contrast to other hand gesture detection techniques which may require multiple camera inputs). In an example implementation, the user interface can be configured to detect a user’s hand(s) in a pointing position from a first person perspective together with a location of the pointer (e.g., tip of a visible part of user’s index finger) based on a machine learned (ML) model that is trained using a diverse set of images (e.g., 1000 s of images).
[0033] In step S120, the user’s intent is determined based on, at least, the detected hand gesture. For example, the hand gesture can be the user pointing (e.g., using an index finger) at an object. In some implementations, the user interface can be configured to (e.g., using the ML model) determine the user’s intent. In some implementations, for example, a ML model (e.g., a computer vision model) can be developed using the camera input of the computing device. Although some computer vision models require depth-sensing camera or multi-camera inputs, the computing device may determine the user’s intent using a single non-depth (e.g., 2D) sensing camera input. This can allow the ML model to be implemented on computing devices with a single camera or a single non-depth sensing camera.
[0034] In step S130, a task based at least on the determined intent is triggered. For example, the user interface can trigger a task based on the determined intent. The task can be a function of the computing device. Example tasks can include taking a picture or video, increasing/decreasing volume, skipping songs, and/or the like. Although this disclosure describes using the index finger as a trigger, this is for illustration purposes; other fingers can be used as a trigger. As described above, the ML model can be trained with a diverse set of images.
[0035] For example, if the hand gesture is a pointing finger and the finger is pointing at an object, the user’s intent can be determined to be acquiring information about the object. The user interface can trigger the computing device to identify the object and to perform a search based on the identified object. For example, the computing device can search for a price for the object at one or more stores.
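The three steps of FIG. 1 can be pictured, informally, as a detect-determine-trigger pipeline. The following is a minimal sketch of that pipeline, assuming one camera frame per call; the helper names and the simple rule-based stubs are hypothetical placeholders standing in for the trained ML models described in the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gesture:
    name: str           # e.g., "point"
    target_label: str   # label of the object being pointed at


def detect_hand_gesture(image) -> Optional[Gesture]:
    # Step S110: in the disclosure this is a trained ML model run on a single
    # 2D camera frame; here it is a stub that always "detects" a pointing hand.
    return Gesture(name="point", target_label="product")


def determine_intent(gesture: Gesture, environment: str) -> str:
    # Step S120: map the (gesture, environment) pair to an intent label.
    if gesture.name == "point" and environment == "store":
        return "lookup_price"
    return "unknown"


TASKS = {
    "lookup_price": lambda g: print(f"searching stores for the price of {g.target_label}"),
    "unknown": lambda g: None,
}


def run_frame(image, environment: str) -> None:
    # Step S130: execute the task mapped to the determined intent.
    gesture = detect_hand_gesture(image)
    if gesture is not None:
        TASKS[determine_intent(gesture, environment)](gesture)


run_frame(image=None, environment="store")  # -> searching stores for the price of product
```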
[0036] FIG. 2 illustrates a trigger for detecting hand gestures according to at least one example implementation. In FIG. 2, a bounding box 205 of a user’s hand having a finger 210 (e.g., an index finger) is illustrated. In an example implementation, an object being detected can cause the bounding box 205 to be generated. Generation of the bounding box 205 can trigger an identification of the object in the bounding box 205. In some implementations, the bounding box 205 can be one of a plurality of bounding boxes generated in response to receiving the image (e.g., by a ML model). The user interface can cause the capture of an image which can trigger generation of the bounding box 205 and to determine, using the captured image and the ML model that a hand is within the bounding box 205.
[0037] After identifying the object as a hand, the user interface can cause the ML model (and/or trigger another ML model) to identify a pose and/or motion of the hand. For example, the ML model can be configured to search for fingertips. Determining that the hand includes the finger 210 in a pointing pose can trigger another task (e.g., yet another ML model and/or computer code) of the user interface. The task can include determining what the finger 210 is pointing at.
[0038] FIG. 3 illustrates determining intent based on disambiguation according to at least one example implementation. As illustrated in the image 300 of FIG. 3, the finger 210 is pointing at a giraffe 305. The finger 210 pointing at an identifiable object (e.g., the giraffe 305) can be used to disambiguate (remove uncertainty) and determine the user’s intent using the ML model. In an example implementation, the ML model can determine that the user is likely pointing to the giraffe 305. Determining that the user is likely pointing to the giraffe 305 can trigger the user interface (e.g., based on a ML model) to cause the computing device to perform a task (e.g., searching for information about a giraffe using a computer assistant).
[0039] FIG. 4 illustrates pointing gestures according to at least one example implementation. For example, the ML model can determine that a hand includes a pointing finger as discussed above. In this example, the ML model can determine that the user is likely pointing to text (e.g., in a book) as opposed to pointing at an object (e.g., the giraffe 305). The user’s intent can be determined based on the text being pointed to and the pose and/or motion of the hand. For example, the intent could be determined as translating, reading aloud, finding a definition, and/or the like of the text or a portion of the text (e.g., a word, a phrase, a sentence, and/or the like). Some examples of pointing gestures (as shown in FIG. 4) can include: a) pointing to a word by pointing directly under the word without covering it (405), b) pointing to a phrase by sliding a finger from left to right (410), c) pointing to a sentence by sliding the finger from left to right and double tapping to indicate the end of the selection (415), d) pointing to a paragraph by circling around the paragraph (420), and the like.
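The pointing patterns (a) through (d) above amount to a small mapping from gesture sequences to text-selection scopes. The sketch below illustrates that mapping; the event names are hypothetical abstractions of what a gesture-tracking model might emit and are not defined by the disclosure.

```python
def selection_scope(events):
    """Return the text scope selected by a sequence of pointing-gesture events."""
    if events == ["point_under_word"]:
        return "word"        # (a) point directly under a word without covering it
    if events == ["slide_left_to_right"]:
        return "phrase"      # (b) slide a finger from left to right
    if events == ["slide_left_to_right", "double_tap"]:
        return "sentence"    # (c) slide, then double tap to mark the end of the selection
    if events == ["circle"]:
        return "paragraph"   # (d) circle around the paragraph
    return "none"


print(selection_scope(["slide_left_to_right", "double_tap"]))  # -> sentence
```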
[0040] In some implementations, the hand gestures can be detected in a plurality of phases (e.g., two or more phases). In other words, a first ML model and a second ML model can be used to determine the intent of the user, and in later phases the first ML model may not be used. For example, in a first phase, the user interface can be configured to (e.g., using a ML model) generate a bounding box (e.g., bounding box 205) to identify (or help identify) an object as a user’s hand. In a second phase, the user interface can be configured to (e.g., using a ML model) determine the pose of the hand. This multi-phase approach to gesture identification can allow for continuous tracking of the user’s hand (e.g., pose and motion) without re-running at least one of the phases (e.g., the first phase to identify the hand) and can make detecting hand gestures and determining intent (as well as the subsequent execution of a task) much more efficient (e.g., in terms of speed and resource utilization (e.g., processor, memory, and/or the like)).
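A sketch of this multi-phase idea is shown below, assuming a (hypothetical) hand detector that runs only until a hand is found and a lighter pose/tracking model that runs on every subsequent frame without re-running the first phase.

```python
class TwoPhaseGestureTracker:
    def __init__(self, hand_detector, pose_tracker):
        self.hand_detector = hand_detector  # phase 1: find a hand, return its bounding box
        self.pose_tracker = pose_tracker    # phase 2: pose/motion of the hand in that box
        self.hand_box = None

    def process_frame(self, frame):
        if self.hand_box is None:
            # Phase 1 runs only while no hand is currently being tracked.
            self.hand_box = self.hand_detector(frame)
            return None
        # Phase 2: continuous tracking; phase 1 is skipped on these frames.
        pose, still_visible, self.hand_box = self.pose_tracker(frame, self.hand_box)
        if not still_visible:
            self.hand_box = None  # fall back to phase 1 on the next frame
        return pose


# Example wiring with trivial stand-ins for the two ML models:
tracker = TwoPhaseGestureTracker(
    hand_detector=lambda frame: (0, 0, 100, 100),
    pose_tracker=lambda frame, box: ("pointing", True, box),
)
tracker.process_frame("frame 1")           # phase 1: hand located
print(tracker.process_frame("frame 2"))    # phase 2 only -> pointing
```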
[0041] FIG. 5 illustrates a block diagram of a signal flow according to at least one example implementation. As shown in FIG. 5, the signal flow 500 begins with a detect an object 505 block. The detect an object 505 block can be configured to monitor for and detect an object. For example, after initiating the user interface, communication with a camera of the computing device can be established. As an image(s) are captured and communicated to the user interface, the user interface can determine an object (or a new object) has come within view of the camera based on the communicated image(s). In an identify the object 510 block, the user interface can use a trained ML model to identify the object. In an example implementation, the object can be a hand (e.g., indicating the user’s intent to present a hand gesture). However, the object can be, for example, a product for sale, an item in the real-world (e.g., a house, a tree, a street sign, furniture, a book, and/or the like).
[0042] In an identify the environment 515 block, the user interface can use a user indication, an application indication, a scan (using the camera) of the surroundings, and/or the like to identify the environment that the computing device is operating in. For example, the user interface could be instantiated by a computer application. The application can be a shopping application, an education application, a translation application, and/or the like. Therefore, the identify the environment 515 block can identify the environment as a store (or other shopping location), a school (or classroom), a reading location, and/or the like.
[0043] In addition, the identify the environment 515 block can use a trained ML model to identify the environment. In order to identify an environment, a computer vision model can be trained using images of objects that can be found in various environments. The images can include desks, chairs, blackboards and/or the like for a classroom environment. The images can include desks, chairs, bookshelves, checkout stations and/or the like for a library environment. The images can include trees, vegetation, grass, animals and/or the like for an outdoor environment. An image captured by the camera of the computing device can be input to the model. A result that includes a minimum number of matching objects can be classified as a likely environment. For example, if the image includes several types of trees, grass, and an animal, the environment can be classified as an outdoor environment. In addition, the ML model can use tools available to the computing device to refine a classified environment. For example, the ML model can use location information (e.g., from a global positioning system) together with the classified environment to identify the environment more precisely (e.g., as a national park, a state park, a golf course, and/or the like).
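A sketch of this classification step is shown below, under the assumption that an upstream object detector returns a list of labels for the captured image; the label sets, the minimum-match threshold, and the location refinement are illustrative, not values from the disclosure.

```python
ENVIRONMENT_OBJECTS = {
    "classroom": {"desk", "chair", "blackboard"},
    "library":   {"desk", "chair", "bookshelf", "checkout station"},
    "outdoor":   {"tree", "vegetation", "grass", "animal"},
}


def classify_environment(detected_labels, min_matches=2):
    """Pick the environment whose object set best matches the detected labels."""
    counts = {env: len(objects & set(detected_labels))
              for env, objects in ENVIRONMENT_OBJECTS.items()}
    environment, best = max(counts.items(), key=lambda item: item[1])
    return environment if best >= min_matches else "unknown"


def refine_with_location(environment, place_name=None):
    """Optionally refine a classified environment using location information."""
    return f"{environment} ({place_name})" if place_name else environment


labels = ["tree", "grass", "animal"]            # e.g., from the captured image
environment = classify_environment(labels)      # -> "outdoor"
print(refine_with_location(environment, "national park"))
```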
[0044] In a select ML model group(s) 520 block at least one ML model can be selected based on the object and the environment. In an example implementation, the user interface can be configured to detect and respond to hand gestures. Therefore, the at least one ML model can include an ML model trained using hand gestures. The at least one ML model can include a ML model to identify a hand gesture and a model and/or algorithm that can be unique to the environment and can map the hand gesture to user intent. The at least one ML model can be configured to identify the hand gesture and map the hand gesture to a user intent for the environment in a single operation. For example, the at least one ML model can include a detection layer or block configured to identify the hand gesture and map the hand gesture to a user intent.
[0045] Further, there can be a plurality of hand gesture techniques. For example, the hand gestures could be single hand, two hand, hand and voice, and/or the like. Therefore, the signal flow 500 can include a plurality of gesture ML models shown as gesture ML model group 1 525, gesture ML model group 2 530, … , and gesture ML model group n 535. The dashed lines indicate that one gesture ML model is likely to be selected at a time. However, other configurations are within the scope of this disclosure. Other trained ML models may also be included in example implementations as illustrated by, but not limited to, object ML model group 540.
[0046] Combinations of trained ML models can also be used by the user interface. For example, an application developer can develop an application for a grocery store chain. Therefore, the application developer can rely on gesture ML models available to all application developers and a custom ML model (e.g., as an object ML model group 540) trained based on products available at the grocery store. A user can open the developed application which may instantiate the user interface. The user can reach out for a product causing a hand to be detected and identified (the detect an object 505 block and the identify the object 510 block). The developed application can identify the environment as the grocery store (identify the environment 515 block) and select a gesture ML model group and an object ML model group. For example, a two-hand ML model group and the custom ML model group can be selected.
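The grocery example can be read as a simple selection step: given the identified object and the identified environment, pick which trained model groups to run. The registry keys and group names below are hypothetical and only illustrate that selection.

```python
MODEL_GROUPS = {
    "gesture/single_hand": "single-hand gesture ML model group",
    "gesture/two_hand":    "two-hand gesture ML model group",
    "gesture/hand_voice":  "hand-and-voice gesture ML model group",
    "object/grocery":      "custom grocery-product ML model group",
}


def select_model_groups(identified_object, environment):
    """Return the ML model groups to run for this object/environment pair."""
    selected = []
    if identified_object == "hand":
        # The grocery application opts into two-hand gestures; others use single-hand.
        key = "gesture/two_hand" if environment == "grocery store" else "gesture/single_hand"
        selected.append(MODEL_GROUPS[key])
    if environment == "grocery store":
        selected.append(MODEL_GROUPS["object/grocery"])
    return selected


print(select_model_groups("hand", "grocery store"))
# -> ['two-hand gesture ML model group', 'custom grocery-product ML model group']
```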
[0047] The signal flow 500 can include at least one repetitive flow operation illustrated in FIG. 5 as a flow 575 block and a flow 580 block, illustrated with dashed lines to indicate that the blocks may not necessarily be structurally together or in one location. The flow 575 block includes an identify gesture 545 block, a trigger task(s) 550 block, and a monitor for gesture 555 block. The identify gesture 545 block can be configured to receive an image from the camera of the computing device. The image can be used as an input to a trained ML model to identify the gesture. Identifying the gesture can include assigning a variable used to identify unique trained gestures.
[0048] The trigger task(s) 550 block can include instructions (e.g., computer code) that can cause the computing device to execute a task based on the identified gesture. In an example implementation, each task can be identified by a unique variable. The unique variable can be the same as the variable that identifies the gesture. Alternatively, or in addition, the unique variable can be mapped to the identified gesture or the identified gesture can be mapped to the unique variable. The task can be any task that can be performed by the computing device. For example, the task can be a search, a translation, reading aloud (e.g., text to speech), a computer assistant task, storing data (e.g., an image), mapping data (e.g., mapping a business card to a contact), and/or the like.
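The mapping from an identified gesture to a triggered task can be as simple as a dictionary keyed by the gesture's unique variable, as in the sketch below. The IDs and task functions are illustrative assumptions, not values from the disclosure.

```python
def task_show_price(item):
    print(f"price lookup for {item}")


def task_add_to_cart(item):
    print(f"added {item} to the shopping cart")


# Unique gesture variables mapped to tasks (the mapping could equally go the other way).
GESTURE_TO_TASK = {
    101: task_show_price,    # e.g., reaching out and grabbing an item
    102: task_add_to_cart,   # e.g., a swiping gesture with the other hand
}


def trigger_task(gesture_id, context):
    task = GESTURE_TO_TASK.get(gesture_id)
    if task is not None:
        task(context)


trigger_task(101, "cereal box")   # -> price lookup for cereal box
trigger_task(102, "cereal box")   # -> added cereal box to the shopping cart
```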
[0049] Continuing the grocery application example described above, a gesture by the shopper can be identified and a task performed. For example, reaching out and grabbing an item can cause the display or an audible indication of the price, nutrition, or other information about the product. Further, a swiping gesture with the other hand can cause the item to be placed in the shopping cart. These tasks are just examples, other tasks are within the scope of this disclosure.
[0050] The monitor for gesture 555 block can monitor images captured and communicated by the camera. The monitor for gesture 555 block can use a trained ML model that can test the image and determine if the image likely includes a gesture. If the image likely includes a gesture, the image can be communicated to the identify gesture 545 block. If the identify gesture 545 block identifies the image as a gesture, processing continues to the trigger task(s) 550 block. Otherwise, processing returns to the monitor for gesture 555 block. In some implementations, the signal flow can begin with flow 575 and/or flow 580. In other words, a gesture (flow 575) could be identified first (e.g., a hand and ML model group(s) can be preconfigured via an application) followed by an object (flow 580), or vice versa.
[0051] The flow 580 block includes an identify object 560 block, a trigger task(s) 565 block and a monitor for object 570 block. The identify object 560 block can use a trained ML model to identify the object. The trigger task(s) 565 block can cause a task to be performed based on the identity of the object. Continuing the grocery application example described above, the object can be identified as a product and the task can be to look up information about the product. Further, two or more ML model groups can be configured to operate together. For example, the trigger task(s) 550 block can trigger the starting of the identify object 560 block.
[0052] The monitor for object 570 block can monitor images captured and communicated by the camera. The monitor for object 570 block can use a trained ML model that can test the image and determine if the image likely includes an object (e.g., an object that is different from the previously identified object). If the image likely includes an object, the image can be communicated to the identify object 560 block. If the identify object 560 block identifies the image as an object, processing continues to the trigger task(s) 565 block. Otherwise, processing returns to the monitor for object 570 block.
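Both repetitive flows (575 and 580) follow the same monitor/identify/trigger loop. A minimal sketch of that loop follows; the frame source and the callables are hypothetical stand-ins for the camera feed and the trained ML models.

```python
def monitor_loop(frames, is_likely_candidate, identify, trigger_task):
    for frame in frames:
        if not is_likely_candidate(frame):   # monitor for gesture 555 / object 570
            continue                         # keep monitoring
        result = identify(frame)             # identify gesture 545 / object 560
        if result is not None:
            trigger_task(result)             # trigger task(s) 550 / 565
        # otherwise processing returns to the monitor step on the next frame


# Example wiring with trivial stand-ins:
monitor_loop(
    frames=["frame without hand", "frame with hand"],
    is_likely_candidate=lambda frame: "with hand" in frame,
    identify=lambda frame: "point",
    trigger_task=lambda gesture: print(f"triggered task for gesture: {gesture}"),
)
```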
[0053] The methods described with regard to FIG. 6 can be performed due to the execution of software code stored in a memory (e.g., a non-transitory computer readable storage medium) associated with an apparatus and executed by at least one processor associated with the apparatus. However, alternative embodiments are contemplated such as a system embodied as a special purpose processor. The special purpose processor can be a graphics processing unit (GPU). In other words, the user interface can be implemented in a GPU of a one-person view device (e.g., a wearable smart device, a head-mount display, and/or the like).
[0054] A GPU can be a component of a graphics card. The graphics card can also include video memory, a random access memory digital-to-analogue converter (RAMDAC) and driver software. The video memory can be a frame buffer that stores digital data representing an image, a frame of a video, an object of an image, or a scene of a frame. A RAMDAC can be configured to read the contents of the video memory, convert the contents into an analogue RGB signal, and send the analogue signal to a display or monitor.
[0055] The driver software can be the software code stored in the memory referred to above. The software code can be configured to implement the method described herein. Although the methods described below are described as being executed by a processor and/or a special purpose processor, the methods are not necessarily executed by a same processor. In other words, at least one processor and/or at least one special purpose processor may execute the method described below with regard to FIG. 6.
[0056] FIG. 6 illustrates a flowchart of a method according to at least one example implementation. As shown in FIG. 6, in step S605 hand motion of a user is detected. For example, a hand motion can be detected as a hand coming into view of a camera of a computing device. The hand motion can also be of a hand that is within the view of the camera and changes position (e.g., change a pose, move from side to side, and/or the like). The motion can indicate that a user is in the process of showing an intent.
[0057] In step S610 a pose of the hand is detected. For example, the pose can be detected as an image captured by the camera. The pose can be a finger pointing, a hand grabbing, a pinch, a circling of a finger, and/or the like.
[0058] In step S615 an environment is identified. The pose of the hand can be interpreted differently based on the environment (e.g., use case, scenario, tool, application, and/or the like). In order to determine an intention of the user (e.g., based on a hand gesture), the environment that the computing device is operating in should be determined. For example, the environment can be based on a location of the computing device, an application interacting with the user interface, and/or the like. The environment can be a store, a classroom, a reading location, a park, an outdoor space (e.g., a forest, a lake, and/or the like), and/or the like. The environment can be identified based on a user input (e.g., a voice command) or a computer application setting. For example, the user can speak out loud that he/she is reading a book, is in class, or wants to open a shopping application. Alternatively, or in addition, identification can be performed using a ML model that uses an image of the real-world environment of the computing device.
[0059] For example, in order to identify an environment, a computer vision model can be trained using images of objects that can be found in various environments. The images can include desks, chairs, blackboards and/or the like for a classroom environment. The images can include desks, chairs, bookshelves, checkout stations and/or the like for a library environment. The images can include trees, vegetation, grass, animals and/or the like for an outdoor environment. An image captured by the camera of the computing device can be input to the model. A result that includes a minimum number of matching objects can be classified as a likely environment. For example, if the image includes several types of trees, grass, and an animal, the environment can be classified as an outdoor environment. In addition, the ML model can use tools available to the computing device to refine a classified environment. For example, the ML model can use location information (e.g., from a global positioning system) together with the classified environment to identify the environment more precisely (e.g., as a national park, a state park, a golf course, and/or the like).
[0060] In step S620 a gesture is identified based on the pose of the hand using a trained ML model. An ML model can be trained using a plurality of hand poses that can be made by the user of a computer device. The ML model can be trained based on a plurality of images (e.g., of hand poses as gestures) and ground-truth images. For example, the pose can be captured as an image using a camera of the computing device. The image can be input to the trained ML model. The trained ML model can identify the gesture based on the image. The trained ML model can output a gesture identification (e.g., as a unique ID number).
[0061] In step S625 an intent of the user is identified based on the gesture and the environment. The at least one ML model can include a ML model to identify a hand gesture (step S620) and a ML model and/or algorithm that can be unique to the environment and that maps the hand gesture to a user intent. Alternatively, the at least one ML model can be configured to identify the hand gesture and map the hand gesture to a user intent for the environment in a single operation. For example, the at least one ML model can include a detection layer or block configured to identify the hand gesture and map the hand gesture to a user intent.
[0062] In an example implementation, the computer device can operate in a real-world space. Unlike a computer device executing an AR application (e.g., one that can identify and respond to only a limited number of gestures), example implementations can be configured to determine the intent of the user based on an essentially unlimited number of gestures (constrained only by the set of trained gestures) and an essentially unlimited number of environments (e.g., real-world spaces).
[0063] For example, a gesture can indicate a different user intent based on the environment. Accordingly, different environments can have different maps, look-up tables, algorithms and/or ML models that are configured to determine the intent of the user. Therefore, a map, a look-up table, an algorithm and/or a ML model can be selected based on the environment. In an example implementation, determining or identifying the user intent can include mapping the identified gesture to the user intent. Determining or identifying the user intent can include using a map to identify the user intent based on the identified gesture, the map being based on the environment. Determining or identifying the user intent can include looking up the user intent in a look-up table based on the identified gesture (e.g., using the identified gesture as a key). Determining or identifying the user intent can include using a ML model that includes a detection layer or block configured to identify the hand gesture and map the hand gesture to a user intent.
[0064] For example, a pointing gesture within a reading (e.g., of a book) environment can indicate a different intent than a pointing gesture in a shopping environment. Therefore, the ML model and/or a map or look-up table configured to map the hand gesture to a user intent can be different for the reading environment and the shopping environment. In other words, each ML model can have a map (e.g., a look-up table) used to determine the user’s intent by mapping the gesture to a likely intent. Alternatively, an application can be configured to use a ML model, configured to identify hand gestures, that is available to application developers. The application can further include a map or look-up table configured to map the hand gesture to a user intent.
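A minimal sketch of such environment-specific look-up tables is shown below; the same identified gesture resolves to a different intent in the reading and shopping environments. The table contents are illustrative assumptions.

```python
# One gesture-to-intent map (look-up table) per environment.
INTENT_MAPS = {
    "reading":  {"point": "define_word", "slide": "select_phrase"},
    "shopping": {"point": "lookup_product", "grab": "show_price"},
}


def identify_intent(environment, gesture):
    """Look up the user intent for an identified gesture, keyed by environment."""
    return INTENT_MAPS.get(environment, {}).get(gesture, "unknown")


print(identify_intent("reading", "point"))    # -> define_word
print(identify_intent("shopping", "point"))   # -> lookup_product
```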
[0065] In step S630 a task based on the intent of the user is performed. For example, a task can be a computer implemented task. The user’s intent can be mapped to a task which is performed in response to identifying the user’s intent. The task can be to output (e.g., an audible output) a definition of a word, translate a word, store information (e.g., a business card), search for information (e.g., a price, encyclopedic information, and/or the like), turn on/off an appliance, and/or the like.
[0066] FIG. 7 illustrates a block diagram of a gesture processing system according to at least one example embodiment. As shown in FIG. 7, a gesture processing system 700 includes at least one processor 705, at least one memory 710, a controller 720, a user interface 725, an ML model module 730, and a task module 735. The at least one processor 705, the at least one memory 710, the controller 720, the user interface 725, the ML model module 730 and the task module 735 are communicatively coupled via bus 715.
[0067] The at least one processor 705 can be utilized to execute instructions stored on the at least one memory 710, so as to thereby implement the various features and functions described herein, or additional or alternative features and functions. The at least one processor 705 can be a general-purpose processor. The at least one processor 705 can be a graphics processing unit (GPU). The at least one processor 705 and the at least one memory 710 can be utilized for various other purposes. In particular, the at least one memory 710 can represent an example of various types of memory and related hardware and software which might be used to implement any one of the modules described herein.
[0068] The at least one memory 710 can be configured to store data and/or information associated with the gesture processing system 700. For example, the at least one memory 710 can be configured to store code associated with implementing a user interface to capture and/or edit images. For example, the at least one memory 710 can be configured to store code associated with identifying a gesture, identifying and implementing a ML model, identifying and implementing a computing task, and/or the like. The at least one memory 710 can be a non-transitory computer readable medium with code that, when executed by the processor 705, causes the processor 705 to implement one or more of the techniques described herein. The at least one memory 710 can be a shared resource. For example, the gesture processing system 700 can be an element of a larger system (e.g., a server, a personal computer, a mobile device, a head-mount display, smart glasses, a hands-free computer device, and the like). Therefore, the at least one memory 710 can be configured to store data and/or information associated with other elements (e.g., image/video rendering, web browsing, computer assistant, and/or wired/wireless communication) within the larger system.
[0069] The controller 720 can be configured to generate various control signals and communicate the control signals to various blocks in the gesture processing system 700. The controller 720 can be configured to generate the control signals to implement the techniques described herein. The controller 720 can be configured to control the task module 735 to execute software code to perform a computer-based process according to example embodiments. For example, the controller 720 can generate control signals corresponding to parameters to implement a search, control an application, store data, execute an ML model, train an ML model, and/or the like.
[0070] The user interface 725 can be configured to communicate with a camera of a computing device, receive an image and/or a plurality of images from the camera, and use a trained ML model to process the image(s). After processing the image, the user interface can be configured to identify and trigger the execution of a computer implemented task or process.
[0071] The ML model module 730 can be configured to store, train and execute at least one ML model. The ML model can be based on a convolutional neural network. The ML model can be trained for a plurality of users and/or a single user. For example, the ML model can be trained and stored on a network device. In an initialization process, the ML model can be downloaded from the network device to a local device. The ML model can be further trained before use and/or as the ML model is used by the local device.
[0072] The task module 735 can be configured to store and execute at least one computer program (e.g., computer code) configured to cause the performance of a task by the computer device. The task can cause the computer device to implement a search, control an application, control a computer assistant, interpret and store data, translate text, convert text to speech, and/or the like.
[0073] FIG. 8A illustrates layers in a convolutional neural network with no sparsity constraints. FIG. 8B illustrates layers in a convolutional neural network with sparsity constraints. With reference to FIGS. 8A and 8B, various configurations of neural networks for use in at least one example implementation will be described. An example layered neural network is shown in FIG. 8A. The layered neural network includes three layers 810, 820, 830. Each layer 810, 820, 830 can be formed of a plurality of neurons 805. In this implementation, no sparsity constraints have been applied. Therefore, all neurons 805 in each layer 810, 820, 830 are networked to all neurons 805 in any neighboring layers 810, 820, 830.
[0074] The example neural network shown in FIG. 8A is not computationally complex due to the small number of neurons 805 and layers. However, the arrangement of the neural network shown in FIG. 8A may not scale up to larger sizes of networks due to the density of connections (e.g., the connections between neurons/layers). In other words, the computational complexity becomes too great as the size of the network scales, and it grows in a non-linear fashion. Therefore, it can be too computationally complex for all neurons 805 in each layer 810, 820, 830 to be networked to all neurons 805 in the one or more neighboring layers 810, 820, 830 if neural networks need to be scaled up to work on inputs with a large number of dimensions.
[0075] An initial sparsity condition can be used to lower the computational complexity of the neural network. For example, if a neural network is functioning as an optimization process, the neural network approach can work with high dimensional data by limiting the number of connections between neurons and/or layers. An example of a neural network with sparsity constraints is shown in FIG. 8B. The neural network shown in FIG. 8B is arranged so that each neuron 805 is connected only to a small number of neurons 805 in the neighboring layers 840, 850, 860. This can form a neural network that is not fully connected, and which can scale to function with higher dimensional data. For example, the neural network with sparsity constraints can be used as an optimization process for a model and/or for generating a model (e.g., a model for rating/downrating a reply based on the user posting the reply). The smaller number of connections in comparison with a fully networked neural network allows the number of connections between neurons to scale in a substantially linear fashion.
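A short worked example of the scaling argument: between two layers of width n, full connectivity needs on the order of n² connections, while a sparsity constraint of at most k incoming connections per neuron needs only about n·k. The widths and the value of k below are illustrative.

```python
def dense_connections(n_in, n_out):
    return n_in * n_out            # every neuron connects to every neighbor


def sparse_connections(n_out, k):
    return n_out * k               # each neuron connects to only k neighbors


for width in (100, 1_000, 10_000):
    print(width,
          dense_connections(width, width),   # grows quadratically
          sparse_connections(width, k=8))    # grows roughly linearly
```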
[0076] In some implementations, neural networks that are fully connected, or that are not fully connected but have different specific configurations from that described in relation to FIG. 8B, can be used. Further, in some implementations, convolutional neural networks that are not fully connected and have less complexity than fully connected neural networks can be used. Convolutional neural networks can also make use of pooling or max-pooling to reduce the dimensionality (and hence complexity) of the data that flows through the neural network. Other approaches to reduce the computational complexity of convolutional neural networks can be used.
[0077] FIG. 9 illustrates a block diagram of a model according to an example embodiment. A model 900 can be a convolutional neural network (CNN) including a plurality of convolutional layers 915, 920, 925, 935, 940, 945, 950, 955, 960 and an add layer 930. The plurality of convolutional layers 915, 920, 925, 935, 940, 945, 950, 955, 960 can each be one of at least two types of convolution layers. As shown in FIG. 9, the convolutional layer 915 and the convolutional layer 925 can be of a first convolution type. The convolutional layers 920, 935, 940, 945, 950, 955 and 960 can be of a second convolution type. An image (not shown) can be input to the CNN. A normalize layer 905 can convert the input image into an image 910 which can be used as an input to the CNN. The model 900 further includes a detection layer 975 and a suppression layer 980. The model 900 can be based on a computer vision model.
[0078] The normalize layer 905 can be configured to normalize the input image. Normalization can include converting the image to M×M pixels. In an example implementation, the normalize layer 905 can normalize the input image to 300×300 pixels. In addition, the normalize layer 905 can generate the depth associated with the image 910. In an example implementation, the image 910 can have a plurality of channels, depths or feature maps. For example, an RGB image can have three channels: a red (R) channel, a green (G) channel and a blue (B) channel. In other words, for each of the M×M (e.g., 300×300) pixels, there are three (3) channels. A feature map can have the same structure as an image; however, instead of pixels, a feature map has a value based on at least one feature (e.g., color, frequency domain, edge detectors, and/or the like).
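The overall shape of the model 900 can be sketched as follows in PyTorch, under stated assumptions: a normalize step that resizes any input to 300×300 with three channels, a backbone mixing two convolution types (here, standard strided convolutions and depthwise separable convolutions), a detection layer producing per-location boxes and scores, and non-maximum suppression as the suppression step. The layer counts, channel widths, and single-score head are illustrative; the disclosure does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import nms


def depthwise_separable(c_in, c_out):
    # "Second convolution type": a depthwise conv followed by a 1x1 pointwise conv.
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in), nn.ReLU(),
        nn.Conv2d(c_in, c_out, 1), nn.ReLU(),
    )


class Model900Sketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),  # "first convolution type"
            depthwise_separable(16, 32),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),
            depthwise_separable(32, 64),
            depthwise_separable(64, 64),
        )
        # Detection layer: 4 box parameters + 1 objectness score per spatial location.
        self.detect = nn.Conv2d(64, 5, 1)

    def forward(self, image):
        # Normalize layer: resize any input image to 300x300 (3 channels assumed).
        x = F.interpolate(image, size=(300, 300), mode="bilinear", align_corners=False)
        out = self.detect(self.backbone(x))               # (N, 5, H, W)
        n, _, h, w = out.shape
        out = out.permute(0, 2, 3, 1).reshape(n, h * w, 5)
        cx, cy = out[..., 0], out[..., 1]
        bw, bh = out[..., 2].exp(), out[..., 3].exp()     # positive widths/heights
        boxes = torch.stack([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], dim=-1)
        scores = out[..., 4].sigmoid()
        # Suppression layer: keep only non-overlapping, high-scoring boxes.
        keep = nms(boxes[0], scores[0], iou_threshold=0.5)
        return boxes[0][keep], scores[0][keep]


model = Model900Sketch()
boxes, scores = model(torch.rand(1, 3, 480, 640))  # any input size is normalized first
```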