Microsoft Patent | Supervised reimaging of vulnerable computing devices with prioritization, auto healing, and pattern detection
Publication Number: 20210149766
Publication Date: 20210520
Applicant: Microsoft
Abstract
Technologies are disclosed for supervised reimaging of vulnerable computing devices with prioritization, auto healing, and pattern detection. A system for re-imaging computing devices includes a scheduler that implements a workflow for reimaging computing devices using software agents. The system also includes a supervisor that monitors state data to determine if the reimaging workflow has failed for any computing devices and initiates an auto heal job for remediating failure of the re-imaging workflow. The system also includes a vulnerability manager that can perform various operations with respect to computing devices for which a reimaging workflow has failed a predetermined number of times. The vulnerability manager might also, or alternately, identify failure patterns (e.g. multiple instances of a reimaging workflow failing for the same reason) and initiate various actions based upon the identified patterns.
Claims
1. A computer-implemented method performed by a computing device, the method comprising: initiating a workflow for re-imaging a computing device; during execution of the workflow, storing state data describing a state of the re-imaging of the computing device in a data store; determining if the state data in the data store indicates that the re-imaging of the computing device has failed; if the state data indicates that re-imaging of the computing device has failed, initiating an auto heal job for remediating failure of the re-imaging of the computing device; determining if the auto heal job has failed; and responsive to determining that the auto heal job has failed, causing the workflow for re-imaging the computing device to be retried after a predefined period of time has elapsed.
2. The computer-implemented method of claim 1, further comprising causing the workflow for re-imaging the computing device to resume responsive to determining that the auto heal job has succeeded.
3. The computer-implemented method of claim 1, further comprising: determining the workflow for re-imaging the computing device failed a predetermined number of times; and responsive to determining the workflow for re-imaging the computing device failed a predetermined number of times, storing state data in the data store for the computing device specifying a reason for failure of the workflow and a date upon which the computing device last received a software patch.
4. The computer-implemented method of claim 3, further comprising: identifying one or more patterns based upon the state data; and initiating one or more actions based upon the identified patterns.
5. The computer-implemented method of claim 3, further comprising shutting down the computing device or causing network traffic to be removed from the computing device responsive to determining the workflow for re-imaging the computing device has failed a predetermined number of times.
6. The computer-implemented method of claim 1, wherein the workflow for re-imaging the computing device is performed by a plurality of software agents under control of a scheduler.
7. The computer-implemented method of claim 6, wherein the software agents provide status messages comprising data describing a status of the workflow for re-imaging the computing device to the scheduler, and wherein the scheduler is configured to update the state data in the data store based upon the status messages.
8. A computer-readable storage media having computer-executable instructions stored thereupon which, when executed by a computer, cause the computer to: store state data describing a state of the re-imaging of a computing device in a data store during execution of a workflow for re-imaging the computing device, determine if the state data in the data store indicates that the re-imaging of the computing device has failed; initiate an auto heal job for remediating failure of the re-imaging of the computing device if the state data indicates that re-imaging of the computing device has failed; determine that the auto heal job has failed; and responsive to determining that the auto heal job has failed, cause the workflow for re-imaging the computing device to be retried after a predefined period of time has elapsed.
9. The computer-readable storage media of claim 8, having further computer-executable instructions stored thereupon which, when executed by the computer, cause the computer to resume the workflow for re-imaging the computing device responsive to determine the auto heal job has succeeded.
10. The computer-readable storage media of claim 8, having further computer-executable instructions stored thereupon which, when executed by the computer, cause the computer to: determine the workflow for re-imaging the computing device failed a predetermined number of times; and responsive to determining the workflow for re-imaging the computing device failed a predetermined number of times, store state data in the data store for the computing device specifying a reason for failure of the workflow and a date upon which the computing device last received a software patch.
11. The computer-readable storage media of claim 10, having further computer-executable instructions stored thereupon which, when executed by the computer, cause the computer to: identify one or more patterns based upon the state data; and initiate one or more actions based upon the identified patterns.
12. The computer-readable storage media of claim 10, having further computer-executable instructions stored thereupon which, when executed by the computer, cause the computer to shut down the computing device or cause network traffic to be removed from the computing device responsive to determining the workflow for re-imaging the computing device has failed a predetermined number of times.
13. The computer-readable storage media of claim 8, wherein the workflow for re-imaging the computing device is performed by a plurality of software agents under control of a scheduler.
14. The computer-readable storage media of claim 13, wherein the software agents provide status messages comprising data describing a status of the workflow for re-imaging the computing device to the scheduler, and wherein the scheduler is configured to update the state data in the data store based upon the status messages.
15. A computing device, comprising: a processor; a network interface unit; and a computer-readable storage media having instructions stored thereupon which, when executed by the processor, cause the computing device to: store state data describing a state of the re-imaging of a computing device in a data store during execution of a workflow for re-imaging the computing device, determine if the state data in the data store indicates that the re-imaging of the computing device has failed; initiate an auto heal job for remediating failure of the re-imaging of the computing device if the state data indicates that re-imaging of the computing device has failed; determine that the auto heal job has failed; and responsive to determining that the auto heal job has failed, cause the workflow for re-imaging the computing device to be retried after a predefined period of time has elapsed.
16. The computing device of claim 15, wherein the computer-readable storage media has further computer-executable instructions stored thereupon which, when executed by the computer, cause the computer to resume the workflow for re-imaging the computing device responsive to determine the auto heal job has succeeded.
17. The computing device of claim 15, wherein the computer-readable storage media has further computer-executable instructions stored thereupon which, when executed by the computer, cause the computer to: determine the workflow for re-imaging the computing device failed a predetermined number of times; and responsive to determining the workflow for re-imaging the computing device failed a predetermined number of times, store state data in the data store for the computing device specifying a reason for failure of the workflow and a date upon which the computing device last received a software patch.
18. The computing device of claim 17, wherein the computer-readable storage media has further computer-executable instructions stored thereupon which, when executed by the computer, cause the computer to: identify one or more patterns based upon the state data; and initiate one or more actions based upon the identified patterns.
19. The computing device of claim 17, wherein the computer-readable storage media has further computer-executable instructions stored thereupon which, when executed by the computer, cause the computer to shut down the computing device or cause network traffic to be removed from the computing device responsive to determining the workflow for re-imaging the computing device has failed a predetermined number of times.
20. The computing device of claim 15, wherein the workflow for re-imaging the computing device is performed by a plurality of software agents under control of a scheduler.
Description
BACKGROUND
[0001] In order for computing devices to operate with a high level of security, it is imperative that security patches that remedy software vulnerabilities be applied in a timely manner. Security patches can be easily applied to vulnerable computing devices on a small scale. However, it can be extremely complex to apply security patches to vulnerable computing devices on a very large scale. For example, modern data centers can have tens or even hundreds of thousands of server computers. In these environments, the process for applying security updates can be complex and, as a result, can be error prone.
[0002] Errors occurring during the process of applying security updates to server computers (a process that might be referred to herein as “reimaging”) can cause serious operational issues. For example, errors occurring during reimaging can cause server computers to remain out of service for extended periods of time. As another example, a failed reimaging of a server computer might have to be restarted from the beginning or might be unnecessarily repeated on the same computer, thereby wasting computing resources. Moreover, server computers vulnerable to attacks might not be patched quickly enough, thereby prolonging their exposure to malicious attacks.
[0003] It is with respect to these and other technical challenges that the disclosure made herein is presented.
SUMMARY
[0004] Technologies are disclosed herein for supervised reimaging of vulnerable computing devices with prioritization, auto healing, and pattern detection. Through implementations of the disclosed technologies, failures that occur during the reimaging of computing devices, such as server computers, can be prevented or quickly remediated, thereby maximizing the up time of these devices, conserving computing resources, and improving security. Technical benefits other than those specifically identified herein might also be realized through implementations of the disclosed technologies.
[0005] In one embodiment, a system for re-imaging computing devices includes a scheduler. The scheduler implements a workflow for reimaging computing devices, such as server computers. As discussed above, the reimaging workflow includes, among other things, installing security patches (which might also be referred to herein as “security updates” or “software patches”) that address security vulnerabilities on computing devices.
[0006] In one embodiment, the scheduler controls and coordinates the operation of software agents that implement steps of the reimaging workflow. For instance, software agents might provide functionality for vacating virtual machines (“VMs”) from the computing devices, performing quality checks on the computing devices, installing security patches, or installing other software components such as instrumentation components.
[0007] In some embodiments, the software agents provide status messages to the scheduler that include data describing various aspects of the status of the reimaging workflow. For instance, the status messages might indicate that a step in the reimaging workflow has started, completed, or failed. In turn, the scheduler stores the status messages in an appropriate data store for use by itself and other components, some of which are described below.
[0008] In some embodiments, the disclosed system also includes a supervisor. The supervisor monitors the state data stored in the data store to determine if the reimaging workflow has failed for any computing devices. If the workflow has failed for a computing device, the supervisor initiates an auto heal job for remediating failure of the re-imaging workflow for the computing device. The auto heal job determines the reason for the failure of the reimaging workflow and applies one or more actions in an attempt to remediate the identified failure.
[0009] If the auto heal job is successful, the failed reimaging workflow for the computing device can be resumed at the point in the workflow at which it terminated due to the failure. In this way, the reimaging workflow does not need to be restarted following a failure, thereby conserving the utilization of computing resources.
[0010] If the auto heal job is unsuccessful, the workflow for re-imaging the computing device is retried after a predefined period of time has elapsed. If the workflow for re-imaging the computing device fails a predetermined number of times, state data can be stored in the data store that specifies a reason, or reasons, for failure of the reimaging workflow, a date upon which the computing device last received a security update, and potentially other types of data.
[0011] The disclosed system also includes a vulnerability manager in some embodiments. The vulnerability manager can perform various operations with respect to computing devices for which the reimaging workflow has failed a predetermined number of times. For example, and without limitation, the vulnerability manager might shut down computing devices for which the reimaging process has failed a predetermined number of times or prevent network traffic from reaching the computing devices.
[0012] The vulnerability manager might also, or alternately, identify patterns (e.g. multiple instances of the reimaging workflow failing for the same reason) based upon the state data and initiate various actions based upon the identified patterns such as, for example, creating trouble tickets, generating alerts, or other types of reporting. The proper engineering team can then take action to remediate the failure. Once the failure of the reimaging workflow for these computing devices has been remediated, the state data for the computing devices can be updated to indicate that the devices are again available for application of the reimaging workflow to these devices in a typical fashion.
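As an illustration only, the pattern identification described above (multiple instances of the reimaging workflow failing for the same reason) could be sketched as a simple count of failure reasons. The function names, the reason strings, the ticket text, and the threshold of three occurrences are hypothetical and are not taken from the disclosure.

```python
from collections import Counter


def find_failure_patterns(failure_reasons, min_occurrences=3):
    """Return the failure reasons shared by at least min_occurrences devices.

    failure_reasons maps a device ID to the recorded reason for its failed
    reimaging workflow. The threshold is an assumed, configurable value.
    """
    counts = Counter(failure_reasons.values())
    return {reason: n for reason, n in counts.items() if n >= min_occurrences}


def actions_for_patterns(patterns):
    """Turn each identified pattern into a (hypothetical) trouble-ticket line."""
    return [
        f"ticket: {n} devices failed reimaging with '{reason}'"
        for reason, n in sorted(patterns.items())
    ]
```

In this sketch a ticket is the only action; alerts or other reporting could be produced from the same pattern dictionary.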
[0013] It should be appreciated that the subject matter disclosed herein can be implemented as a computer-controlled apparatus, a computer-implemented method, a computing device, or as an article of manufacture such as a computer readable medium. These and various other features will be apparent from a reading of the following Detailed Description and a review of the associated drawings.
[0014] This Summary is provided to introduce a brief description of some aspects of the disclosed technologies in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended that this Summary be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] FIG. 1 is a computing architecture diagram showing an overview of an illustrative configuration for a system for supervised reimaging of vulnerable computing devices, according to one embodiment;
[0016] FIG. 2A is a computing architecture diagram illustrating aspects of the operation of a scheduler utilized in embodiments disclosed herein;
[0017] FIG. 2B is a flow diagram showing a routine that illustrates aspects of the operation of the scheduler shown in FIG. 2A, according to one embodiment disclosed herein;
[0018] FIG. 3A is a computing architecture diagram illustrating aspects of the operation of a supervisor utilized in embodiments disclosed herein;
[0019] FIG. 3B is a flow diagram showing a routine that illustrates aspects of the operation of the supervisor shown in FIG. 3A, according to one embodiment disclosed herein;
[0020] FIG. 4A is a computing architecture diagram illustrating additional aspects of the operation of the supervisor shown in FIG. 3A;
[0021] FIG. 4B is a flow diagram showing a routine that illustrates additional aspects of the operation of the supervisor shown in FIGS. 3A and 4A, according to one embodiment disclosed herein;
[0022] FIG. 5A is a computing architecture diagram illustrating aspects of the operation of a vulnerability manager, according to one embodiment disclosed herein;
[0023] FIG. 5B is a flow diagram showing a routine that illustrates additional aspects of the operation of the vulnerability manager shown in FIG. 5A, according to one embodiment disclosed herein;
[0024] FIG. 6 is a computer architecture diagram illustrating an illustrative computer hardware and software architecture for a computing system capable of implementing aspects of the techniques and technologies presented herein;
[0025] FIG. 7 is a network diagram illustrating an illustrative distributed computing environment capable of implementing aspects of the techniques and technologies presented herein; and
[0026] FIG. 8 is a computer architecture diagram illustrating a computing device architecture for a computing device capable of implementing aspects of the techniques and technologies presented herein.
DETAILED DESCRIPTION
[0027] The following detailed description is directed to technologies for supervised reimaging of vulnerable computing devices with prioritization, auto healing, and pattern detection. As discussed briefly above and in greater detail below, implementations of the disclosed technologies can prevent or quickly remediate failures that occur during the reimaging of computing devices, thereby maximizing the up time of these devices, conserving computing resources, and improving security. Other technical benefits not specifically identified herein can also be realized through implementations of the disclosed technologies.
[0028] While the subject matter described herein is primarily presented in the context of the reimaging of server computers in a data center, those skilled in the art will recognize that the disclosed technologies can be utilized to reimage other types of computing devices in other types of environments. Those skilled in the art will also appreciate that the subject matter described herein can be practiced with various computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, computing or processing systems embedded in devices (such as wearable computing devices, automobiles, home automation etc.), minicomputers, mainframe computers, and the like.
[0029] In the following detailed description, references are made to the accompanying drawings that form a part hereof, and which are shown by way of illustration specific configurations or examples. Referring now to the drawings, in which like numerals represent like elements throughout the several FIGS., aspects of various technologies for supervised reimaging of vulnerable computing devices with prioritization, auto healing, and pattern detection will be described.
[0030] FIG. 1 is a computing architecture diagram showing an overview of an illustrative configuration for a system 100 for supervised reimaging of vulnerable computing devices, according to one embodiment. As discussed briefly above, the disclosed system 100 for re-imaging computing devices includes a scheduler 102 in some embodiments. The scheduler 102 is a software component that implements a reimaging workflow 104 for reimaging computing devices 106, such as server computers.
[0031] As also discussed above, the reimaging workflow 104 can include, among other things, installing security patches (which might also be referred to herein as “security updates” or “software patches”) that address security vulnerabilities on computing devices 106, preferably within a predetermined amount of time (e.g. one month from the time a security patch becomes available). Accordingly, the computing devices 106 might be referred to herein as “vulnerable computing devices 106” or “vulnerable devices 106.” In this regard, it is to be appreciated that while the embodiments disclosed herein are described primarily in the context of reimaging server computers, the technologies disclosed herein can be utilized to reimage other types of computing devices in a similar manner. Additional details regarding the operation of the scheduler 102 and the reimaging workflow 104 will be provided below with regard to FIGS. 2A and 2B.
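One way to picture the predetermined patching window mentioned above is a check of how long a security patch has been available. This is a minimal sketch: the name `is_overdue` and the approximation of "one month" as 30 days are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumption: the "one month" example window is approximated as 30 days.
PATCH_WINDOW = timedelta(days=30)


def is_overdue(patch_available_at, now, window=PATCH_WINDOW):
    """Return True if the patch has been available for longer than the window."""
    return now - patch_available_at > window
```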
[0032] In some embodiments, the system 100 also includes a supervisor 110. The supervisor 110 is a software component that monitors state data stored in a data store 108 (which might be referred to herein as the “data store 108” or the “machine state data store 108”) to determine if the reimaging workflow 104 has failed for any computing devices 106. If the workflow 104 has failed for a computing device 106, the supervisor 110 initiates an auto heal job 112 for remediating failure of the re-imaging workflow 104 for the computing device 106.
[0033] The auto heal job 112 determines the reason for the failure of the reimaging workflow 104 and applies one or more actions in an attempt to remediate the failure. If the workflow 104 for re-imaging a computing device 106 fails a predetermined number of times, the supervisor 110 can store state data 114 in the data store 108 that specifies a reason, or reasons, for failure of the reimaging workflow 104, a date upon which the computing device 106 last received a security update, and potentially other information. Additional details regarding the operation of the supervisor 110 and the auto heal job 112 will be provided below with regard to FIGS. 3A-4B.
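The state data stored after repeated failures might be modeled as a small record, as in the following sketch. The `FailureRecord` type, the dictionary standing in for the data store 108, and the limit of three attempts are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import date

MAX_ATTEMPTS = 3  # assumption: the "predetermined number of times"


@dataclass
class FailureRecord:
    device_id: str
    reasons: list        # reason, or reasons, for failure of the workflow
    last_patched: date   # date the device last received a security update


def record_if_exhausted(data_store, device_id, attempts, reasons, last_patched,
                        max_attempts=MAX_ATTEMPTS):
    """Store a FailureRecord once the workflow has failed max_attempts times.

    data_store is a plain dict standing in for the machine state data store.
    Returns the stored record, or None if the limit has not been reached.
    """
    if attempts < max_attempts:
        return None
    record = FailureRecord(device_id, list(reasons), last_patched)
    data_store[device_id] = record
    return record
```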
[0034] The system 100 also includes a vulnerability manager 116 in some embodiments. The vulnerability manager 116 is a software component that can perform various operations with respect to computing devices 106 for which the reimaging workflow 104 has failed a predetermined number of times. For example, and without limitation, the vulnerability manager 116 might shut down computing devices 106 for which the reimaging workflow 104 has failed a predetermined number of times or remove network traffic from the computing devices 106. Additional details regarding the operation of the vulnerability manager are provided below with regard to FIGS. 5A and 5B.
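A rough sketch of how such a component might select devices and assign an operation follows. The `Action` enumeration, the per-device failure counts, and the default of removing network traffic are assumptions; the disclosure does not state how one operation is chosen over the other.

```python
from enum import Enum


class Action(Enum):
    SHUT_DOWN = "shut_down"
    REMOVE_TRAFFIC = "remove_traffic"


def plan_actions(failure_counts, action=Action.REMOVE_TRAFFIC, max_failures=3):
    """Map each device that reached max_failures to the chosen action.

    failure_counts maps a device ID to the number of failed reimaging
    attempts. Both the default action and the threshold are assumed values.
    """
    return {
        device_id: action
        for device_id, failures in sorted(failure_counts.items())
        if failures >= max_failures
    }
```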
[0035] FIG. 2A is a computing architecture diagram illustrating additional aspects of the operation of a scheduler 102 utilized in embodiments disclosed herein. As discussed briefly above, the scheduler 102 is a software component that implements a reimaging workflow 104 for reimaging computing devices 106, such as server computers. As also discussed above, the reimaging workflow 104 can include, among other things, installing security patches that address security vulnerabilities on computing devices 106. The reimaging workflow 104 can also include performing other operations on the computing devices 106 prior or subsequent to the installation of the security patches. Some of these operations will be described below.
[0036] In one embodiment, the scheduler 102 controls and coordinates the operation of software agents 206 (which might be referred to herein as “agents”) that implement steps of the reimaging workflow 104. For instance, the agents 206 might provide functionality for vacating (i.e. removing) virtual machines (“VMs”) from the computing devices 106, performing quality checks on the computing devices 106, installing the security patches, or installing other software components such as operating system, security, or instrumentation components. Quality checks might include, for example, determining if a dynamic host configuration protocol (“DHCP”) reservation having the correct media access control (“MAC”) address is present on a DHCP server, determining whether the serial number of a computing device 106 is the same as that stored in an asset management system, or determining whether the hardware of the computing device 106 is healthy enough for reimaging. Other types of quality checks might also be performed.
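The three example quality checks could be expressed along the following lines. This is a sketch under stated assumptions: the `Device` type, the dictionaries standing in for the DHCP server and the asset management system, and the failure labels are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Device:
    device_id: str
    mac: str
    serial: str
    hardware_healthy: bool


def run_quality_checks(device, dhcp_reservations, asset_serials):
    """Return a list of failed checks; an empty list means all checks passed.

    dhcp_reservations and asset_serials are dicts keyed by device ID that
    stand in for a DHCP server and an asset management system.
    """
    failures = []
    if dhcp_reservations.get(device.device_id) != device.mac:
        failures.append("dhcp_reservation_mismatch")
    if asset_serials.get(device.device_id) != device.serial:
        failures.append("serial_mismatch")
    if not device.hardware_healthy:
        failures.append("hardware_unhealthy")
    return failures
```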
[0037] In order to reimage the computing devices 106, the scheduler 102 can retrieve identifiers (“IDs”) for computing devices 106 that are to be reimaged from a queue 202. The scheduler 102 can then give priority to the reimaging of computing devices 106 that suffer from security vulnerabilities over those that require reimaging for other, non-security-related purposes.
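The prioritization step might be sketched as a stable sort over the retrieved requests, so that security-related reimaging comes first while arrival order is otherwise preserved. The tuple layout and the function name are assumptions; the disclosure does not specify a data structure for the queue 202.

```python
def prioritize(requests):
    """Order device IDs so that security-related reimaging comes first.

    requests is a list of (device_id, is_security_related) tuples in arrival
    order. sorted() is stable, so devices within the same class keep their
    original relative order.
    """
    ordered = sorted(requests, key=lambda request: not request[1])
    return [device_id for device_id, _ in ordered]
```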
[0038] The scheduler 102 can also instantiate software agents 206 in order to implement the steps of the reimaging workflow 104 for the prioritized computing devices 106. As the agents 206 perform their respective tasks, the agents 206 can provide status messages 208 to the scheduler 102 that include state data 204 describing various aspects of the status of the reimaging workflow 104. For instance, the state data 204 might indicate that a step in the reimaging workflow 104 has started, completed, or failed.
[0039] In turn, the scheduler 102 stores the state data 204 in an appropriate data store 108 for use by itself and other components, some of which are described below. The scheduler 102 can also instantiate other agents 206 based on the state data 204. For instance, the scheduler 102 might start a new agent 206 when another agent 206 indicates that it has completed its operation successfully.
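The handling of a status message could look roughly like the sketch below: the scheduler records the reported state and, when a step completes, determines which agent to start next. The step names, their order, and the status strings are hypothetical, and a dict stands in for the data store 108.

```python
# Assumption: an illustrative, fixed ordering of workflow steps.
WORKFLOW_STEPS = [
    "vacate_vms",
    "quality_checks",
    "install_patches",
    "install_instrumentation",
]


def handle_status(data_store, device_id, step, status):
    """Record the reported status and return the next step to start, if any.

    Returns None when the step has not completed successfully or when the
    completed step was the last one in the workflow.
    """
    data_store.setdefault(device_id, {})[step] = status
    if status != "completed":
        return None
    next_index = WORKFLOW_STEPS.index(step) + 1
    if next_index < len(WORKFLOW_STEPS):
        return WORKFLOW_STEPS[next_index]
    return None
```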
[0040] FIG. 2B is a flow diagram showing a routine 250 that illustrates aspects of the operation of the scheduler 102 shown in FIGS. 1 and 2A, according to one embodiment disclosed herein. It should be appreciated that the logical operations described herein with regard to FIG. 2B, and the other FIGS., can be implemented (1) as a sequence of computer implemented acts or program modules running on a computing device and/or (2) as interconnected machine logic circuits or circuit modules within a computing device.
[0041] The particular implementation of the technologies disclosed herein is a matter of choice dependent on the performance and other requirements of the computing device. Accordingly, the logical operations described herein are referred to variously as states, operations, structural devices, acts, or modules. These states, operations, structural devices, acts and modules can be implemented in hardware, software, firmware, in special-purpose digital logic, and any combination thereof. It should be appreciated that more or fewer operations can be performed than shown in the FIGS. and described herein. These operations can also be performed in a different order than those described herein. It should also be understood that the methods described herein can be ended at any time and need not be performed in their entireties.
[0042] Some or all operations of the methods described herein, and/or substantially equivalent operations, can be performed by execution of computer-readable instructions included on a computer-storage media, as defined below. The term “computer-readable instructions,” and variants thereof, as used in the description and claims, is used expansively herein to include routines, applications, application modules, program modules, programs, components, data structures, algorithms, and the like. Computer-readable instructions can be implemented on various system configurations, including single-processor or multiprocessor systems, minicomputers, mainframe computers, personal computers, hand-held computing devices, microprocessor-based, programmable consumer electronics, combinations thereof, and the like.
[0043] Thus, it should be appreciated that the logical operations described herein are implemented (1) as a sequence of computer implemented acts or program modules running on a computing system and/or (2) as interconnected machine logic circuits or circuit modules within the computing system. The implementation is a matter of choice dependent on the performance and other requirements of the computing system. Accordingly, the logical operations described herein are referred to variously as states, operations, structural devices, acts, or modules. These operations, structural devices, acts, and modules might be implemented in software, in firmware, in special purpose digital logic, and any combination thereof.
[0044] As described herein, in conjunction with the FIGS. described herein, the operations of the routine 250 are described herein as being implemented, at least in part, by an application, component, and/or circuit. Although the following illustration refers to the components of FIGS. 1 and 2A, it can be appreciated that the operations of the routine 250 might be also implemented in many other ways. For example, the routine 250 might be implemented, at least in part, by a computer processor or a processor or processors of another computer. In addition, one or more of the operations of the routine 250 might alternatively or additionally be implemented, at least in part, by a computer working alone or in conjunction with other software modules.
[0045] For example, the operations of routine 250 (and the operations illustrated in the other FIGS.) are described herein as being implemented, at least in part, by an application, component, and/or circuit, which are generically referred to herein as modules. In some configurations, the modules can be a dynamically linked library (“DLL”), a statically linked library, functionality produced by an application programming interface (“API”), a compiled program, an interpreted program, a script or any other executable set of instructions. Data and/or modules, such as the data and modules disclosed herein, can be stored in a data structure in one or more memory components. Data can be retrieved from the data structure by addressing links or references to the data structure.
[0046] The routine 250 begins at operation 252, where the scheduler 102 retrieves IDs for the computing devices 106 that are to be reimaged. The routine 250 then proceeds from operation 252 to operation 254, where the scheduler 102 instantiates the software agents 206 to implement the reimaging workflow 104 in the manner described above.
[0047] From operation 254, the routine 250 proceeds to operation 256, where the agents 206 perform their respective tasks and return status messages that include state data 204 to the scheduler 102. As discussed above, the state data 204 describes various aspects of the status of the reimaging workflow 104. For instance, the state data 204 might indicate that a step in the reimaging workflow 104 has started, completed, or failed.
[0048] The scheduler 102 stores the state data 204 received from the agents 206 in the data store 108 at operation 258. Other components in the system 100 can utilize the state data 204 in various ways, some of which will be described in detail below. From operation 258, the routine 250 proceeds back to operation 252, where the scheduler 102 can continue to reimage computing devices 106 in the disclosed manner.
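Operations 252 through 258 can be pictured as the loop below. It is a simplified sketch rather than the routine itself: agents are modeled as plain callables that return a status string, a dict stands in for the data store 108, and the loop ends when the queue is empty instead of returning to operation 252 indefinitely.

```python
from collections import deque


def run_routine_250(queue, agents, data_store):
    """Process queued device IDs with a sequence of (name, agent) pairs.

    Each agent is a callable taking a device ID and returning a status
    string such as "completed" or "failed" (assumed values).
    """
    while queue:
        device_id = queue.popleft()                      # operation 252
        for name, agent in agents:                       # operation 254
            status = agent(device_id)                    # operation 256
            data_store.setdefault(device_id, []).append((name, status))  # operation 258
            if status == "failed":
                break  # later steps are not started for this device
```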
[0049] FIG. 3A is a computing architecture diagram illustrating aspects of the operation of a supervisor 110 utilized in embodiments disclosed herein. As discussed briefly above with regard to FIG. 1, supervisor 110 is a software component that monitors state data 204 stored in a data store 108 to determine if the reimaging workflow 104 has failed for any computing devices 106. If the workflow 104 has failed for a computing device 106, the supervisor 110 initiates an auto heal job 112 for remediating failure of the re-imaging workflow 104 for the particular computing device 106.
[0050] The auto heal job 112 is a software component that determines the reason for the failure of the reimaging workflow 104 for a computing device 106 based upon the state data 204 stored in the data store 108. The auto heal job 112 can then apply one or more actions in an attempt to remediate the identified failure. For example, and without limitation, an auto heal job 112 might determine that the reimaging workflow 104 for a device 106 failed because a DHCP reservation with a valid MAC address for the device 106 is not present on a DHCP server. This might occur, for instance, if the device 106 was repaired and a network adapter with a different MAC address was installed in the machine. In this case, the auto heal job 112 can remediate the failure by updating the DHCP reservation with the correct MAC address at the DHCP server.
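The DHCP example above might be sketched, purely for illustration, as a single remediation function. The function name, the reason string `"dhcp_reservation_invalid"`, and the two dictionaries standing in for the DHCP server's reservations and for a hardware inventory are hypothetical; a real deployment would call the management interface of its own DHCP server.

```python
from typing import Dict


def heal_dhcp_reservation(
    device_id: str,
    failure_reason: str,
    reservations: Dict[str, str],
    inventory_macs: Dict[str, str],
) -> bool:
    """Return True if the failure was remediated, False otherwise."""
    # Only handle the failure this remediation is meant for (assumed reason code).
    if failure_reason != "dhcp_reservation_invalid":
        return False
    # Look up the MAC address of the adapter currently installed in the device.
    actual_mac = inventory_macs.get(device_id)
    if actual_mac is None:
        return False
    # Update the reservation so that it carries the correct MAC address.
    reservations[device_id] = actual_mac
    return True
```

The boolean result corresponds to the success or failure of the auto heal job that the supervisor later inspects.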
[0051] If the auto heal job 112 is successful, the reimaging workflow 104 for the computing device 106 can be resumed at the point in the workflow 104 at which it terminated due to the failure. In this way, the reimaging workflow 104 does not need to be restarted following a failure, thereby conserving the utilization of computing resources.
[0052] If the auto heal job 112 fails, a variable indicating the number of retries for the particular computing device 106 can be incremented. The scheduler 102 can then retry the workflow 104 for re-imaging the computing device 106 after a predefined period of time has elapsed. In this way, failures in the workflow 104 caused by conditions remediated during the predefined period of time will not occur again on a subsequent retry. For instance, a VM might be manually vacated from a device 106 during the predefined period of time, thereby enabling the workflow 104 to complete properly on a subsequent attempt. This process can be repeated until the retry count indicates that the workflow 104 has been attempted a predetermined number of times.
[0053] If the workflow 104 for re-imaging a computing device 106 fails a predetermined number of times, the supervisor 110 can store state data 114 in the data store 108 that specifies a reason, or reasons, for failure of the reimaging workflow 104, a date upon which the computing device 106 last received a security update, and potentially other information. Details regarding the utilization of this data will be described below with regard to FIGS. 4A and 4B.
[0054] FIG. 3B is a flow diagram showing a routine 350 that illustrates aspects of the operation of the supervisor 110 shown in FIGS. 1 and 3A, according to one embodiment disclosed herein. The routine 350 begins at operation 352, where the supervisor 110 monitors the state data 204 stored in the data store 108 to identify devices 106 for which the reimaging workflow 104 has failed. The routine 350 then proceeds from operation 352 to operation 354, where the supervisor 110 invokes the auto heal job 112 for a failing device 106 in an attempt to remediate the failure of the reimaging workflow 104.
[0055] From operation 354, the routine 350 proceeds to operation 356, where the supervisor 110 determines whether the auto heal job 112 was successful. If so, the routine 350 proceeds from operation 356 to operation 358, where the workflow 104 can resume at the point at which it originally failed. The routine 350 then proceeds from operation 358 to operation 352, where the process described above can be repeated.
[0056] If the auto heal job 112 was not successful, the routine 350 proceeds from operation 356 to operation 360, where the retry count for the computing device 106 for which the workflow 104 failed is incremented. The routine 350 then proceeds from operation 360 to operation 362, where the scheduler 102 retries the reimaging workflow 104 after a predetermined period of time has elapsed since the last attempt. The routine 350 then proceeds from operation 362 to operation 352, where the process described above can be repeated.
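As a non-limiting sketch, the decision made in operations 354 to 362 for a single failing device might look like the following. The constants `MAX_ATTEMPTS` and `RETRY_DELAY_SECONDS`, the `DeviceState` class, and the returned action tuples are assumptions for illustration. The `"escalate"` branch reflects the retry limit described above with regard to FIG. 3A rather than an operation shown in routine 350.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, Union

MAX_ATTEMPTS = 3             # hypothetical "predetermined number of times"
RETRY_DELAY_SECONDS = 3600   # hypothetical "predefined period of time"


@dataclass
class DeviceState:
    device_id: str
    failed_step: str
    retry_count: int = 0


def supervise(
    state: DeviceState,
    auto_heal: Callable[[DeviceState], bool],
    now: float,
) -> Tuple[str, Union[str, int, float]]:
    """Decide the next action for a device whose workflow has failed."""
    if auto_heal(state):                         # operations 354 and 356
        return ("resume", state.failed_step)     # operation 358: resume at failure point
    state.retry_count += 1                       # operation 360
    if state.retry_count >= MAX_ATTEMPTS:
        return ("escalate", state.retry_count)   # retry limit reached
    return ("retry_at", now + RETRY_DELAY_SECONDS)  # operation 362
```

Returning the resume point rather than restarting the workflow mirrors the resource savings noted above, since completed steps are not repeated.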
[0057] FIG. 4A is a computing architecture diagram illustrating additional aspects of the operation of the supervisor 110 shown in FIGS. 1 and 3A. As discussed briefly above, if the reimaging workflow 104 fails a predetermined number of times, the supervisor 110 can store state data 114 in the data store 108 that specifies a reason, or reasons, for failure of the reimaging workflow 104, a date 402 upon which the computing device 106 last received a security update, and potentially other information.
[0058] FIG. 4B is a flow diagram showing a routine 450 that illustrates additional aspects of the operation of the supervisor 110 shown in FIGS. 1, 3A, and 4A, according to one embodiment disclosed herein. The routine 450 begins at operation 452, where the supervisor 110 determines if the reimaging workflow 104 has failed a predetermined number of times for a particular computing device 106.
[0059] If the reimaging workflow 104 has failed a predetermined number of times for a particular computing device 106, the routine 450 proceeds from operation 454 to operation 456, where the supervisor 110 stores state data 114 in the data store 108 indicating that the workflow 104 for the device has failed and that specifies a reason, or reasons, for failure of the reimaging workflow 104. The routine 450 then proceeds from operation 456 to operation 458, where the supervisor 110 obtains the date upon which the computing device 106 last received a security update and also stores this information in the data store 108. As will be described in greater detail below with regard to FIGS. 5A and 5B, a vulnerability manager 116 can utilize this data to initiate various types of operations with regard to devices 106 for which the reimaging workflow 104 has failed a predetermined number of times.
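The storing of state data 114 in routine 450 might be sketched as follows, for illustration only. The function name, the dictionary keys, and the use of a plain dictionary in place of the data store 108 are hypothetical.

```python
from datetime import date
from typing import Dict, List


def record_exhausted_device(
    device_id: str,
    retry_count: int,
    max_attempts: int,
    reasons: List[str],
    last_security_update: date,
    data_store: Dict[str, dict],
) -> bool:
    """Store a failure record once the retry limit has been reached."""
    # Nothing is stored until the workflow has failed the predetermined number of times.
    if retry_count < max_attempts:
        return False
    data_store[device_id] = {
        "workflow_failed": True,
        "reasons": list(reasons),                      # operation 456
        "last_security_update": last_security_update,  # operation 458
    }
    return True
```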
[0060] FIG. 5A is a computing architecture diagram illustrating aspects of the operation of the vulnerability manager 116, according to one embodiment disclosed herein. As discussed briefly above, the vulnerability manager 116 is a software component that monitors the state data stored in the data store 108 to identify computing devices 106 for which the reimaging workflow 104 has failed more than a predetermined number of times.
[0061] The vulnerability manager 116 can also monitor security compliance deadlines and generate trouble tickets, if necessary. For example, the security compliance deadlines might specify an amount of time within which security patches are to be applied to computing devices 106. Trouble tickets or other types of notifications can then be generated that are directed to engineering personnel for those devices 106 that have not been patched within the specified amount of time.
[0062] The vulnerability manager 116 might also, or alternately, utilize machine learning (“ML”) or other techniques to identify patterns (e.g. multiple instances of the reimaging workflow 104 failing for the same reason) based upon the state data 204 and initiate various actions based upon the identified patterns such as, for example, creating trouble tickets, generating alerts, or other types of reporting. The proper engineering team can then take action to remediate the failure such as, for example, shutting down the computing devices 106 or removing network traffic from the computing devices 106.
[0063] Once the failure of the reimaging workflow 104 for a computing device 106 has been remediated, a security compliance report 502 will be generated indicating that the computing device 106 is no longer vulnerable. When the security compliance report 502 indicates that a device 106 is no longer vulnerable, the vulnerability manager 116 will update the state data 204 for that computing device 106 to indicate that it has been returned to its normal functioning state.
[0064] FIG. 5B is a flow diagram showing a routine 550 that illustrates additional aspects of the operation of the vulnerability manager shown in FIG. 5A, according to one embodiment disclosed herein. The routine 550 begins at operation 552, where the vulnerability manager 116 monitors security compliance deadlines and generates trouble tickets if necessary, such as, for example, when a device 106 has not been patched within a specified amount of time (e.g. 30 days from the time a patch is released). The routine 550 then proceeds from operation 552 to operation 554.
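The deadline check at operation 552 might be sketched as follows, using the 30-day example above. The function name and the rule that a device counts as patched when its last patch date is on or after the patch release date are assumptions made for this sketch.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def overdue_devices(
    patch_release: date,
    last_patched: Dict[str, Optional[date]],
    today: date,
    window_days: int = 30,
) -> List[str]:
    """Return IDs of devices still unpatched after the compliance window."""
    deadline = patch_release + timedelta(days=window_days)
    if today <= deadline:
        return []  # the compliance window has not yet closed
    # Assumed rule: a device is unpatched if it was never patched or was
    # last patched before the patch in question was released.
    return sorted(
        device_id
        for device_id, patched_on in last_patched.items()
        if patched_on is None or patched_on < patch_release
    )
```

Each returned ID would be a candidate for a trouble ticket or other notification.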
[0065] At operation 554, the vulnerability manager 116 utilizes ML or other technologies to analyze and identify patterns of failure for devices 106 for which the reimaging workflow 104 has failed. If patterns of failure can be identified, the vulnerability manager 116 can generate trouble tickets or other types of notifications to an engineering team that identify the patterns and suggest actions for remedying the failures.
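As one of the "other technologies" contemplated above, and not an ML technique, a simple frequency count can surface the example pattern of multiple devices failing for the same reason. The function name, the `(device_id, reason)` input format, and the `min_devices` threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def find_failure_patterns(
    failures: Iterable[Tuple[str, str]],
    min_devices: int = 2,
) -> Dict[str, int]:
    """Map each failure reason to the number of distinct devices it affected."""
    # De-duplicate so that repeated failures of one device count only once.
    counts = Counter(reason for _device_id, reason in set(failures))
    # Keep only reasons shared by at least `min_devices` devices.
    return {reason: n for reason, n in counts.items() if n >= min_devices}
```

A reason that appears in the result could then be attached to a single trouble ticket covering all affected devices.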
……
……
……

