Single Person Brings Down Cloud Computing Cluster
According to a recent PC World report, a cloud provider experienced an extended outage when a single administrator inadvertently enacted a change that involved rebooting all of the virtual machines in the organization's East Coast data center. Speaking to a variety of news outlets, a representative of the company explained what happened and what the cloud provider is doing about it moving forward. The overall consensus of these comments was that while human error was the direct cause, there are also secondary issues in play that contributed to the outage.
According to PC World, the representative explained that the reboot began shortly after the incident, but the strain the process puts on the system is forcing a careful, measured approach to getting virtual machines running again. As a result, recovery is taking a long time. The organization plans to take a closer look at all of the factors that contributed to the outage and the slow recovery once everything is back online.
Three ITSM Tools that Help Eliminate Human Error
Mistakes happen, and there is nothing to be gained by trying to embarrass the organizations or individuals involved. However, there is a lot to be learned from these mistakes. This outage makes it clear that human error is a major threat in complex IT environments, whether cloud or non-cloud, and organizations need to develop methods to avoid such problems.
A few IT service management tools that can be invaluable in this area include:
- Change management: A change management platform can help organizations schedule and coordinate change operations more effectively. This introduces oversight and visibility, making it easier to avoid human error because each change is carefully scheduled and, as part of that process, checked as a reliable alteration to the IT setup before it is carried out.
- CMDB: A CMDB solution doesn't just show you everything that is in the IT configuration; it shows you how those systems relate. As such, IT workers can see how making an alteration in one area will impact other parts of the setup. This can reduce human error by helping users understand the full implications of any change they make and recognize when one small adjustment may lead to a major problem.
- Built-in authorizations: Advanced ITSM systems can also build authorizations and notifications into CMDB and change management solutions. These authorizations can, for example, take the form of messages sent to supervisors when a system admin wants to make a change. The supervisor can then look into the change management and CMDB consoles, verify that the change will not have any negative consequences and approve the effort.
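To make the interplay between these three tools more concrete, here is a minimal Python sketch. It is not based on any particular ITSM product: the `DEPENDENTS` map, the `ChangeRequest` class, the item names and the `max_unapproved_impact` threshold are all hypothetical, chosen only to illustrate how a CMDB-style dependency lookup could feed an approval check before a change goes ahead.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical CMDB data: each configuration item maps to the items
# that depend on it directly. All names are made up for illustration.
DEPENDENTS = {
    "hypervisor-east": ["vm-web-1", "vm-db-1"],
    "vm-web-1": ["storefront"],
    "vm-db-1": ["billing-service"],
    "storefront": [],
    "billing-service": [],
}


def impacted_items(cmdb, target):
    """Return every item that depends on target, directly or indirectly."""
    seen = set()
    queue = deque([target])
    while queue:
        item = queue.popleft()
        for dependent in cmdb.get(item, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)


@dataclass
class ChangeRequest:
    """A hypothetical change record: what is changed and who signed off."""
    target: str
    description: str
    approved_by: set = field(default_factory=set)


def may_proceed(change, cmdb, max_unapproved_impact=1):
    """Let low-impact changes through; otherwise require a sign-off.

    The threshold is an assumed policy knob, not an industry standard.
    """
    impact = impacted_items(cmdb, change.target)
    if len(impact) <= max_unapproved_impact:
        return True
    return len(change.approved_by) > 0


reboot = ChangeRequest("hypervisor-east", "reboot host")
print(impacted_items(DEPENDENTS, reboot.target))
print(may_proceed(reboot, DEPENDENTS))   # blocked until someone approves
reboot.approved_by.add("supervisor")
print(may_proceed(reboot, DEPENDENTS))   # allowed after sign-off
```

In this sketch, a change that touches a host with many downstream dependents is held until a supervisor signs off, while a change affecting at most one other item proceeds on its own. A real deployment would likely use richer policies, such as maintenance windows or multiple approvers.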
Eradicating human error in its entirety may be impossible, but businesses can work to minimize these issues by investing in ITSM tools that focus on change processes and introduce a series of checks and balances that help users avoid errors.