Due to all the coding I’ve been doing lately, I haven’t had much time to write and update this blog.
Aside from coding in my free time, I’ve also been very busy at work dealing with an elevated number of system outages in our server environment. These outages are the subject of today’s post.
Recently we’ve seen outages in our Cisco UCS environment nearly weekly, each causing a production outage lasting an average of five to six hours.
As a young engineer in the server environment, these outages have been my first exposure to what a massive outage can mean to a business. They’ve shed even further light on the business’s need for a STABLE server environment, since customers and the business ultimately rely on these servers to keep their day-to-day functions running without issue.
While these outages are negative events, they’ve served the positive function of letting me enhance my technical skills and learn more about how upper management handles an outage by interfacing with vendors and with the operational teams tasked with keeping the systems online.
Watching upper management deal with these outages has opened my eyes to what makes a manager good or bad at working a critical outage. We typically open telephone bridges with one to four upper-level managers on the line, and simply by listening in on these conversations I’ve been able to take mental notes of the big-ticket items a manager focuses on during an outage and of the problems that caused the outage in the first place.
One of my biggest takeaways is that staying calm during an outage is among the most valuable traits a manager can have. A manager may not know the technical details of the outage, but he should understand how the affected system supports the services that depend on it. That valuable calmness on a bridge usually comes from the managers who understand the true business impact and the options for getting the systems back online in a timely manner. This was demonstrated recently during one of the largest outages we’ve faced, when a manager focused on how many operations resources we might need, if and when the system recovered, to get the servers back online.
Prior to hearing that comment, my managerial mindset hadn’t even considered staffing, but it made me realize that the resources we had on hand wouldn’t have been able to get the servers online in a timely manner.
The best managers have also been able to apply the appropriate amount of pressure to the vendors (in highly professional ways) so that they recognize how important it is to our company that they fix the flaws in their firmware and software code and make sure the outages don’t happen again.
I imagine that as time progresses and I continue to troubleshoot and support more system outages, I will gain even more management knowledge to help ensure these don’t happen again. My goal is to take these lessons with me to my own ventures and to my management career! While the outages are terrible, the lessons I take from them are second to none.