An IT service outage can be an enormous problem for any organization. When services go down, it can prevent employees from completing work and customers from completing transactions—bringing the business to a grinding halt.
The cost of IT downtime varies significantly with the size and nature of the organization. For example, Atlassian estimated that the average cost of downtime for a small to midsize business (SMB) was between $137 and $427 per minute. The Ponemon Institute, by contrast, estimated the average cost of downtime across all businesses at $9,000 per minute. That figure includes large multinational conglomerates, which helps explain why it is so much higher.
Even at the low-end figure of $137 per minute, IT downtime quickly becomes expensive. A single day of downtime (1,440 minutes) would cost $197,280.
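To make the arithmetic easy to rerun with your own numbers, here is a minimal Python sketch. The per-minute rates are the estimates quoted above; the function name and dictionary labels are illustrative choices, not part of any cited study.

```python
# Per-minute downtime cost estimates quoted in this article (USD).
COST_PER_MINUTE = {
    "smb_low": 137,
    "smb_high": 427,
    "all_businesses": 9_000,
}

MINUTES_PER_DAY = 24 * 60  # 1,440 minutes


def downtime_cost(minutes: int, cost_per_minute: int) -> int:
    """Return the total cost of an outage lasting `minutes` minutes."""
    if minutes < 0 or cost_per_minute < 0:
        raise ValueError("minutes and cost_per_minute must be non-negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    for label, rate in COST_PER_MINUTE.items():
        total = downtime_cost(MINUTES_PER_DAY, rate)
        print(f"{label}: ${total:,} for one day of downtime")
```

Swapping in your own per-minute estimate and a realistic outage length gives a rough figure to weigh against the cost of prevention.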
When it comes to downtime, an ounce of prevention can be worth a pound of cure. This means understanding the causes of IT downtime and how to avoid them. In this article, we’ll discuss single points of failure (SPOFs)—one of the most common issues in IT failures.
You might be wondering “what does a single point of failure mean?” In IT management, a single point of failure is any component in an IT infrastructure or resource that, if it fails, would render the infrastructure or resource unusable. This leads to IT downtime—preventing intended users from being able to access the resources they need.
The sheer complexity of modern IT solutions can lead to the creation of single points of failure in IT networks without anyone realizing it. It’s an old engineering maxim—the more complicated a system is, the more risk there is of something going wrong. When that something is a single point of failure in the system, the system goes down.
Before you can know how to avoid single point of failure issues in your IT, it’s necessary to know where these failure points occur.
Single points of failure can exist in hardware or software. They can occur because of internal or external factors. They can cause issues as small as a single app or program not working or as major as an entire data center being unavailable to all users.
Some common examples of single points of failure include:
Organizations often use multiple software solutions from different vendors for their business workflows. However, when two solutions don’t have a native integration, the organization needs to create one to allow the two programs to interface.
These custom application programming interfaces (APIs) can be created by one of the two vendors or by the organization’s own IT team. However, they may not always work consistently.
Why? When one vendor makes sweeping changes to its software, the custom API may not account for them. So, after each update to either product, the API needs to be tested and, if necessary, modified to ensure that it still works correctly as an interface between the two solutions.
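One way to catch this kind of breakage early is a small check that runs after every vendor update and compares a sample response against the fields your integration depends on. The sketch below is only an illustration: the field names, types, and sample payloads are hypothetical, and a real check would use data pulled from the actual systems.

```python
# Hypothetical fields the custom integration relies on, with expected types.
REQUIRED_FIELDS = {"order_id": str, "amount": float, "currency": str}


def contract_violations(payload: dict, required: dict) -> list[str]:
    """List every way `payload` deviates from the fields the integration expects."""
    problems = []
    for field, expected_type in required.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            actual = type(payload[field]).__name__
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, got {actual}"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical payloads before and after a vendor update.
    before_update = {"order_id": "A1", "amount": 9.5, "currency": "USD"}
    after_update = {"id": "A1", "amount": "9.50", "currency": "USD"}

    for name, payload in (("before", before_update), ("after", after_update)):
        issues = contract_violations(payload, REQUIRED_FIELDS)
        print(name, "OK" if not issues else issues)
```

An empty list means the sample still matches what the integration expects. Anything else is a signal to fix the API before users hit the problem.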
Servers, the physical hardware used to host software and data, can experience unexpected component failures. Whether a failure is caused by a power surge, lack of maintenance, a cyberattack, or sudden physical damage to the hardware, the result is the same: any data stored on the server and any programs it runs are no longer available for use.
So, if you have mission-critical software or data that only exists on a single server, that may prove to be a single point of failure that you need to address.
If you have multiple servers in a single data center or spread across multiple data centers, then the next most common point of failure would be the network routers and load balancers that you use to direct traffic to each server.
A load balancer is a tool that helps organizations spread traffic loads across multiple resources so that no single server or database is overloaded. For example, let’s say you have ten servers with identical resources that can each handle 10,000 simultaneous users, and you have an average of 50,000 simultaneous users during peak traffic times. That would be five times as much as any single server could handle, but only half of your maximum capacity when spread across all ten servers.
If the load balancer malfunctions and starts sending all of your traffic to a single server, it will overload that server. Users would then be unable to access resources on that server (or would experience significant latency), even though you would technically still be at only half of your total traffic capacity.
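The numbers in this example can be checked with a short Python sketch. It is not how a real load balancer works internally; it is a simplified model, assuming traffic is split in proportion to a per-server weight, and the function names are illustrative.

```python
SERVERS = 10
CAPACITY_PER_SERVER = 10_000  # simultaneous users each server can handle
PEAK_USERS = 50_000           # average simultaneous users at peak


def per_server_load(total_users: int, weights: list[int]) -> list[int]:
    """Split users across servers in proportion to their weights.

    Integer division is used, so a few users may be dropped by rounding
    when the split is uneven. That is fine for a capacity estimate.
    """
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one server must have a positive weight")
    return [total_users * weight // total_weight for weight in weights]


def overloaded(loads: list[int], capacity: int) -> list[int]:
    """Return the indexes of servers whose load exceeds capacity."""
    return [index for index, load in enumerate(loads) if load > capacity]


if __name__ == "__main__":
    # Healthy balancer: every server gets an equal share.
    healthy = per_server_load(PEAK_USERS, [1] * SERVERS)
    # Malfunctioning balancer: all traffic is sent to server 0.
    broken = per_server_load(PEAK_USERS, [1] + [0] * (SERVERS - 1))

    print("healthy loads:", healthy, "overloaded:", overloaded(healthy, CAPACITY_PER_SERVER))
    print("broken loads: ", broken, "overloaded:", overloaded(broken, CAPACITY_PER_SERVER))
```

In the healthy case each server carries 5,000 users, half of its capacity. In the broken case a single server is asked to carry all 50,000 while the other nine sit idle.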
Cloud-based IT resources and other remote technology resources are dependent on having consistent internet access. This creates a potential external single point of failure for these IT assets.
If an internet service provider (ISP) suffers technical issues, a local fiber-optic network cable gets cut, or a natural disaster causes widespread damage to network infrastructure, that could disrupt the internet service the IT asset requires to communicate with users.
Here, using multiple data centers spread across a wide geographic area could help limit your risk. If one data center loses internet access, the other would still be available. For example, you could have one data center in Washington state and another in Texas, which makes it far less likely that a single disaster takes both offline at the same time.
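A rough way to see why a second site helps is to estimate how often every site would be down at once. The sketch below assumes each site fails independently of the others, which is the point of spreading them geographically, and it uses a hypothetical 99% availability per site purely for illustration. Real sites rarely fail with perfect independence, so treat the result as an upper bound on the benefit.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(per_site_availability: float, sites: int) -> float:
    """Chance that at least one site is up, assuming independent failures."""
    if not 0.0 <= per_site_availability <= 1.0:
        raise ValueError("per_site_availability must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    all_down = (1.0 - per_site_availability) ** sites
    return 1.0 - all_down


if __name__ == "__main__":
    PER_SITE = 0.99  # hypothetical figure, not a measured value
    for sites in (1, 2):
        availability = combined_availability(PER_SITE, sites)
        downtime = (1.0 - availability) * MINUTES_PER_YEAR
        print(f"{sites} site(s): {availability:.4%} available, "
              f"about {downtime:,.0f} minutes down per year")
```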
The above examples are just a few of the potential SPOFs that can exist in an IT infrastructure. The question is: how can you identify single points of failure in your IT? An IT audit can be a good place to start.
Conducting an audit of your IT resources and evaluating them for potential single points of failure can be invaluable for any business. By proactively finding these vulnerabilities, you can take the most economical path to resolving them or at least minimizing their impact if they do cause an IT failure.
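Part of such an audit can be sketched in code. If you record which components pass traffic to which, you can test each component by pretending it has failed and checking whether users can still reach the resource they need. The example below is a minimal sketch under that assumption: the topology and component names are hypothetical, and a real audit would also cover power, people, vendors, and other dependencies that are hard to capture in a simple map.

```python
from collections import deque

# Hypothetical topology: each component lists the components it forwards traffic to.
TOPOLOGY = {
    "users": ["isp_a", "isp_b"],
    "isp_a": ["load_balancer"],
    "isp_b": ["load_balancer"],
    "load_balancer": ["server_1", "server_2"],
    "server_1": ["database"],
    "server_2": ["database"],
    "database": [],
}


def reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal, treating `removed` as failed."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in graph.get(node, []):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(graph, start, goal):
    """Return intermediate components whose failure alone cuts start off from goal."""
    candidates = [node for node in graph if node not in (start, goal)]
    return [node for node in candidates
            if not reachable(graph, start, goal, removed=node)]


if __name__ == "__main__":
    print(single_points_of_failure(TOPOLOGY, "users", "database"))
```

In this made-up topology the ISPs and servers are duplicated, so the only intermediate component flagged is the load balancer. The database itself is the target of the check, so it is not listed, but an unreplicated database is a single point of failure in its own right.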
Here’s a quick outline of the process:
Need help eliminating single points of failure in your IT infrastructure? Reach out to IT Proactive to start improving your IT today!