This article appears on the blog of Continuity Software, Inc.

The Top 10 Cluster Availability Risks
By Roy Goffer, director of online marketing, Continuity Software

IT systems exist in a highly complex, fluid and dynamic framework. HA systems are particularly vulnerable to the misconfigurations and drift caused by day-to-day maintenance across the cluster infrastructure. In this post, we will examine the top 10 cluster availability risks found in datacenters worldwide: what causes them, what impact they have on the IT infrastructure, and what the possible solutions are.

Here are the top 10 cluster availability risks.

#1 Incomplete Storage Access by Cluster Nodes
If you experience this problem, it is because a SAN volume that was added to the active cluster node was mistakenly not configured on the passive node of that cluster. The symptom you typically see is a SAN volume that is inaccessible to the passive cluster node, which means a failover to that node cannot bring the service up. Fortunately, this problem is easily resolved by mapping the SAN volume to the passive cluster node as well.
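
A check for this risk can be as simple as comparing the volume mappings of the two nodes. Below is a minimal Python sketch; the LUN identifiers are hypothetical sample data standing in for whatever your storage array's management interface reports, not Continuity Software's actual method.

# Hypothetical LUN mappings per node, as they might be collected
# from the storage array's management interface.
active_node_luns = {"lun-0a", "lun-0b", "lun-0c"}
passive_node_luns = {"lun-0a", "lun-0b"}

# Any LUN visible only to the active node will break a failover.
for lun in sorted(active_node_luns - passive_node_luns):
    print(f"{lun} is not mapped to the passive node")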

#2 SAN Fabric With Single Point of Failure
If you experience this issue, it is likely the result of coordination problems among different teams, or simply human error. It is difficult to maintain path redundancy across multiple layers such as switch networks, ports and arrays. The problem you will likely see is a single point of failure on the I/O path from the server to a storage array, which can result in an outage and increased downtime. The problem can be resolved by configuring multiple I/O paths that use different SAN switches, array ports and FC adapters.
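
A sketch of a redundancy check over such a path inventory, assuming a hypothetical data model of (HBA, switch, array port) triples per LUN:

# Hypothetical path inventory: for each LUN, the (HBA, switch, array port)
# triples that make up its I/O paths.
paths = {
    "lun-01": [("hba0", "switch-A", "port-1"), ("hba1", "switch-B", "port-5")],
    "lun-02": [("hba0", "switch-A", "port-1"), ("hba0", "switch-A", "port-2")],
}

for lun, lun_paths in paths.items():
    # Redundancy requires more than one distinct element at every layer.
    for layer, name in enumerate(("HBA", "SAN switch", "array port")):
        if len({p[layer] for p in lun_paths}) < 2:
            print(f"{lun}: single point of failure at the {name} layer")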

#3 Business Concentration Risk in Private Cloud
This problem can be the result of VMs migrating from one host to another: VMs that belong to the same business service may end up running on the same physical host, creating a single point of failure. This can potentially shut down critical business services, as you certainly don’t want all your proverbial eggs in one basket. The resolution is simple enough: keep the VMs of each critical service spread across multiple hosts, for example with anti-affinity rules.
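
A minimal concentration check, assuming a hypothetical VM placement inventory read from the hypervisor:

from collections import defaultdict

# Hypothetical placement inventory: (business service, VM, physical host).
placement = [
    ("billing", "billing-vm1", "esx-03"),
    ("billing", "billing-vm2", "esx-03"),
    ("crm", "crm-vm1", "esx-01"),
    ("crm", "crm-vm2", "esx-02"),
]

hosts_per_service = defaultdict(set)
for service, _vm, host in placement:
    hosts_per_service[service].add(host)

for service, hosts in hosts_per_service.items():
    if len(hosts) < 2:
        print(f"service '{service}' runs entirely on {hosts.pop()}")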

#4 Erroneous Cluster Configuration
This can be the result of a change to the mount point directory of an existing file system on the active node. If the change is not repeated on the passive node, an erroneous cluster configuration arises: the mount point directory does not exist on all of the cluster nodes, so the file system cannot be mounted after a failover. To rectify this problem, simply create the missing directory on the passive cluster node.
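
A sketch of the corresponding check, with hypothetical node names and directory listings gathered from each node:

# Hypothetical data: the mount points defined in the cluster configuration,
# and the directories that actually exist on each node.
cluster_mount_points = ["/oradata", "/oralogs", "/app"]
dirs_on_node = {
    "node-a": {"/oradata", "/oralogs", "/app"},
    "node-b": {"/oradata", "/app"},
}

for node, dirs in dirs_on_node.items():
    for mount_point in cluster_mount_points:
        if mount_point not in dirs:
            print(f"mount point {mount_point} is missing on {node}")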

#5 No Database File Redundancy
This problem can arise when the end-to-end data path is not visible to the database team. Single points of failure can be avoided by keeping multiple copies of transaction log files and control files, but if all the copies end up on the same unprotected disk, the redundancy exists in name only. To resolve this issue, place the copies of the control files and transaction log files on different file systems that reside on separate physical disks.
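
A sketch of such a check, assuming a hypothetical inventory that maps each multiplexed file to the physical disks backing its copies:

# Hypothetical inventory: for each multiplexed database file, the
# physical disk backing each of its copies.
copies = {
    "control file": ["disk-1", "disk-1"],      # two copies, one disk
    "redo log group 1": ["disk-1", "disk-2"],
}

for name, disks in copies.items():
    if len(set(disks)) < 2:
        print(f"{name}: all copies share a single disk ({disks[0]})")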

#6 Unauthorised Access to Storage
This problem arises when an HBA on a server is incorrectly configured with access permissions to storage volumes of other business services or servers. One scenario leading to this risk is when an HBA (FC adapter) is removed from a server that is no longer used and installed in another production server; because zoning and LUN masking are tied to the HBA’s WWN, the new server can now access the storage volumes of the retired server, which can result in downtime and data corruption. Resolving this issue requires reconfiguring the storage masking and zoning.
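
A sketch that compares a hypothetical masking view against the documented intent of which volumes each HBA should see:

# Hypothetical masking view (HBA WWPN -> accessible volumes) versus
# the documented intent.
masking = {
    "wwpn-aaaa": {"vol-10", "vol-11", "vol-99"},
    "wwpn-bbbb": {"vol-20"},
}
intended = {
    "wwpn-aaaa": {"vol-10", "vol-11"},
    "wwpn-bbbb": {"vol-20"},
}

for wwpn, volumes in masking.items():
    extra = volumes - intended.get(wwpn, set())
    if extra:
        print(f"{wwpn} has unauthorised access to {sorted(extra)}")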

#7 Network Configuration With Single DNS
When the DNS configuration creates a single point of failure, it can result in downtime. This occurs when the DNS server list contains a single server, or when it contains multiple servers of which only one is valid; if that name server becomes unavailable, name resolution fails. These issues can be rectified by updating the DNS settings with the IP addresses of at least two valid servers.
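
A sketch of the check, where the hypothetical 'responds' flag stands in for the result of a test query against each listed server:

# Hypothetical DNS server list for one host.
dns_servers = [
    {"ip": "10.0.0.53", "responds": True},
    {"ip": "10.0.9.53", "responds": False},  # stale entry
]

valid = [s["ip"] for s in dns_servers if s["responds"]]
if len(valid) < 2:
    print(f"DNS is a single point of failure: only {valid} respond(s)")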

#8 Geo Cluster With Erroneous Replication Configurations
This problem arises when the storage device group definitions are not updated after a new volume is added to the active node: the device group will be missing SAN volumes that the active node uses. Because those volumes are not part of the device group, they are not replicated, so when a failover command is triggered the data on the remote cluster node is inconsistent and unusable, and you run the risk of data corruption on the SAN volumes. To resolve this issue, the storage device group configuration must be refreshed and any missing storage volumes must be included.
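
A sketch of the completeness check, with hypothetical volume and device group contents:

# Hypothetical configuration: the SAN volumes used by the active node
# versus the volumes included in the replicated device group.
active_node_volumes = {"vol-01", "vol-02", "vol-03"}
device_group_volumes = {"vol-01", "vol-02"}

for volume in sorted(active_node_volumes - device_group_volumes):
    print(f"{volume} is used by the active node but is not replicated")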

#9 Inconsistent I/O Settings
Hidden misconfigurations may occur here because there is limited visibility into I/O multipathing. The problem arises when, for shared SAN volumes, the active cluster node is configured with four load-balanced I/O paths while the passive node is configured with only two paths and no load balancing. This degrades performance and can result in service disruption after a failover. To fix the problem, extra I/O paths must be created on the passive node and its I/O policy must be set to load balancing.
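
A sketch that compares hypothetical multipath settings for a shared SAN volume across the nodes:

# Hypothetical multipath settings per node.
multipath = {
    "active-node": {"paths": 4, "policy": "round-robin"},
    "passive-node": {"paths": 2, "policy": "failover-only"},
}

reference = multipath["active-node"]
for node, settings in multipath.items():
    if settings != reference:
        print(f"{node} deviates from the active node: {settings}")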

#10 Host Configuration Differences Between Cluster Nodes
Updates to hardware and software products are common. If an error occurs and changes are not recorded, the cluster nodes drift out of alignment in installed packages and products, kernel parameters, DNS settings, configuration files, and defined users and user groups. To fix this, install or upgrade hardware and software to close the gaps between cluster nodes, or revise parameters and configuration files to the correct values.
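
A sketch of a drift comparison over a few hypothetical configuration dimensions, as an agent on each node might report them:

# Hypothetical per-node snapshots of a few configuration dimensions.
nodes = {
    "node-a": {"packages": {"cluster-agent-4.2", "ssh-9.6"},
               "kernel": {"shmmax": 68719476736},
               "users": {"oracle", "clusadm"}},
    "node-b": {"packages": {"cluster-agent-4.2", "ssh-9.3"},
               "kernel": {"shmmax": 34359738368},
               "users": {"oracle"}},
}

node_a, node_b = nodes["node-a"], nodes["node-b"]
for dimension in node_a:
    if node_a[dimension] != node_b[dimension]:
        print(f"drift in {dimension}: {node_a[dimension]} vs {node_b[dimension]}")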
