Top ten hidden service availability threats
Disaster recovery preparations are shifting toward a service availability focus
As the IT landscape continues to grow in size and complexity, it is practically impossible for any bank’s IT team to ensure 100% adherence to vendor best practices in all areas of production and high availability (HA) environments.
In addition, ongoing and frequent configuration changes across all layers of the IT infrastructure inevitably result in variability and inconsistency in clusters (i.e., a group of connected computers that are viewed as a single system, in order to improve performance, minimize downtime, and help ensure data protection) and other HA setups, introducing risks that often remain hidden until manmade or natural disaster strikes.
The costs have been well documented across the banking industry: outages that shut down ATMs, disrupt online and mobile banking services, and impact trading systems cost banks around the world millions of dollars every year. When IT disaster strikes, the business suffers the pain, from regulatory noncompliance to lost productivity, lost revenue, and damaged brand reputation.
While traditionally the conversation has centered around disaster recovery, it is transitioning rapidly to a service availability focus instead. Service availability (a.k.a. business availability) has moved beyond just talking about maintaining and recovering IT systems, to sustaining and recreating business services. Clearly, a proactive approach to service availability assurance is essential. The criticality of such planning has become agonizingly evident given the natural and manmade disasters over the past two decades, not to mention the well-publicized IT mishaps.
History has shown that a great plan on paper is no guarantee, and that testing is critical. Still, the frequency and level of testing are often dependent on a bank’s size: larger banks often test multiple times each year, while smaller banks may only be able to test annually. But what happens the day, week, or month after the test? What changes have been, or will be, made that will once again render the bank vulnerable?
Clearly, as the complexity and scale of IT infrastructures and related business services continue to grow, the key lies within a bank’s ability to maintain continuous visibility into downtime and data loss risks across the entire organization. Luckily, there are in fact solutions on the market today that will allow a bank to do just that. By deploying such a service availability management solution, banks can mitigate downtime and data loss risks by monitoring production and high availability/disaster recovery environments to detect hidden vulnerabilities and gaps. In doing so, users can be confident their service availability and data protection goals will be met on a constant basis.
The following is a listing of today’s top ten service availability risks lurking in virtually every financial institution’s data center, as well as details regarding the risk cause, potential impact, and recommended/best practices resolutions.
- No. 1: Incomplete storage access by cluster nodes (a node is a computer used as a server)
The problem: A file system “mount” resource (i.e., to mount is to make a group of files in a file system structure accessible to a user or user group) is defined in the cluster; however, one of the underlying storage area network (SAN) volumes is not accessible to the passive cluster node.
The cause: A new SAN volume was added to the active cluster node but was inadvertently not configured on the passive node of the cluster.
The impact: Unsuccessful cluster failover resulting in prolonged downtime.
The resolution: Map the newly added SAN volume to the passive cluster node.
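For illustration, this gap can be surfaced before a failover test by comparing the SAN volume identifiers visible on each node. The minimal Python sketch below assumes the per-node WWID sets have already been collected (for example, from each host’s storage or multipath inventory); the identifiers shown are hypothetical placeholders.

```python
# Minimal sketch: flag SAN volumes visible on the active cluster node but not
# on the passive node. The WWID sets are hypothetical placeholders for data
# collected from each host's storage/multipath inventory.
active_node_wwids = {"360000970000192604642533030303233",   # hypothetical
                     "360000970000192604642533030303234"}
passive_node_wwids = {"360000970000192604642533030303233"}  # hypothetical

missing_on_passive = active_node_wwids - passive_node_wwids
if missing_on_passive:
    print("SAN volumes not mapped to the passive node:")
    for wwid in sorted(missing_on_passive):
        print("  ", wwid)
else:
    print("All clustered SAN volumes are visible on both nodes.")
```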
- No. 2: SAN Fabric with a single point of failure
The problem: A single point of failure exists on the input/output (I/O) path from a server to a storage array. (All paths go through a single fibre channel (FC) adapter, SAN switch, or an array port.)
The cause: Maintaining path redundancy across multiple layers (hosts, switch network, arrays) is an extreme challenge due to high complexity. Coordination issues across various teams and/or human errors can result in: defining only a single path, or multiple I/O paths on the same FC adapter; connecting host FC adapters to a single switch (or to multiple switches that are connected to a single switch); or mapping a storage volume to a single port on the disk array front-end director.
The impact: Failure of a single component on the path from the server to the storage array will lead to an outage, resulting in a higher downtime risk.
The resolution: Configure multiple I/O paths that use different FC adapters, SAN switches, and array ports.
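As a rough illustration, single points of failure along the I/O path can be spotted by checking whether every path shares the same adapter, switch, or array port. The Python sketch below models each path as a simple tuple; the path data is a hypothetical stand-in for a real SAN fabric inventory.

```python
# Minimal sketch: detect a component shared by every I/O path from a server to
# a storage array. Each path is modeled as (FC adapter, SAN switch, array port);
# the data below is a hypothetical placeholder for a real fabric inventory.
paths = [
    ("fc_adapter_0", "san_switch_A", "array_port_1A"),  # hypothetical
    ("fc_adapter_0", "san_switch_A", "array_port_2B"),  # hypothetical
]

for position, component in enumerate(("FC adapter", "SAN switch", "array port")):
    distinct = {path[position] for path in paths}
    if len(distinct) < 2:
        print(f"Single point of failure: all paths go through one {component}: {distinct}")
```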
- No. 3: Private cloud business concentration risk
The problem: A particular business service relies on a group of virtual machines, all running on the same physical host.
The cause: Over time, virtual machines (VMs) may move from one host to another (either manually or automatically) and end up all running on the same host, resulting in a single point of failure.
The impact: A single host failure could completely shut down a critical business service.
The resolution: Redistribute VMs across different hosts.
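To illustrate, concentration risk can be surfaced by grouping a service’s VMs by the physical host they run on. The sketch below uses a hypothetical VM-to-host mapping standing in for inventory data pulled from the virtualization platform.

```python
# Minimal sketch: warn when all VMs supporting one business service run on the
# same physical host. The mapping is a hypothetical placeholder for inventory
# data from the virtualization platform.
vm_to_host = {
    "payments-app-01": "host-03",  # hypothetical
    "payments-app-02": "host-03",
    "payments-db-01":  "host-03",
}

hosts_in_use = set(vm_to_host.values())
if len(hosts_in_use) == 1:
    print(f"Concentration risk: all VMs of this service run on {hosts_in_use.pop()}")
```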
- No. 4: Erroneous cluster configuration
The problem: A cluster is configured to mount a file system; however, the mount point directory does not exist on all of the cluster nodes. (A mount point defines the path to a directory at which a file system is made accessible. Before a file system can be used, it must be mounted.)
The cause: A new file system was added, or changes were made to the mount point directory of an existing file system, on the active node. The same changes were not performed on the passive node.
The impact: Unsuccessful cluster failover resulting in downtime. (A failover cluster is a group of servers that work together to maintain high availability of applications and services. Failover takes place when one of the servers or nodes fails and another node in the cluster takes over its workload without downtime.)
The resolution: Add the missing mount point directory on the passive cluster node.
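For illustration, a simple per-node check can confirm that every cluster-defined mount point directory actually exists locally. In the Python sketch below, the mount point list is a hypothetical placeholder for the cluster’s file system resource definitions; the script would be run on each node.

```python
# Minimal sketch: run on each cluster node to verify that every mount point
# directory defined in the cluster exists locally. The list is a hypothetical
# placeholder for the cluster's file system resource definitions.
import os

cluster_mount_points = ["/oradata", "/oraarch", "/appshared"]  # hypothetical

for mount_point in cluster_mount_points:
    if not os.path.isdir(mount_point):
        print(f"Missing mount point directory on this node: {mount_point}")
```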
- No. 5: No database file redundancy
The problem: Best practices dictate that, in order to avoid a single point of failure, several instances (i.e., redundant copies) of the control file and of each transaction log file should be created (redo log and control file multiplexing). In practice, however, the multiple instances of the control file or transaction log files are stored on a single unprotected disk, creating a single point of failure. (The control file records the physical structure of the database: the database name, the names and locations of associated data files and redo log files, the timestamp of database creation, the current log sequence number, and checkpoint information.)
The cause: The database team does not have visibility into the entire end-to-end data path (database–host–storage).
The impact: Higher risk of downtime and data loss due to unavailable database file.
The resolution: Move control files or transaction log files to a different file system so that each file will be stored on a separate file system and disk.
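As an illustration, the placement of the multiplexed copies can be checked by resolving the file system each copy lives on. The sketch below uses hypothetical control file paths; a complete check would also confirm that the file systems reside on separate physical disks.

```python
# Minimal sketch: verify that multiplexed control file copies live on different
# file systems. The file paths are hypothetical placeholders; a real check
# would also confirm the file systems sit on separate physical disks.
import os

def mount_point(path: str) -> str:
    # Walk up the directory tree until a mount point is reached.
    path = os.path.abspath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path

control_files = ["/oradata1/control01.ctl", "/oradata2/control02.ctl"]  # hypothetical
mounts = {mount_point(f) for f in control_files}
if len(mounts) < len(control_files):
    print("Control file copies share a file system:", sorted(mounts))
```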
- No. 6: Unauthorized access to storage
The problem: A business service server has unauthorized access to a storage device that is connected to another business service cluster.
The cause: An FC adapter (a.k.a. HBA or host bus adapter, the circuit board or adapter that provides the I/O processing and physical connectivity between a server and storage) was removed from a retired server and installed in another production server. This server now has access to storage volumes previously used by the retired server. (There is no simple way to manually detect this hidden risk when reusing the HBA.)
The impact: Inevitable data corruption and downtime.
The resolution: Reconfigure storage zoning appropriately.
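To illustrate how such access can be caught, each server’s actually visible storage volumes can be compared against the volumes it is authorized to use. The data in the sketch below is hypothetical, standing in for a zoning/masking export and a configuration inventory.

```python
# Minimal sketch: flag servers that can see storage volumes outside their
# authorized set. Both mappings are hypothetical placeholders for data from a
# zoning/masking export and a configuration inventory.
authorized_volumes = {"trading-srv-01": {"LUN_0101", "LUN_0102"}}             # hypothetical
visible_volumes = {"trading-srv-01": {"LUN_0101", "LUN_0102", "LUN_0201"}}    # hypothetical

for server, visible in visible_volumes.items():
    unauthorized = visible - authorized_volumes.get(server, set())
    if unauthorized:
        print(f"{server} has unauthorized access to: {sorted(unauthorized)}")
```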
- No. 7: Network configuration with a single DNS server (DNS, the domain name system, is the standard technology for managing the names of websites and other internet domains)
The problem: A production server is configured with a single name server.
The cause: A DNS server was replaced, but the directory service settings were not updated to reflect the change.
The impact: Higher risk of downtime due to a single point of failure: name resolution will fail if the single connected name server becomes unavailable.
The resolution: Update the directory service settings file to reflect the correct DNS server configuration.
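For illustration, on a Linux/Unix server this condition can be detected by counting the name servers listed in the resolver configuration. The sketch below assumes the conventional /etc/resolv.conf location.

```python
# Minimal sketch: warn when the resolver configuration lists fewer than two
# name servers. Assumes the conventional /etc/resolv.conf location on a
# Linux/Unix server.
nameservers = []
with open("/etc/resolv.conf") as resolv_conf:
    for line in resolv_conf:
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            nameservers.append(parts[1])

if len(nameservers) < 2:
    print(f"Single point of failure: only {len(nameservers)} name server(s) configured: {nameservers}")
```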
- No. 8: Geocluster (i.e., the use of multiple redundant computing resources located in different geographic locations to form one single highly available system) with erroneous replication configuration
The problem: The cluster controls storage replication between the local and remote nodes using a device group (i.e., a set of storage devices that are managed and replicated together as a single unit). The device group does not include all the SAN volumes used by the active node.
The cause: A new volume was added to the active node but device group definitions were not updated accordingly.
The impact: Upon cluster failover, replication will not be stopped for SAN volumes not included in the device group. As a result, the target SAN volumes will not be accessible to the remote node and the application will fail to start. Furthermore, data may be corrupted on the target SAN volumes.
The resolution: Refresh the configuration of the storage device group and include any missing storage volumes.
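As an illustration, the gap can be found by comparing the SAN volumes used by the active node against the device group membership. The sets in the sketch below are hypothetical placeholders for data from the host and from the replication manager.

```python
# Minimal sketch: list SAN volumes used by the active geocluster node that are
# missing from the replication device group. Both sets are hypothetical
# placeholders for data from the host and the replication manager.
active_node_volumes = {"DEV_001", "DEV_002", "DEV_003"}   # hypothetical
device_group_members = {"DEV_001", "DEV_002"}             # hypothetical

unreplicated = active_node_volumes - device_group_members
if unreplicated:
    print("Volumes missing from the replication device group:", sorted(unreplicated))
```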
- No. 9: Inconsistent I/O settings
The problem: While the active cluster node is configured with four I/O paths and load-balancing for shared SAN volumes, the passive node of the cluster is configured with only two I/O paths and no load-balancing.
The cause: I/O multipathing configuration may change dynamically and may be affected by changes in network infrastructure. Limited visibility into I/O multipathing may result in a hidden misconfiguration.
The impact: Upon failover, significant performance degradation and probable service disruption.
The resolution: Configure additional I/O paths on the passive node and set the I/O policy to load-balancing.
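For illustration, a periodic comparison of the multipathing settings collected from both nodes makes this kind of drift visible. The per-node values in the sketch below are hypothetical placeholders for parsed multipath configuration output.

```python
# Minimal sketch: compare I/O multipathing settings collected from the two
# cluster nodes. The per-node values are hypothetical placeholders for parsed
# multipath configuration output.
active_node = {"path_count": 4, "io_policy": "load-balancing"}    # hypothetical
passive_node = {"path_count": 2, "io_policy": "failover-only"}    # hypothetical

for setting in active_node:
    if active_node[setting] != passive_node[setting]:
        print(f"Mismatch in {setting}: active={active_node[setting]}, "
              f"passive={passive_node[setting]}")
```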
- No. 10: Host configuration differences between cluster nodes
The problem: The cluster nodes are not aligned in terms of installed products and packages, kernel parameters (i.e., settings that control the behavior of the operating system kernel), defined users and groups, DNS settings, and other configuration options.
The cause: With the high complexity and frequent updates common to software and hardware products, it is easy to miss a change and fail to keep all cluster nodes with the same configuration.
The impact: Upon failover, applications may fail to start or not function as expected.
The resolution: Install or upgrade hardware and software components to close the gaps between the cluster nodes.
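To illustrate, configuration baselines collected from each node (installed packages, kernel parameters, users and groups, DNS settings) can be diffed on a regular schedule. The baselines in the sketch below are hypothetical placeholders for per-node collection output.

```python
# Minimal sketch: diff configuration baselines collected from two cluster
# nodes. The baselines are hypothetical placeholders for per-node collection
# output (installed packages, kernel parameters, DNS settings, and so on).
node_a = {
    "packages": frozenset({"db-client-19.3", "cluster-sw-12.2"}),  # hypothetical
    "kernel.shmmax": "68719476736",
    "dns_servers": ("10.0.0.53", "10.0.1.53"),
}
node_b = {
    "packages": frozenset({"db-client-19.3"}),                     # hypothetical
    "kernel.shmmax": "34359738368",
    "dns_servers": ("10.0.0.53",),
}

for key in sorted(set(node_a) | set(node_b)):
    if node_a.get(key) != node_b.get(key):
        print(f"Configuration drift in '{key}': {node_a.get(key)!r} vs {node_b.get(key)!r}")
```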
By Yaniv Valik, director of Professional Services and Gap Research, Continuity Software