Most organizations are always hunting for higher levels of availability for applications and services. As technology has matured and consumer services have become part of our everyday life, the idea is that everything should be available anytime and anywhere.
By Brian Suhr
This quest isn't so simple because there is usually a service-level agreement (SLA) that must be met. The SLA is a commitment between the IT department and the business, or between a company and its customers. The two kinds carry different ramifications, and some of those can be financial.
There are a number of design and architecture approaches to increase the availability of your application, service or data center. These range from building a distributed application to load-balancing parts of an application. At the infrastructure layer, we don't want any single points of failure in the compute, network and storage infrastructure.
Legacy apps present a problem
The problem is that the majority of us do not use the distributed Web scale business model. We deal with enterprise applications or internally developed applications. These applications were created and developed to run in an infrastructure environment of a decade ago. The architecture in that era was focused on increasing the uptime of a single storage array. If that array went down, it would take time to recover.
Today, the expectations of the business and the customer are dramatically different. They are no longer willing to wait long periods of time for a failover to occur. The trouble is we are still stuck with these applications and how they handle failures.
IT professionals often have to argue that simply moving legacy applications into the cloud will not fix these issues. Not only will the applications behave the same, but the move will likely lower their availability, because most cloud infrastructure is designed not to care about individual hardware failures and expects the application to handle them.
What do you do next?
By now, the majority of organizations have spent a lot of time and effort to increase availability within a single data center. Some organizations still have work to do there, but the larger opportunity lies outside of a single site.
A lot of organizations ask how they can utilize a disaster recovery site or build a secondary site and use both locations to increase availability. The desire is to treat both data centers as active so workloads could run in both sites or allow running workloads to move freely between them.
This is not to be taken lightly, because some applications are monolithic in nature and typically will not behave well if their dependencies are moved to the other data center. This approach also raises concerns about network bandwidth and latency, which can dramatically affect the performance of these legacy applications.
VMware has done a great job with its architecture at the hypervisor level to provide a high level of local availability. Building on this approach, VMware introduced vSphere Metro Storage Cluster (vMSC) in vSphere 5. With VMware vMSC, a single vSphere cluster stretches across two physical sites. By building a vMSC, customers gain the ability to vMotion virtual machines between sites for disaster avoidance, maintenance and high availability in case of a serious failure.
What are the design challenges?
The initial design challenge for a VMware vMSC implementation is the storage layer. Only a limited number of vendors and storage offerings are supported and certified for vMSC. To support this type of deployment, the storage offering must support synchronous replication between the sites. The maximum supported round-trip latency for this is 10 ms today, although several vendors still state 5 ms as their supported limit.
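As a rough illustration of the latency requirement, the sketch below checks a set of measured inter-site round-trip times against a supported limit. The function name and the sample values are hypothetical; only the 10 ms and 5 ms figures come from the text. Judging the link by its worst sample is a conservative assumption, chosen because a synchronously replicated write is not acknowledged until the remote site confirms it.

```python
from typing import Sequence


def link_within_limit(rtt_samples_ms: Sequence[float], limit_ms: float) -> bool:
    """Return True when the worst measured round trip fits the supported limit."""
    if not rtt_samples_ms:
        raise ValueError("need at least one RTT sample")
    return max(rtt_samples_ms) <= limit_ms


samples = [3.2, 4.1, 6.5]                  # made-up measurements, in milliseconds
print(link_within_limit(samples, 10.0))    # True: fits a 10 ms limit
print(link_within_limit(samples, 5.0))     # False: too slow for a 5 ms limit
```

The same link can therefore qualify with one storage offering and fail with another, so it is worth checking the limit your vendor actually states.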
Among these vendors, there are different options for implementing a stretched storage offering. Storage can be read/write active on both sides, or read/write on one side and read-only on the other until a failover occurs. Depending on the implementation, you may be required to present Fibre Channel zones between the two sites with the necessary infrastructure and bandwidth. Sending writes to only a single side can also lead to a non-optimized I/O path.
After satisfying your storage design requirements, you will still need to understand how to design the vSphere cluster architecture. You will need to decide how the cluster should behave when a VM restarts, a host fails or an entire site fails.
VMware advises using DRS rules
Since vSphere is still not aware of sites, there is no specific feature set to control how failures are handled at each site. To work around this, VMware recommends using Distributed Resource Scheduler (DRS) rules to create groups made up of the hosts at each site within the stretched cluster. You can then assign VMs to each DRS group based on the site where you plan to run them. If a VM or host fails, the rules guide vSphere on where the VM should restart, subject to available capacity.
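The effect of site-based groups can be pictured with a small model. This is a simplified sketch, not the vSphere API and not VMware's actual placement algorithm: the `Host` class, the `free_slots` capacity unit and `pick_restart_host` are all hypothetical names invented for illustration. It treats the site assignment as a soft preference, so a VM stays in its preferred site when there is room and falls back to the other site otherwise.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    site: str          # which data center the host lives in, e.g. "A" or "B"
    free_slots: int    # hypothetical capacity unit: VMs this host can still take
    alive: bool = True


def pick_restart_host(preferred_site: str, hosts: List[Host]) -> Optional[Host]:
    """Pick a host for a restarting VM, preferring its assigned site."""
    usable = [h for h in hosts if h.alive and h.free_slots > 0]
    local = [h for h in usable if h.site == preferred_site]
    # Soft preference: stay in the preferred site when it has room,
    # otherwise fall back to any usable host in the cluster.
    candidates = local or usable
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.free_slots)


hosts = [Host("a1", "A", 0), Host("a2", "A", 2), Host("b1", "B", 5)]
print(pick_restart_host("A", hosts).name)   # a2: Site A still has room
hosts[1].alive = False
print(pick_restart_host("A", hosts).name)   # b1: no capacity left in Site A
```

The fallback to the other site is what makes spare capacity at each site matter: without it, the preference cannot be honored when a host fails.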
This is where you will need to decide on the behavior. The minimum capacity that I need to run my workloads is four vSphere hosts, which does not include a host for HA capacity. This would mean that my stretched cluster would need four hosts at each of my sites, resulting in an eight-node cluster. This gives me 100% capacity at each of my sites should an entire site fail.
But if I were to lose a single host at Site A, for example, the VMs running on that host are most likely going to be restarted on hosts in Site B, since that is where capacity is available if I run most of my workloads in Site A. If I run both sites at 50% capacity, then, ideally, there should be enough capacity available in each site that my VMs will not restart on the other side.
You could also add a fifth host for high-availability capacity, either to the side that will be heavily used or to both sides. This would provide enough capacity to lower the chances of VMs restarting on the non-preferred site. If you run legacy enterprise applications, they are not likely to perform normally if they are randomly moved between sites. You would want to move them together with all of their dependencies.
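The capacity reasoning above can be checked with a few lines of arithmetic. This is a minimal sketch under simplifying assumptions that are not in the original: all hosts are identical, load is measured in whole hosts of capacity, and it spreads evenly across the hosts at a site. The function name is hypothetical.

```python
def can_absorb_host_failure(required_hosts: float, workload_share: float,
                            hosts_at_site: int) -> bool:
    """True if a site can lose one host and still carry its own load.

    required_hosts: hosts' worth of capacity the whole workload needs
    workload_share: fraction of the workload running at this site (0.0 to 1.0)
    hosts_at_site:  hosts installed at this site
    """
    site_load = required_hosts * workload_share
    return site_load <= hosts_at_site - 1


REQUIRED = 4                                        # from the text: four hosts of capacity
print(2 * REQUIRED)                                 # 8 hosts in the stretched cluster
print(can_absorb_host_failure(REQUIRED, 1.0, 4))    # False: restarts spill to the other site
print(can_absorb_host_failure(REQUIRED, 0.5, 4))    # True: 50% per site leaves headroom
print(can_absorb_host_failure(REQUIRED, 1.0, 5))    # True: the fifth host covers the loss
```

The three cases correspond to running everything in Site A on four hosts, splitting the load evenly, and adding a fifth host to the busy side.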
Original Article Posted on TechTarget