The Four Tenets of Application Availability


Posted on April 29, 2013 by TheStorageChap in Application Availability, Continuous Operations, RecoverPoint, Video, VPLEX

Today I want to briefly describe what I am seeing as the four tenets of application availability.

The growing reliance on information, derived from data and applications, is causing organisations to rethink their strategies for storage, application and data centre availability. The traditional Primary and DR data centre design no longer meets the evolving business requirement for 24×7 data and application availability, and in today's cost-constrained economic climate, under-utilised assets sitting in a DR site are an expense that no business can afford any longer. To improve asset utilisation and reduce disruption, organisations are instead looking at Active/Active data centre design.

Active/Active data centre design enables organisations to use the I/O assets that they have at both data centres for the same production applications, vastly improving asset utilisation whilst offering new opportunities for storage and application availability.

In the effort to increase application availability, four key tenets, capabilities or requirements are emerging.

Tolerance – Applications must be tolerant of infrastructure failure including storage and data centre failures
Mobility – The ability to move data and applications not only across servers within the data centre, but also non-disruptively across data centres
Logical Corruption Protection – Data should be easily recoverable in the event of logical corruption
Out of Region Disaster Recovery – Data and application failover capability should also be available out of region to mitigate against wide scale disasters

Tolerance

For the 20% of downtime that is unplanned, organisations need to build solutions that are tolerant of partial or total infrastructure failure. Traditional data centre thinking has ensured that applications run on clustered server resources and that those servers have multiple paths through redundant switches to LUNs that sit within arrays with multiple controllers. Sometimes multiple LUNs from multiple local storage arrays will be mirrored at the host level to provide storage availability within the data centre. But in an Active/Passive data centre architecture, in most cases the loss of a storage array or the loss of the site results in downtime and application disruption.

EMC VPLEX Metro enables organisations to break the physical boundaries of storage arrays and sites by distributing LUNs across physical arrays and physical sites.

These distributed virtual volumes have exactly the same LUN and storage IDs across both sites and are read/write accessible. Why is this important? Because it enables two things.

Firstly, 100% storage availability: unlike an active/passive replication solution, if an array goes offline within a data centre, the applications in that site continue running with no disruption.

Secondly, it enables organisations to create application clusters that can span data centres, with none of the complexity of integrating stretched geographic clustering with your DR replication solution. With VPLEX Metro, cluster nodes in both sites have R/W access to what appears to be the same LUN in two physically separate sites. For traditional Active/Passive application cluster technologies like VMware HA, Microsoft Cluster, Power HA etc., VPLEX enables fully automatic failover between the sites with no manual intervention and no requirement to involve storage DR, as the storage is already available on both sites. For Active/Active application clustering technologies like Oracle RAC or IBM PureScale, it enables a simplified architecture that will maintain application availability with no disruption even in the event of a total site failure.
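To make the idea of a distributed volume a little more concrete, here is a minimal toy model in Python. It is only a sketch of the concept, not the VPLEX API: the class and method names are made up for illustration, and it ignores real-world concerns such as resynchronising a failed leg or arbitrating a split-brain.

```python
# Toy model only: this is NOT the VPLEX API. It illustrates the idea that a
# volume mirrored synchronously across two sites stays readable and writable
# while at least one copy ("leg") is online. All names are hypothetical.

class DistributedVolume:
    def __init__(self, sites):
        # One copy of the data per site, all starting online.
        self.legs = {site: {} for site in sites}
        self.online = {site: True for site in sites}

    def fail(self, site):
        # Simulate losing the array (or the whole site) behind one leg.
        self.online[site] = False

    def write(self, block, data):
        # A write is applied to every leg that is still online.
        targets = [site for site in self.legs if self.online[site]]
        if not targets:
            raise RuntimeError("no surviving leg: volume unavailable")
        for site in targets:
            self.legs[site][block] = data

    def read(self, block):
        # A read can be served from any online leg.
        for site, up in self.online.items():
            if up:
                return self.legs[site][block]
        raise RuntimeError("no surviving leg: volume unavailable")


volume = DistributedVolume(["site_a", "site_b"])
volume.write(0, "order-1001")
volume.fail("site_a")              # lose an array or the entire site
survivor_read = volume.read(0)     # still served from site_b
volume.write(1, "order-1002")      # writes continue on the surviving leg
print(survivor_read, volume.read(1))
```

The point of the sketch is simply that the application never has to "fail over" its storage: the same volume remains available from the surviving site.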

Mobility

80% of application downtime can be attributed to planned downtime, where servers, storage and infrastructure are taken offline for such things as routine maintenance, software/patch updates, storage upgrades or data centre power work. When working on new data centre designs, organisations therefore need to consider how any new solution could mitigate planned downtime as well as unplanned downtime.

The answer in this case is Mobility. EMC VPLEX enables previously disruptive storage operations, such as rebalancing a LUN across arrays or a full storage refresh, to be completed non-disruptively. It also provides an infrastructure that enables applications virtualised under technologies such as VMware, Hyper-V, AIX VIO or Linux KVM to be moved non-disruptively between data centres. This could be for pre-emptive disaster avoidance, for example if a power outage needed to occur across a data centre, or to rebalance workloads across the available assets. All of this happens without any need to copy data first, as the data is already Active/Active across both sites.

Logical Corruption Protection

Traditional methods of corruption protection such as snapshots and clones are becoming less desirable due to the inefficient use of space and the potential for data loss.

In the majority of environments the time frame between snapshots or clones is at least an hour, and often three or more hours. In the event of logical corruption, recovering to a snapshot or a clone could result in many hours of data loss.

In a database environment, for example Oracle, if you recover to a snap or a clone from three hours ago it will take you roughly the same amount of time to replay the database logs back to the latest point in time, resulting in significant downtime.
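The arithmetic behind those two paragraphs can be sketched in a few lines of Python. The intervals and the replay ratio below are assumptions taken from the rough figures above, not measurements, and the function names are purely illustrative.

```python
# Illustrative arithmetic only; the intervals are assumptions, not measurements.

def worst_case_data_loss_hours(snapshot_interval_hours):
    # If corruption strikes just before the next snapshot and the logs cannot
    # be replayed, almost a full interval of changes is lost.
    return snapshot_interval_hours


def recovery_time_hours(hours_since_snapshot, replay_ratio=1.0):
    # replay_ratio = hours of log replay needed per hour of logged activity.
    # 1.0 reflects the "roughly the same amount of time" estimate above.
    return hours_since_snapshot * replay_ratio


for interval in (1, 3):
    print(
        f"snapshot every {interval}h: "
        f"up to {worst_case_data_loss_hours(interval)}h of data at risk, "
        f"or about {recovery_time_hours(interval)}h of log replay"
    )
```

Either way the business pays: in lost data if the logs are unusable, or in downtime while they are replayed.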

EMC VPLEX is integrated with EMC RecoverPoint, which provides granular corruption protection using RecoverPoint's Continuous Data Protection (CDP) module. Used in conjunction with VPLEX, it records every single write over a chosen retention period. This allows an organisation to recover back to literally the write before a corruption event took place, vastly reducing application downtime in the event of logical data corruption.
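Conceptually, continuous data protection behaves like a journal of every write. The following Python sketch is a toy model of that idea, assuming a simple in-memory journal; it is not how RecoverPoint is implemented, and the class and method names are hypothetical.

```python
# Toy journal, not RecoverPoint's implementation: every write is recorded in
# arrival order, so the volume image can be rebuilt as of any earlier write.

class WriteJournal:
    def __init__(self):
        self.entries = []  # (sequence, block, data) tuples in arrival order

    def record(self, block, data):
        self.entries.append((len(self.entries), block, data))

    def image_before(self, sequence):
        # Replay every write strictly before the given sequence number.
        image = {}
        for seq, block, data in self.entries:
            if seq >= sequence:
                break
            image[block] = data
        return image


journal = WriteJournal()
journal.record("row-1", "balance=100")   # seq 0
journal.record("row-2", "balance=250")   # seq 1
journal.record("row-1", "balance=????")  # seq 2: the corrupting write
journal.record("row-2", "balance=300")   # seq 3

# Roll back to the state immediately before the corrupting write.
recovered = journal.image_before(2)
print(recovered)
```

Here `recovered` holds both rows as they were just before write 2. Note that any writes after the corruption point (write 3 in this example) are also discarded by the rollback, which is the trade-off with any point-in-time recovery.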

Out of Region Disaster Recovery

The benefits of Active/Active data centres can be offset by the fact that in some cases these data centres will be relatively close together, due to the requirement for synchronous replication, and therefore may not meet business or regulatory requirements for 'Disaster Recovery'. For organisations with these requirements, the Active/Active infrastructure provided by VPLEX can be augmented with RecoverPoint Continuous Remote Replication (CRR). RecoverPoint CRR enables VPLEX distributed virtual volumes to be replicated to an out of region data centre, on the other side of the world if required, with the added benefit that RecoverPoint CRR also has CDP-like rollback for granular corruption protection in the DR site. Depending on how risk averse you are, RecoverPoint CDP and RecoverPoint CRR can be combined for logical corruption protection both locally and remotely.

But what is the cost?

At this point you may be thinking that this all sounds interesting, but what is it going to cost? Overall, not as much as you might think. When you compare DR, HA and Continuous Availability solutions, the overall costs can build a compelling case in favour of Continuous Availability. Procuring a continuously available application stack can be more expensive than procuring a traditional Active/Passive disaster recovery solution. But it is not only the procurement cost that needs to be taken into account.

The operational costs associated with documenting and maintaining a traditional DR solution, DR run books etc. are higher than with a Continuous Availability solution, as is the cost associated with DR testing. In fact, in many cases the complexity and downtime associated with DR testing mean that these tests are simply not completed, putting many businesses at risk in the event of a real disaster. Then finally, what about the cost of downtime? With a Continuous Availability solution there will be little or no downtime, but with alternative architectures the real financial cost of downtime can far exceed the cost of the initial investment in a Continuous Availability solution. Ultimately it is simply a business decision based on cost vs. risk.
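One way to frame that cost vs. risk decision is to add up procurement, running costs and the expected cost of downtime over a few years. The Python sketch below does exactly that. Every figure in it is a hypothetical placeholder chosen for illustration, so the result says nothing about real pricing; substitute your own estimates.

```python
# All figures are hypothetical placeholders; substitute your own estimates.

def total_cost(procurement, annual_operations, annual_dr_testing,
               expected_downtime_hours_per_year, cost_per_downtime_hour, years):
    # Running costs: documentation, run books, maintenance and DR testing.
    running = annual_operations + annual_dr_testing
    # Expected annual cost of downtime to the business.
    downtime = expected_downtime_hours_per_year * cost_per_downtime_hour
    return procurement + years * (running + downtime)


# Made-up three-year comparison: cheaper to buy, but more downtime and testing.
active_passive = total_cost(1_000_000, 150_000, 50_000, 8, 100_000, years=3)
# Made-up three-year comparison: dearer to buy, but little downtime.
continuous = total_cost(1_400_000, 100_000, 10_000, 0.5, 100_000, years=3)

print(f"Active/Passive DR:       {active_passive:,.0f}")
print(f"Continuous Availability: {continuous:,.0f}")
```

With these invented inputs the downtime term dominates, which is the argument made above. With a low cost per hour of downtime the comparison can just as easily go the other way, which is why it remains a business decision.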