The Anti-converged System: 10 Steps to a Disaggregated Data Center

Converged IT systems, which stuff more and more functionality into smaller and smaller data center components, have been a trend for the past half-dozen years. As the world moves toward the Internet of Things, data centers must be viewed as a collection of resources that can evolve to address the changing requirements of enterprise workloads. Thus, being able to replace, repair or otherwise swap out software and components independently becomes more important. This has led to the concept of a disaggregated data center, a well-connected but intentionally decoupled IT system. Networking and storage are often purchased and configured separately from servers; disaggregating systems goes deeper to also target the processing, random-access memory and I/O subsystems. Hyperscale cloud service providers, for example, are interested in disaggregation because they see it as more flexible, with fewer underutilized resources.

By Chris Preimesberger

A disaggregated data center—a well-connected but intentionally decoupled IT system—is often more flexible and has fewer underutilized resources.


Transportable Environments

When the data center is viewed as a collection of resources, the workload that runs on those resources should be easy to move. Whether a workload runs in a container (e.g., Docker containers orchestrated by Kubernetes), in a virtual machine or via a batch-processing framework, it should be responsive and largely independent of the hardware on which it runs. This allows the data center to migrate workloads and optimize resource utilization.

OSI Optimization

The network should be flexible enough to support workloads and hardware that are moving and changing. Typically, software-defined networking (SDN) and network functions virtualization (NFV) are the go-to concepts in this area; as the environment shifts, the network should shift along with it, without the need for manual reconfiguration.

Embarrassing Parallelism

The concept of "embarrassing parallelism" has flourished alongside the disaggregated data center. In parallel computing, an embarrassingly parallel workload (or embarrassingly parallel problem) is one for which little or no effort is required to separate the problem into a number of parallel tasks. This is often the case when no dependency (or communication) exists between those parallel tasks. In an environment where workloads are transportable and the network is flexible, it is imperative for service designers to build around the embarrassingly parallel aspects of their applications. How can a workload be divided up and distributed to a pool of data center resources that can be scaled up and down as load increases and decreases?
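As a rough illustration of that question, the sketch below splits a job into chunks that share no state and hands them to a worker pool whose size is a parameter. The function names (checksum_chunk, split, run_parallel) are made up for this example, and the thread pool only stands in for a pool of data center resources; a process pool or a cluster scheduler could take its place.

```python
from concurrent.futures import ThreadPoolExecutor


def checksum_chunk(chunk):
    """Process one chunk with no reference to any other chunk."""
    return sum(chunk) % 65521


def split(data, n_chunks):
    """Divide data into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_parallel(data, workers):
    """Fan independent chunks out to a pool whose size can follow the load."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so results line up with the input.
        return list(pool.map(checksum_chunk, chunks))
```

Because no chunk depends on another, changing the workers argument changes only how the work is spread out, not the combined result.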


Fault Expectancy

In any data center, failures happen. In a modern disaggregated data center, failures are to be expected. Just as a robust stand-alone application handles read/write failures, transient resource unavailability and unexpected shutdowns, services for the disaggregated data center must expect any and all resources to become temporarily unavailable, and be able to recover from and adapt to these changes in discrete resources.
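One common way a service can expect temporary unavailability is to retry with exponential backoff before giving up. The sketch below is a minimal version of that idea; the ResourceUnavailable exception, the call_with_retry helper and its default values are assumptions made for illustration, not part of any particular platform.

```python
import time


class ResourceUnavailable(Exception):
    """Hypothetical error raised when a resource is temporarily unreachable."""


def call_with_retry(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation, retrying with exponential backoff on transient failures."""
    for attempt in range(attempts):
        try:
            return operation()
        except ResourceUnavailable:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Wait base_delay, then 2x, then 4x, before trying again.
            sleep(base_delay * 2 ** attempt)
```

The sleep function is passed in as a parameter so the waiting behavior can be replaced in tests or tied to a scheduler.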

Look South Into All Subparts of the Rack

The modern disaggregated data center is composed of racks full of resources. It is imperative for service managers and designers to have the ability to programmatically look southbound into the rack; specifically, to be able to enumerate, monitor and control all subparts of all components in that rack. This requires granular application programming interfaces (APIs) that expose this information. Ideally, a single, powerful southbound API should be made available, though in some cases a variety of APIs are cobbled together (e.g., IPMI and SNMP). Beyond resources, placing sensors at rack level and component level also provides a valuable glimpse into what is going on in a rack in a disaggregated data center.
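To make the idea of a single southbound view over several cobbled-together APIs more concrete, here is a small sketch. Everything in it is hypothetical: the Component record, the SouthboundSource interface and the StaticSource stand-in return canned data, whereas a real backend would query IPMI, SNMP or a vendor API.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Component:
    rack: str
    name: str
    kind: str    # e.g. "cpu", "fan", "nic"
    source: str  # which backend reported it


class SouthboundSource(Protocol):
    def enumerate(self, rack: str) -> Iterable[Component]: ...


class StaticSource:
    """Stand-in backend with canned data; a real one would query hardware."""

    def __init__(self, label, parts):
        self.label = label
        self.parts = parts  # {rack: [(name, kind), ...]}

    def enumerate(self, rack):
        return [Component(rack, n, k, self.label)
                for n, k in self.parts.get(rack, [])]


def rack_inventory(rack, sources):
    """Merge every backend's view of a rack into one sorted component list."""
    found = {}
    for src in sources:
        for c in src.enumerate(rack):
            found.setdefault(c.name, c)  # first backend to report a part wins
    return sorted(found.values(), key=lambda c: c.name)
```

A caller would pass the rack identifier and the list of backends, for example rack_inventory("r1", [ipmi, snmp]), and receive one de-duplicated list regardless of which protocol reported each part.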

Look North of the Rack, Too

Granular southbound resource information is one thing, but without a northbound component to help put the pieces together, it can turn into a firehose of information out of context. Resources in a disaggregated data center do not exist in a vacuum; workloads, environmental characteristics, and cross-rack and cross-data center factors all come into play. A good way of looking north of the rack is to consider how to package and aggregate southbound information into a shape that is more easily consumable and actionable up the chain. However, let's not confuse this with monitoring and automation. Looking northbound really means determining what is needed to manage the resources and how to get that information there.
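As one possible shape for that packaging step, the sketch below rolls per-component temperature readings up into one summary per rack. The field names (rack, component, temp_c) and the summarize_by_rack function are assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_rack(readings):
    """Roll per-component temperature readings up to one summary per rack."""
    by_rack = defaultdict(list)
    for r in readings:
        by_rack[r["rack"]].append(r)

    summary = {}
    for rack, items in by_rack.items():
        hottest = max(items, key=lambda r: r["temp_c"])
        summary[rack] = {
            "components": len(items),
            "mean_temp_c": mean(r["temp_c"] for r in items),
            "hottest": hottest["component"],
        }
    return summary
```

A consumer further up the chain then sees a handful of rack-level figures rather than every raw reading, and can still drill down to the hottest component when needed.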


Monitor Everything

Most data centers involve varying degrees of automated monitoring; having a technician walk through the aisles with a clipboard just doesn't scale. In a disaggregated data center, it is critical to monitor every aspect of every resource: sensorification (which is surprisingly inexpensive), device-specific data points (which we get for free via device and OS APIs) and broader environmental characteristics (e.g., building management system and sensor data). The more that is monitored, the better the picture that can be drawn of the overall state of the data center, from heat maps to resource utilization mapping, customer billing and failure postmortems. As more is monitored in a disaggregated data center, better, cheaper and more resilient services may be built.

Automate Everything

Partial data center automation is not uncommon. Tools like Puppet, Chef and Ansible have removed the pain and manual labor from part of the equation, but there is always room to take more of the costly and error-prone human judgment out of the decision-making process. In the disaggregated data center, with the capabilities and principles defined in previous sections, it should be possible to automate everything from workload migration to environmental and building controls based on operational insight and measurements superior to the traditional, but useful, PUE (power usage effectiveness) metric, which equals total facility power divided by IT equipment power.
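For reference, PUE is conventionally defined as total facility power divided by the power delivered to IT equipment, so a value of 1.0 would mean no overhead for cooling, power distribution and the like. A minimal worked example follows; the pue function name and the sample figures are illustrative only.

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power usage effectiveness: total facility power over IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


# Illustrative numbers: a facility drawing 1,500 kW in total,
# of which 1,000 kW reaches the IT equipment.
example = pue(1500, 1000)  # 1.5
```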

Intelligent Metric-Driven Decision Making

In a modern disaggregated data center, it becomes possible to focus on metric-driven decisions. When you have well-built services that expect failure and can be easily migrated, running on hardware and in an environment that is heavily instrumented and easily orchestrated, it becomes possible to assert and drive decisions around metrics such as performance per watt per dollar. In the disaggregated data center, data comes from every aspect of the data center, and based on that data, intelligent decisions can be made about how resources are used and managed.
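A small sketch of what such a decision might look like: given candidate resource pools, choose the one with the best performance per watt per dollar. The pool fields (ops_per_sec, watts, dollars_per_hour), the function names and the sample figures are assumptions for illustration, and a real placement decision would likely weigh other factors as well.

```python
def perf_per_watt_per_dollar(ops_per_sec, watts, dollars_per_hour):
    """Throughput normalized by both power draw and hourly cost."""
    return ops_per_sec / watts / dollars_per_hour


def best_pool(pools):
    """Return the name of the pool with the highest normalized throughput."""
    def score(pool):
        return perf_per_watt_per_dollar(
            pool["ops_per_sec"], pool["watts"], pool["dollars_per_hour"]
        )
    return max(pools, key=score)["name"]


# Made-up candidates for illustration.
pools = [
    {"name": "a", "ops_per_sec": 1000, "watts": 200, "dollars_per_hour": 0.50},
    {"name": "b", "ops_per_sec": 1500, "watts": 400, "dollars_per_hour": 0.25},
]
choice = best_pool(pools)  # "b": 1500/400/0.25 = 15 beats 1000/200/0.50 = 10
```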