End-to-End Diagnostic Discovery Carnegie Mellon

Introduction

Despite 40 years of incredible advances in computing technology, systems still fail. For all the great strides in software development languages, software engineering techniques, hardware engineering, fabrication, and quality management methodologies, errors remain inevitable in the systems we deploy. As the demands on computing systems increase and the complexity among interdependent components grows, problems in production environments will increasingly result from unexpected interactions between software components from a range of infrastructure layers. These components were developed separately, at different times, and by different people, mostly with little knowledge of the requirements now placed on them. This combination of conditions leaves the modern IT shop in the very difficult position of having precious little information about failures when they occur, and little to go on in trying to avoid the next failure or limit its potential impact.

The End-to-end Diagnostics DiscoverY (EDDY) effort at Carnegie Mellon has created an infrastructure to bring more data to this problem, allowing information about the operation of disparate components in the computing infrastructure to be brought together for analysis, research, and audit. This enables a system manager to pinpoint problems more easily as they occur, and allows autonomic processes to assist in prediction, management, and maintenance.

Problem Definition

The core problem is that systems are generally engineered to take full advantage of their own correctness, not to guard against potential mistakes. This is, of course, a perfectly reasonable, pragmatic, and economical approach. It takes far more effort for any system to report what it sees and how it reacts to what it sees (let alone to build the second system that figures out what it all means), and most physical systems are predictable enough that this level of reporting isn’t needed. If the tire on your bicycle goes flat, you generally see it pretty quickly, maybe wonder what you ran over, buy a new one, and move on. If it happens again a week later, you may start wondering if something else is going on… Is there something in the driveway? Do I have the wrong valve stem? Is the neighbor kid busting my chops? Luckily, there are rather few things that can go wrong with your bicycle tire, and the effects of a failure are generally not too serious. If it’s the tire on your car and it goes flat while you’re speeding down the highway, the consequences may be drastically different.

Many software systems are as low risk as the bike tire: if you try a Google search and find a reference to a site that no longer exists, it is no big deal, and you move on to the next result. In other environments, however, it is critically important that failures be detected as soon as possible so that catastrophic conditions are avoided entirely. Many software systems are built with failsafe constructs designed in from the beginning, but these are very expensive, and they require that the system be constructed from known components with end-to-end failsafe reporting.

Unfortunately, in today’s computing infrastructure and Internet technology there is little middle ground between full safety and no safety. Open systems and pluggable components are powerful techniques for incremental improvement and collaborative development, but they are all built to trust their own correctness and generally report nothing of their own behavior, so they become common vectors for error propagation with precious little information about where the trouble started.

The situation can be described from several angles:

  - The user gets confused by cryptic errors in their applications, and
  - the systems manager has very little data available to diagnose the problem, even though
  - the software developers could generate more log information, but they’re not sure what would be useful, and
  - the software developers’ product manager forces them to squeeze in that last feature, because
  - the customer is buying the software based mostly on the functionality of the component, trusting that the product ‘mostly works’, and
  - even if the software has good log information, it’s incredibly difficult for the systems manager to bring that data together or to correlate it with different components to resolve problems across the infrastructure.

The EDDY project sets out to break this logjam.

Concepts and Methodology

EDDY is intended to serve as the lingua franca for effective exchange, management, and correlation of log and event information in an infrastructure, between dependent layers and among interdependent components; within an enterprise and across collaborating federations. It defines a common form for encapsulation and a method for efficient transport of native event information from sensors to data managers to analyzers. EDDY allows for translation, selection, and projection to orchestrate the flood of diagnostic data and bring it to bear on the tasks of management and problem diagnosis in the infrastructure.
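To make the ideas of encapsulation, translation, selection, and projection more concrete, here is a minimal Python sketch. It is an illustration only and does not represent the actual EDDY format or interfaces: the `Envelope` class, its field names, the helper functions, and the sample events are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Envelope:
    """Hypothetical common wrapper around one native event (not the EDDY format)."""
    source: str       # component that emitted the event
    domain: str       # diagnostic domain, e.g. "application" or "network"
    timestamp: float  # seconds since the epoch
    native: str       # the original event text, carried unmodified
    fields: Dict[str, str] = field(default_factory=dict)  # filled in by a translator


def translate(env: Envelope, parser: Callable[[str], Dict[str, str]]) -> Envelope:
    """Translation: derive structured fields from the native payload.

    Returns a new envelope; the native text is preserved as-is.
    """
    return replace(env, fields={**env.fields, **parser(env.native)})


def select(events: Iterable[Envelope],
           predicate: Callable[[Envelope], bool]) -> Iterator[Envelope]:
    """Selection: keep only the events that satisfy the predicate."""
    return (e for e in events if predicate(e))


def project(events: Iterable[Envelope], keys: List[str]) -> Iterator[Dict[str, object]]:
    """Projection: keep only the named fields of each event."""
    for e in events:
        yield {k: e.fields.get(k) for k in keys}


def parse_key_values(text: str) -> Dict[str, str]:
    """Toy parser for 'key=value' tokens; real native formats vary widely."""
    return dict(tok.split("=", 1) for tok in text.split() if "=" in tok)


# Made-up events from two different domains, wrapped in the same envelope.
raw = [
    Envelope("web01", "application", 1000.0, "status=500 path=/cart user=alice"),
    Envelope("fw01", "network", 1000.5, "action=drop src=10.0.0.7"),
    Envelope("web01", "application", 1002.0, "status=200 path=/home user=bob"),
]

translated = [translate(e, parse_key_values) for e in raw]
errors = select(translated, lambda e: e.fields.get("status") == "500")
rows = list(project(errors, ["path", "user"]))
print(rows)  # [{'path': '/cart', 'user': 'alice'}]
```

The point of the sketch is that the native event travels untouched inside a uniform wrapper, so events from unrelated domains can pass through the same selection and projection steps before reaching an analyzer.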

The EDDY team is not the first group to appreciate the value of audit logs and diagnostic telemetry information. However, where others have approached the problem from one application or diagnostic domain, we are approaching it from no domain, and all domains. In previous approaches, the primary goal was to gather information to perform domain-specific analytics – for example, collecting security logs to craft a combined security view. Collecting the data was a means to another end. For EDDY, the gathering of data is an end in itself. It is an important end that can effectively serve the original function of domain-based analytics, but more importantly, it opens a completely new and incredibly rich possibility: correlating data across multiple domains and inventing techniques that are completely agnostic of domain, and that therefore apply to any and all domains. It is exactly this style of uniform data collection and dynamic analytics that becomes the platform to break the logjam described in the previous section.

The EDDY design is informed by the following world view:

  1. Demand for end-user functionality drives innovation.
  2. Similar functionality in multiple systems leads to commonality in application designs and constructs.
  3. Common designs lead to shared components, which can then be realized as shared infrastructure. Competitors with common designs will decide to share infrastructure only when the apparent value of sharing far exceeds the perceived value of not sharing.
  4. The value to be gained in sharing log or event information far exceeds the value of keeping it separate, and there is significant potential in this approach. Software developers will have better information about the environment in which their components are operating, and will be able to leverage more effective feedback to design improved interfaces in the future. Product managers will have better information about how their products are working and interacting in their installed environments, and can build new analytics to help customers manage the product. Enterprise customers will be able to specify that the products they purchase have monitoring interfaces to participate in the new diagnostic infrastructures they can now construct to more effectively manage their business flow from end to end. Users will be able to get better, more constructive support when they encounter troubles in their online environments.
  5. EDDY is the first system to enable the unification of data formats from multiple diagnostic domains so a new class of tools can be leveraged.

The EDDY architecture is grounded in the following assumptions:

  1. Performance matters and pragmatics are important – there is no end to the monitoring one might do and no limit to the variations that different sites will choose. A system must be simple enough to be usable by a small installation with only a few focused events per minute, yet flexible and fast enough to accommodate a large installation with millions of distributed events per second.
  2. It must be possible to experiment and grow into the use of a unified diagnostic data management system without mandating a ‘cutover’ day, that is, without immediate or catastrophic change to existing diagnostic infrastructures and techniques.
  3. Leverage domain expertise, don’t reinvent the wheel – the first goal of EDDY is to provide a data orchestration function that can be leveraged by existing analytic techniques, but the end goal is to enable those techniques to be composed, modified, and extended to consider other information that hadn’t previously been available to the analysis.
  4. Models may be pretty but there has to be code – We believe that our approach makes sense, and the ideas have resonated with many, but there is much to learn and no way to ‘get it exactly right’. There are many options for creating an interoperable diagnostic infrastructure. We intend to build, learn, and rebuild, focusing on the standardization and reference implementation of key formats and interfaces to enable interoperation. We intend to involve many others in an open process, building a community to guide and take part in the evolution of such an essential, common, but advanced infrastructure.

(c) 2003-2012 Carnegie Mellon. All Rights Reserved.