Despite 40 years of incredible advances in computing technology, systems still fail. For all the great strides in software development languages, software engineering techniques, hardware engineering, fabrication, and quality management methodologies, errors remain inevitable in the systems we deploy. As the demands on computing systems continue to increase and the complexity among interdependent components continues to grow, problems in our production environments will increasingly result from unexpected interactions between software components from a range of infrastructure layers. These components were developed separately, at different times, and by different people, mostly with little knowledge of the requirements now placed on them. This combination of conditions leaves the modern IT shop in the very difficult position of having precious little information about failures when they occur, and little to go on in trying to avoid the next failure or limit its potential impact.
The End-to-end Diagnostics DiscoverY (EDDY) effort at Carnegie Mellon has created an infrastructure to bring more data to this problem, allowing information about the operation of disparate components in the computing infrastructure to be brought together for analysis, research, and audit. This enables a system manager to more easily pinpoint problems as they occur, and allows autonomic processes to assist in prediction, management, and maintenance.
The core problem is that systems are generally engineered to take full advantage of their own correctness, not to guard against potential mistakes. This is, of course, a perfectly reasonable, pragmatic, and economical approach. It takes far more effort on the part of any system to report what it sees and how it reacts to what it sees (let alone to build the second system that figures out what it all means), and most physical systems are predictable enough that this level of reporting isn’t needed. If the tire on your bicycle goes flat, you generally see it pretty quickly, maybe wonder what you ran over, buy a new one, and move on. If it happens again a week later, you may start wondering if something else is going on… Is there something in the driveway? Do I have the wrong valve stem? Is the neighbor kid busting my chops? Luckily, there are rather few things that can go wrong with your bicycle tire, and the effects of a failure are generally not too serious. If it’s the tire on your car and it goes flat while you’re speeding down the highway, the consequences may be drastically different.
Many software systems are as low risk as the bike tire: if you try a Google search and find a reference to a site that no longer exists, it is no big deal, and you move on to the next. In other environments, however, it is critically important that failures be detected as soon as possible so that catastrophic conditions are avoided entirely. Many software systems are built with failsafe constructs designed in from the beginning, but these are very expensive, and they require that the system be constructed from known components with end-to-end failsafe reporting.
Unfortunately, in today’s computing infrastructure and Internet technology there is little middle ground between full safety and no safety. Open systems and pluggable components are powerful techniques for incremental improvement and collaborative development, but they are all built to trust their own correctness and generally report nothing of their own behavior, so they become common vectors for error propagation with precious little information about where the trouble started.
The situation can be described from several angles:

- The user gets confused by cryptic errors in their applications, and
- the systems manager has very little data available to diagnose the problem, even though
- the software developers could generate more log information, but they’re not sure what would be useful, and
- the product manager forces the developers to squeeze in that last feature, because
- the customer is buying the software based mostly on the functionality of the component, trusting that the product ‘mostly works’, and
- even if the software has good log information, it’s incredibly difficult for the systems manager to bring that data together or to correlate it across different components to resolve problems across the infrastructure.

The EDDY project breaks the logjam in this situation.
EDDY is intended to serve as the lingua franca for effective exchange, management, and correlation of log and event information in an infrastructure, between dependent layers and among interdependent components; within an enterprise and across collaborating federations. It defines a common form for encapsulation and a method for efficient transport of native event information from sensors to data managers to analyzers. EDDY allows for translation, selection, and projection to orchestrate the flood of diagnostic data and bring it to bear on the tasks of management and problem diagnosis in the infrastructure.
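To make the ideas of encapsulation, translation, selection, and projection concrete, here is a minimal Python sketch. It is an illustration only, not EDDY's actual format or API: the `Envelope` class, its field names, the simplified `METHOD PATH STATUS` record layout, and the helper functions are all hypothetical, chosen just to show how a native record might be wrapped in a common form and then filtered and trimmed on its way to an analyzer.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Envelope:
    """Hypothetical common wrapper around one native event (not EDDY's real format)."""
    source: str       # component that emitted the event (assumed label)
    timestamp: float  # seconds since the epoch, as assigned by the sensor
    fields: dict      # translated key/value view of the native record
    native: str       # the original record, carried along unchanged


def translate_web_record(line: str, timestamp: float) -> Envelope:
    """Translation: map a simplified 'METHOD PATH STATUS' record into the common form."""
    method, path, status = line.split()
    fields = {"method": method, "path": path, "status": int(status)}
    return Envelope("webserver", timestamp, fields, line)


def select(events: Iterable[Envelope],
           predicate: Callable[[Envelope], bool]) -> Iterator[Envelope]:
    """Selection: keep only the events an analyzer cares about."""
    return (e for e in events if predicate(e))


def project(events: Iterable[Envelope], keys: list) -> Iterator[dict]:
    """Projection: keep only the named fields of each event."""
    for e in events:
        row = {"source": e.source, "timestamp": e.timestamp}
        row.update({k: e.fields[k] for k in keys if k in e.fields})
        yield row


# Made-up sample records standing in for a sensor's native output.
raw = [
    ("GET /index.html 200", 100.0),
    ("GET /missing 404", 101.5),
    ("POST /login 500", 103.0),
]
events = [translate_web_record(line, ts) for line, ts in raw]
errors = select(events, lambda e: e.fields["status"] >= 400)
summary = list(project(errors, ["path", "status"]))
print(summary)
```

Keeping the untouched native record alongside the translated fields is one possible design choice: it lets a downstream analyzer fall back to the original data when the translation did not capture something it needs.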
The EDDY team is not the first group to appreciate the value of audit logs and diagnostic telemetry information. However, where others have approached the problem from one application or diagnostic domain, we are approaching it from no domain, and all domains. In previous approaches, the primary goal was to gather information for domain-specific analytics, for example collecting security logs to craft a combined security view. Collecting the data was a means to another end. For EDDY, the gathering of data is an end in itself. It is an important end that can still serve the original function of domain-based analytics, but more importantly, it opens a completely new and incredibly rich possibility: correlating data across multiple domains, and inventing techniques that are completely agnostic of domain and therefore apply to any and all domains. It is exactly this style of uniform data collection and dynamic analytics that becomes the platform to break the logjam described in the previous section.
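As a rough illustration of what domain-agnostic correlation could look like, the Python sketch below groups events from unrelated sources into fixed time windows and reports only the windows in which more than one source was active. This is a deliberately simple stand-in technique, not the analysis EDDY performs: the function name, the `(source, timestamp, message)` tuple layout, the 30-second window, and the sample events are all assumptions made for the example.

```python
from collections import defaultdict


def correlate_by_window(events, window_seconds):
    """Bucket events into fixed time windows, ignoring what domain they come from.

    Returns only the windows that contain events from more than one source,
    with each window's events sorted by timestamp.
    """
    buckets = defaultdict(list)
    for source, timestamp, message in events:
        window = int(timestamp // window_seconds)
        buckets[window].append((source, timestamp, message))
    return {
        window: sorted(group, key=lambda event: event[1])
        for window, group in buckets.items()
        if len({source for source, _, _ in group}) > 1
    }


# Made-up events from three different domains.
events = [
    ("webserver", 12.4, "HTTP 500 on /login"),
    ("network", 10.2, "link flap on port 7"),
    ("database", 11.0, "connection reset"),
    ("webserver", 95.0, "HTTP 200 on /index.html"),
]
related = correlate_by_window(events, window_seconds=30)
for window, group in related.items():
    print(window, [source for source, _, _ in group])
```

Nothing in the function depends on what the events mean, which is the point of the sketch. A fixed window is crude, however: two related events that fall on either side of a window boundary end up in different buckets, so a real analyzer would likely need something more forgiving, such as sliding windows.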
The EDDY design is informed by the following world view:
The EDDY architecture is grounded in the following assumptions:
(c) 2003-2012 Carnegie Mellon. All Rights Reserved.