End-to-End Diagnostic Discovery Carnegie Mellon

Architecture

The EDDY architecture was designed to be fundamentally simple, efficient, and highly extensible. We started from the assumption that there was much we didn’t know, and that any initial design was likely to require several iterations before the components of value became apparent. In that spirit, we started from a small kernel of functionality that satisfied a few simple requirements and constraints.

Event Definition and Basics

The following describes the motivation for, and the elements and format of, the EDDY Common Event Record (CER), the basic data construct of the EDDY diagnostic infrastructure.

  1. Minimal common event elements – To maintain efficiency and maximize generality, we felt that only a few elements were essential.
  2. A few extra common elements – These allow immediate experimentation with generic analysis and correlation across multiple sources and event types.
  3. Encapsulation of the transported event – No constraints are placed on the domain-based definition of events.
  4. Event representation – Variations in the representation and exposure of key event elements can allow for pipeline processing and real-time insights not previously possible.
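The following minimal Python sketch shows one way the four elements above could map onto a single record. It is an illustration only. The common field names (`time`, `source`, `event_type`), the `extras` dictionary used for the "extra" common elements, and the XML element names are hypothetical, not the EDDY CER schema. XML is used because the Performance Considerations section states a preference for it. The domain event is carried as an opaque `payload` string, so the record places no constraints on how a source defines its events.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class CommonEventRecord:
    """Hypothetical CER: a few common elements plus an opaque payload."""
    # Minimal common elements (names are illustrative assumptions).
    time: str
    source: str
    event_type: str
    # A few extra common elements to support cross-source correlation.
    extras: dict = field(default_factory=dict)
    # Encapsulated domain event, carried without interpretation.
    payload: str = ""

    def to_xml(self) -> str:
        """Serialize to a hypothetical <cer> XML element."""
        root = ET.Element("cer")
        for name in ("time", "source", "event_type"):
            ET.SubElement(root, name).text = getattr(self, name)
        for key, value in self.extras.items():
            # Extra elements become <extra name="...">value</extra>.
            ET.SubElement(root, "extra", name=key).text = value
        ET.SubElement(root, "payload").text = self.payload
        return ET.tostring(root, encoding="unicode")

    @classmethod
    def from_xml(cls, text: str) -> "CommonEventRecord":
        """Parse the XML produced by to_xml back into a record."""
        root = ET.fromstring(text)
        extras = {e.get("name"): e.text or "" for e in root.findall("extra")}
        return cls(
            time=root.findtext("time", ""),
            source=root.findtext("source", ""),
            event_type=root.findtext("event_type", ""),
            extras=extras,
            payload=root.findtext("payload", ""),
        )
```

Keeping the payload opaque means that an agent which only needs the common elements never has to parse the domain event.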

Transport Services

The following provides a summary description of the elements and form of the EDDY transport architecture, the basic method for moving Common Event Records.

  1. Event Channel - We chose to emulate a UNIX pipeline model as the fundamental mechanism for transporting events from source, through selection and translation, to analysis or storage. This data-driven model affords some unique and desirable capabilities, but it is relatively uncommon in diagnostic practice today. There are several key benefits we immediately gain through a data-driven approach:
  2. Query Channel - a process-driven model for event processing
    This is the conventional data access approach, in which a diagnostic investigator writes queries against a data store to acquire information and then processes it. This is an essential approach for diagnostic discovery and analysis: it enables a data mining approach to validating new questions, and perhaps the conversion of some of those questions into data-driven methods for real-time detection.

    There are two basic approaches to constructing a rich query capability:

Both the Event Channel and the Query Channel are essential elements of the EDDY architecture. Each has its strengths and weaknesses, but enabling the use of either as appropriate maximizes interaction with, and the potential leverage of, the data at hand.
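The following minimal Python sketch makes the contrast between the two channels concrete. It assumes events have already been parsed into dictionaries with hypothetical `time`, `source`, and `event_type` keys.

- **Event Channel.** Generator stages (`select`, `translate`) are chained the way a shell chains processes, so each event is handled as it flows through.
- **Query Channel.** The same events are loaded into an in-memory SQLite table, which stands in for whatever data store a deployment actually uses. Questions are then asked after the fact.

```python
import sqlite3
from typing import Callable, Iterable, Iterator

Event = dict  # stand-in for a parsed CER (field names are hypothetical)


# --- Event Channel: data-driven, like a UNIX pipeline ---
def select(events: Iterable[Event], pred: Callable[[Event], bool]) -> Iterator[Event]:
    """Pass through only the events the predicate accepts (like grep)."""
    return (e for e in events if pred(e))


def translate(events: Iterable[Event], fn: Callable[[Event], Event]) -> Iterator[Event]:
    """Rewrite each event as it flows past (like sed)."""
    return (fn(e) for e in events)


# --- Query Channel: process-driven, queries against a store ---
def store(events: Iterable[Event]) -> sqlite3.Connection:
    """Load events into an in-memory table that an investigator can query."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE cer (time TEXT, source TEXT, event_type TEXT)")
    db.executemany("INSERT INTO cer VALUES (:time, :source, :event_type)", events)
    return db
```

For example, `translate(select(events, lambda e: e["event_type"] == "deny"), ...)` yields results as each event arrives. By contrast, `store(events).execute("SELECT COUNT(*) FROM cer WHERE source = ?", ("fw01",))` answers a question only when an investigator asks it.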

Performance Considerations

In the section above on event representation, we expressed our preference for XML and noted that it provided the flexibility we wanted while still allowing for optimization where necessary. In our conversations with practitioners, performance issues have come up again and again, both as a general complication and as a specific barrier to the feasibility of this strategy. We therefore want to say a few words in this architecture section about strategies for high performance (and manageability at any speed). Our aim is to expose some of the known strategies and, we hope, allay some of those concerns. We do not treat these strategies in depth, but instead outline some useful approaches that can be easily accommodated. Throughout our early implementations, our approach has been to demonstrate capability and serve the application requirements first, and to optimize second.

  1. Basic throughput targets across the event channel
    Our initial target rate was 10,000 records per second through the transport system between two modern desktop hosts. This number was chosen because it matched the event rate of a network flow probe on a moderately loaded enterprise backbone. Transaction rates for most other logging systems are generally substantially lower than this, though one can imagine many scenarios in which this rate would be woefully inadequate.
  2. Horizontal scaling
    There are several approaches to horizontal scaling that are likely to help significantly:
  3. Format and Data Optimizations
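The 10,000 records per second target translates into a concrete per-record budget. The Python sketch below does that arithmetic under an assumed average record size. The 500-byte figure used in the note below is purely illustrative, not a measured EDDY value. The sketch also shows how spreading the stream across workers relaxes the budget.

The `partition` helper is one hypothetical way to split the stream. It hashes a source name with CRC-32 so that all records from one source reach the same worker. It is not taken from the EDDY design.

```python
import zlib


def channel_budget(records_per_sec: float, bytes_per_record: int, workers: int = 1) -> dict:
    """Back-of-the-envelope load figures for one event channel.

    bytes_per_record is an assumption supplied by the caller; the
    10,000 records/s target comes from the text.
    """
    per_worker_rate = records_per_sec / workers
    return {
        # Time each worker may spend on one record, in microseconds.
        "us_per_record": 1e6 / per_worker_rate,
        # Aggregate wire bandwidth, in megabits per second.
        "mbit_per_sec": records_per_sec * bytes_per_record * 8 / 1e6,
        # Records each worker must handle per second.
        "records_per_worker": per_worker_rate,
    }


def partition(source: str, workers: int) -> int:
    """Hypothetical router: map a source name to a worker index.

    CRC-32 is stable across runs (unlike Python's salted hash()), so a
    given source always lands on the same worker.
    """
    return zlib.crc32(source.encode("utf-8")) % workers
```

At an assumed 500 bytes per record, `channel_budget(10_000, 500)` gives 40 Mbit/s on the wire and leaves 100 microseconds per record for a single worker. With `workers=4`, each worker handles 2,500 records per second and gets 400 microseconds per record.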

Security Considerations

The security ramifications of collecting and correlating activity data cannot be overstated. In our opinion, we must address these issues up front for three reasons:

- to allow open dialogue about the risks inherent in this kind of activity;
- to weigh the value of new methods against the risk of abuse;
- to openly encourage work that maximizes value and minimizes risk.

Significant, valuable work has already been done in research, but it still needs to find an outlet in deployed systems. By proposing the EDDY framework to enable this analysis, we can also instrument that framework to maximize security and privacy as we enable these new features. It is essential that any design or specific implementation of the EDDY architecture describe the security mechanisms it employs to address these serious concerns.

Agent Definitions

Base Functionality

EDDY agents are the fundamental processing units for CERs. They are built using the EDDY Agent Framework, which provides the following basic functions for all agents:

- transporting CERs between agents across the EDDY backplane;
- converting between the external CER representation and internal variables;
- filter semantics for selecting CERs along the backplane.

The combination of transport and filtering amounts to a routing function for EDDY CERs. The agent architecture was intended to be extremely simple, as well as flexible enough to accommodate a wide variety of diagnostic orchestration scenarios. The following are the major design principles that drove the present agent architecture:

When building a new agent, a developer should base the design philosophy on the following axioms:
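The base functions of the EDDY Agent Framework described at the start of this section are transport, conversion between the external CER form and internal variables, and filtering. The following Python sketch shows them in one small program. It is a toy in-process model under stated assumptions:

- `Backplane`, `Agent`, `decode`, `encode`, `accepts`, and `handle` are hypothetical names, not the framework's API.
- Records are assumed to be flat XML elements.

The `publish` method shows how transport plus filtering amounts to routing: a record is delivered only to agents whose filter accepts it.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List


class Agent:
    """Hypothetical agent: conversion and filtering come from the 'framework'."""

    def __init__(self, name: str, filter_fn: Callable[[Dict[str, str]], bool]) -> None:
        self.name = name
        self.filter_fn = filter_fn
        self.received: List[Dict[str, str]] = []

    @staticmethod
    def decode(cer_xml: str) -> Dict[str, str]:
        # External CER representation -> internal variables.
        return {child.tag: child.text or "" for child in ET.fromstring(cer_xml)}

    @staticmethod
    def encode(fields: Dict[str, str]) -> str:
        # Internal variables -> external CER representation.
        root = ET.Element("cer")
        for tag, text in fields.items():
            ET.SubElement(root, tag).text = text
        return ET.tostring(root, encoding="unicode")

    def accepts(self, fields: Dict[str, str]) -> bool:
        # Filter semantics: decide whether this agent wants the record.
        return self.filter_fn(fields)

    def handle(self, fields: Dict[str, str]) -> None:
        # Agent-specific processing goes here; this toy agent just keeps it.
        self.received.append(fields)


class Backplane:
    """Hypothetical in-process stand-in for the EDDY backplane."""

    def __init__(self) -> None:
        self.agents: List[Agent] = []

    def attach(self, agent: Agent) -> None:
        self.agents.append(agent)

    def publish(self, cer_xml: str) -> None:
        # Transport + filtering = routing: deliver only where accepted.
        for agent in self.agents:
            fields = agent.decode(cer_xml)
            if agent.accepts(fields):
                agent.handle(fields)
```

A real backplane would move records between processes and hosts. The point here is only that a new agent supplies a filter and a handler, and inherits the rest.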

Agent Classes

The advantage of an agent-based design is the ability to combine agents to extract and inject new types of events that represent higher-order events. When building an EDDY backplane from agents, it becomes apparent that groups of agents perform very similar functions. The following describes these agents, grouped by area of functionality.
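The following Python sketch illustrates combining agents. Generators stand in for agents connected over the backplane:

- `deny_filter` is a hypothetical selection agent.
- `repeat_detector` is a hypothetical correlation agent. Once one source has produced `threshold` matching events, it injects a new, higher-order `repeated_deny` event.

The event names and the threshold rule are assumptions chosen for the example, not EDDY-defined events.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

Event = Dict[str, str]  # stand-in for a parsed CER (hypothetical fields)


def deny_filter(events: Iterable[Event]) -> Iterator[Event]:
    """Selection agent: pass along only 'deny' events."""
    for e in events:
        if e.get("event_type") == "deny":
            yield e


def repeat_detector(events: Iterable[Event], threshold: int) -> Iterator[Event]:
    """Correlation agent: emit one higher-order event per source when
    that source's count of incoming events reaches the threshold."""
    counts: Counter = Counter()
    for e in events:
        counts[e["source"]] += 1
        if counts[e["source"]] == threshold:
            yield {
                "source": e["source"],
                "event_type": "repeated_deny",
                "count": str(threshold),
            }
```

Neither agent needs to know about the other. The higher-order event emerges only from how they are connected, for example `repeat_detector(deny_filter(stream), 3)`.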

Agent Management

All agents are managed via an agent manager process, whose functionality consists of starting, stopping and providing the status of each agent. Multiple agents can exist on one host or they can be highly distributed throughout the backplane. The EDDY agent framework provides the following functionality for all agents.
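The following Python sketch models the agent manager's three duties (starting, stopping, and reporting the status of each agent). It also records which host each agent lives on, since agents may be spread across the backplane. `AgentManager`, `register`, and the `state@host` status format are hypothetical. The sketch tracks state only; it does not launch or signal real processes, which an actual manager would have to do.

```python
from enum import Enum
from typing import Dict, Tuple


class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"


class AgentManager:
    """Hypothetical agent manager: start, stop, and report agent status."""

    def __init__(self) -> None:
        # Agent name -> (host, state); hosts may differ across the backplane.
        self._agents: Dict[str, Tuple[str, State]] = {}

    def register(self, name: str, host: str) -> None:
        self._agents[name] = (host, State.STOPPED)

    def start(self, name: str) -> None:
        host, _ = self._agents[name]  # KeyError if the agent is unknown
        self._agents[name] = (host, State.RUNNING)

    def stop(self, name: str) -> None:
        host, _ = self._agents[name]
        self._agents[name] = (host, State.STOPPED)

    def status(self) -> Dict[str, str]:
        """Report every agent as 'state@host'."""
        return {name: f"{state.value}@{host}"
                for name, (host, state) in self._agents.items()}
```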

(c) 2003-2012 Carnegie Mellon. All Rights Reserved.