The EDDY architecture was designed fundamentally to be simple, efficient,
and highly extensible. We started from the assumption that there was much
we didn’t know, and any initial design was likely to require several
iterations before the components of value would become apparent. In that spirit
then, we started from a small kernel of functionality to satisfy some simple
requirements and constraints.
The following provides the basic motivation and description of the elements
and format of the EDDY Common Event Record (CER) architecture, the basic data
construct of the EDDY diagnostic infrastructure.
- Minimal common event elements – To maintain efficiency
and maximize generality, we felt that only a few elements were essential:
- Versioning for the common record information
- Unique reference identifier per record
- Timestamp for event creation
- Location where the event was seen
- Location where the event was introduced into the backplane
- Indicator for the type of enclosed event – the type indicator is
an identifier mapped to the content and format of the enclosed event. Event
type associations should be discoverable. In the short term, a simple, central
registration mechanism is almost certainly the most pragmatic.
- A few extra common elements – Allow immediate
experimentation with generic analysis and correlation across multiple sources
and event types.
- Parameter to indicate severity of the event – an easily identifiable
and discoverable parameter to assist in high-level classification of events;
- An explicit, extensible user ‘tag’ attribute – reflecting
the desire to easily leverage (and route) important event semantics
- Encapsulation of the transported event - No constraints
on domain-based definition of events.
- Opaque encapsulation of external schema – many domains have ongoing
efforts to define control and audit information appropriate for their
own components. The basic idea is to capture the external event in its
entirety, attach the required common elements, calculate and attach the
extra common elements, and forward it along, now enabled for normalized
processing.
- Ad hoc schema – in addition to pre-existing events, we anticipate
that a pipelining of events and event processors will give rise to new
event types, born of creative analysis of sets of event flows. Possibilities
include:
- Composing events - merging a set of events, perhaps mapping to summary
information to convey more concise information about the origin events;
for example, a statistical representation of a set of similar events
- Analyzed events – it is expected that certain sequences of
events will be indicative of a condition worthy of reporting as a
separate event (e.g. SYN flood as a DDoS)
- Event representation – variations in representation
and exposure of key event elements can allow for pipeline processing and real-time
insights not previously possible.
- Common event elements: We decided in the beginning to use XML for formatted
representation of the common event elements. This has its tradeoffs, of
course, but we feel that the flexibility afforded by this syntax will
serve us well through the experimentation phase at least, allowing us
to easily experiment and still optimize when needed.
- Encapsulated events
- Raw events – In our discussion of event basics above,
we indicated that we could directly represent the complete (opaque)
syntax of an origin event. We call this form a Raw event, and use
it as the simplest style of leveraging EDDY semantics without sacrificing
any origin semantics.
- Cooked event elements - As we gain experience with EDDY
semantics and pipelining capabilities, it is important to have a mechanism
for an intermediate processing agent to recognize key elements in
an event record without necessarily understanding the full syntax
(or semantics) of the Raw event. To support this partial exposure
of event information, we invented the notion of Cooked elements –
these are an XML representation of specific components from the raw
record, transformed into a tagged syntax to enable generic treatment.
- Common schema elements for generic processing – in considering
the methods used for generic correlation among a variety of event types,
it became clear that some standardization of schema elements would become
valuable. So far, we have focused on two forms of standardization:
- Simple data types (e.g. integer, string) – It is
obvious that using simple data types for event elements will allow
us to perform a great number of generic analyses for substantial reporting
value completely agnostic of event domain. Instance counts and simple
statistical analyses can easily be done and are often the best first
results from these reporting capabilities. For example: how many mail
messages are we sending/receiving; what’s the mean and standard
deviation for sizes of mail messages; with which domains do we exchange
most email.
- Common diagnostic objects (e.g. host, credential use)
– As we learn more, we will discover objects and elements that
are the common subjects and actors for events across a great number
of domains. If we hope to correlate events based on commonality of
subject or actor, it will be essential to share syntax and semantic
descriptions for those objects. We have little preconceived notion
for the extent of this need, and expect there may be several models
for experimentation before we may make significant progress, but as
an example, if we hope to collect the activities related to a given
host over a specific time period, it seems necessary to have a common
syntax for describing that target host (whether name, address, or
some other factors).
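The record layout described above can be sketched in a few lines. This is a minimal illustration, not the EDDY schema: the class name `CommonEventRecord`, every field and XML tag name, and the helper `numeric_summary` are assumptions made for the example. It shows the required common elements, the extra common elements, an opaque raw event, typed cooked elements, and a generic analysis that relies only on the simple data type of a cooked element.

```python
import statistics
import uuid
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CommonEventRecord:
    """Illustrative CER; field and tag names are assumptions, not the EDDY schema."""
    event_type: str   # type indicator for the enclosed event
    seen_at: str      # location where the event was seen
    injected_at: str  # location where the event entered the backplane
    raw: str          # opaque origin event, carried in its entirety
    severity: int = 0                           # extra common element
    tags: list = field(default_factory=list)    # extensible user tags
    cooked: dict = field(default_factory=dict)  # elements exposed from raw
    version: str = "1"                          # version of the common record
    ref_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_xml(self) -> str:
        root = ET.Element("cer", version=self.version, id=self.ref_id)
        ET.SubElement(root, "created").text = self.created
        ET.SubElement(root, "seenAt").text = self.seen_at
        ET.SubElement(root, "injectedAt").text = self.injected_at
        ET.SubElement(root, "type").text = self.event_type
        ET.SubElement(root, "severity").text = str(self.severity)
        for tag in self.tags:
            ET.SubElement(root, "tag").text = tag
        cooked = ET.SubElement(root, "cooked")
        for name, value in self.cooked.items():
            # Record the simple data type so generic agents can act on it.
            elem = ET.SubElement(cooked, "element", name=name,
                                 type=type(value).__name__)
            elem.text = str(value)
        ET.SubElement(root, "raw").text = self.raw  # escaped by ElementTree
        return ET.tostring(root, encoding="unicode")


def numeric_summary(xml_records, name):
    """Mean and standard deviation of one integer cooked element, any event type."""
    values = []
    for text in xml_records:
        for elem in ET.fromstring(text).iter("element"):
            if elem.get("name") == name and elem.get("type") == "int":
                values.append(int(elem.text))
    return statistics.mean(values), statistics.pstdev(values)


# Three hypothetical mail events with a cooked "size" element.
records = [
    CommonEventRecord("sendmail", "mx1.example.org", "normalizer-1",
                      raw="from=<a@example.org>, size=%d" % size,
                      tags=["email"], cooked={"size": size}).to_xml()
    for size in (1000, 2000, 3000)
]
mean_size, stdev_size = numeric_summary(records, "size")
```

Note that `numeric_summary` never parses the raw event: it works on any event type that exposes an integer cooked element with the requested name, which is the point of pairing cooked elements with simple data types.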
The following provides a summary description of the elements and form of the
EDDY transport architecture, the basic method for moving Common Event Records.
- Event Channel - We chose to emulate a UNIX pipeline model
for the fundamental mechanism of transporting events from source through selection
and translation to analysis or storage. This data-driven model affords some
unique and desirable capabilities, but it is a relatively unusual semantic
in diagnostic practice today. There are several key values we immediately
gain through a data-driven approach:
- It enables real-time analysis and detection of predetermined conditions,
arbitrarily shortening the latency between circumstance and detection, allowing
remediation to begin as quickly as possible.
- It allows the deferral of certain sticky issues regarding data discovery,
location, and authorization. If we presume to push the appropriate data to
those with interest and authorization to receive it, we do not have to immediately
design or implement universal data access and protection protocols. This simplifies
the problem, at least in those aspects.
- It enables a very rich and natural set of optimizations to assist with
data management problems in diagnostic infrastructures, not the least of which
is the ability to deal with large streams of situational data without storage,
lookup, and access control structures. There are a great many questions that
can be very effectively and efficiently answered in this way.
- It is entirely compatible and quite complementary with a process-driven
model, allowing both to coexist and leveraging their relative strengths as
appropriate.
The default model for the event channel is to forward the event unmodified
and in its entirety, from source to destination. The EDDY architecture enhances
this basic view by explicitly defining selection and replication capabilities
at each point along the pipeline, and by creating a control method that lets
a downstream component interact with an upstream process to modify
the content of the data stream. We describe these functions only briefly
here and encourage the reader to refer to the Illustrative Example section
below for a better view of how and why these services will be beneficial.
- Selection – it is expected that a site may choose to configure the
diagnostic data flows to feed a variety of real-time and archival functions.
For example, basic network flow information may be captured in its entirety
for network and security management, but those flows related to email transactions
will likely be useful for the email administrator also. From a data-flow perspective
then, it is natural to plan for the flow monitor to forward all flows to the
network and security groups, but select only email-related flows for forwarding
to the email group. This selection capability is therefore fundamental
to any deployment that would leverage common information for use
in disparate diagnostic environments.
- Replication – the previous example also illustrates the need for
inline replication. Where more than one service seeks a particular record,
it is necessary to replicate that record for forwarding downstream.
- Controlling the event pipeline – finally, in a data-driven model,
there are circumstances where a downstream component might benefit from changes
in the characteristics of the data being directed its way. This expectation
of flexibility gave rise to the definition of a Control Channel to allow a
downstream component to request a change in the stream of data it is receiving.
We have little preconceived notion of the summary value of this feature, but
one trivial usage example would be for flow-control where the downstream component
might ask for some reduction in the volume of data to allow it to better manage
its workload.
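The three services just described (selection, replication, and control) can be sketched together. This is a minimal in-memory illustration under stated assumptions: the class `PipelineStage`, its method names, and the flow dictionaries are hypothetical, and a real event channel would forward records over a transport rather than append them to lists.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Selector = Callable[[Event], bool]


class PipelineStage:
    """Hypothetical pipeline point offering selection, replication, and control."""

    def __init__(self) -> None:
        self._selectors: Dict[str, Selector] = {}
        self._delivered: Dict[str, List[Event]] = {}

    def subscribe(self, sink: str, selector: Selector) -> None:
        """Selection: each sink registers the records it wants."""
        self._selectors[sink] = selector
        self._delivered[sink] = []

    def control(self, sink: str, selector: Selector) -> None:
        """Control channel: a downstream sink asks for a different selection."""
        if sink not in self._selectors:
            raise KeyError(sink)
        self._selectors[sink] = selector

    def forward(self, event: Event) -> None:
        # Replication: every sink whose selector matches gets the record.
        for sink, selector in self._selectors.items():
            if selector(event):
                self._delivered[sink].append(event)

    def delivered(self, sink: str) -> List[Event]:
        return list(self._delivered[sink])


monitor = PipelineStage()
monitor.subscribe("network", lambda e: True)
monitor.subscribe("security", lambda e: True)
monitor.subscribe("email", lambda e: e["dst_port"] == 25)

for flow in ({"dst_port": 25, "bytes": 4200},
             {"dst_port": 443, "bytes": 900},
             {"dst_port": 25, "bytes": 120}):
    monitor.forward(flow)

# The email sink asks upstream to reduce volume: only large SMTP flows from now on.
monitor.control("email", lambda e: e["dst_port"] == 25 and e["bytes"] >= 1000)
monitor.forward({"dst_port": 25, "bytes": 50})
```

After these calls the network and security sinks have each received all four flows, while the email sink holds only the two SMTP flows delivered before it tightened its selection.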
- Query Channel - a process-driven model for event processing.
This is the conventional data access approach, where a diagnostic investigator
writes queries against a data store to acquire information and process it.
It is an essential approach for diagnostic discovery and analysis, enabling
a data mining approach to validate new questions, and perhaps enabling the
conversion of some of those to data-driven methods for real-time detection.
There are two basic approaches to constructing a rich query capability:
- Static centralization of data – With this method, you collect the
log data in one (or a few) large, central data stores where subsequent processing
can occur. Characteristics of this approach include:
- [plus] Diagnosticians know what data they have and where it is.
- [plus] Ad hoc queries and historical analyses are easy to do; incremental
refinement and investigative forensics can be done on a known body of
data.
- [minus] The burden and risk of collecting all the data in one place
can be substantial, in terms of data processing load for data insert/access/aging
and policy management load if there is any diversity in access control
or data management policy.
- [minus] Depending on the data storage technology used, it may be relatively
difficult to extend the storage schema to accommodate new data types.
- Dynamic data location – This method addresses some of the difficulties
of the static approach, but brings some complexities also.
- [plus] The burden and risk of data collection can be mitigated significantly
by separating the data collection function – this can assist with
differences in access policy by separating data with different policies;
it can assist with data processing loads of insert, access, and aging
by distributing the load horizontally across multiple servers.
- [minus] the key negative in this approach is that it requires the creation
of some reliable, secure method of describing and discovering the location
of data of interest. There are several options for providing this data
location in the literature, but there are few proven standards and still
much to learn.
The EDDY architecture does not dictate either approach in this area. They
are essentially two options along the same path. It seems sensible to
start with the former because it is easier and faster in providing functional
contact with the data. The second seems like a sensible enhancement: it
requires more work but affords significant flexibility and carries less
risk. From a query point of view, the two are essentially equivalent in
functional capabilities.
Both the Event Channel and the Query Channel are essential elements of the
EDDY architecture. Each has its strengths and weaknesses, but enabling the use
of either as appropriate maximizes the interaction and potential leverage of
the data at hand.
In the section above on event representation, we expressed our preference for
XML and offered that it provided both the flexibility we wanted and allowed
for optimization where necessary. In the conversations with practitioners through
our work, performance issues have come up again and again, both as a general
complication and as a specific barrier to feasibility of this strategy. To that
end, we wanted to say a few words in this architecture section about strategies
for high performance (and manageability at any speed) to expose some of the
known strategies and hopefully allay some of those concerns. We will not treat
these strategies in depth, but will outline some useful approaches
that can be easily accommodated. The approach through our early implementations
has been to demonstrate capability and serve the application requirements first,
and optimize second.
- Basic throughput targets across the event channel
Our initial target rate was 10,000 records per second between two modern desktop
hosts through the transport system. This number was chosen because it mapped
to the event rate for a network flow probe on a moderately loaded enterprise
backbone. Transaction rates for most other logging systems are generally substantially
less than this, though one can imagine many scenarios where this would be
woefully inadequate.
- Horizontal scaling
There are several approaches to horizontal scaling that are likely to help
significantly:
- Selective projection – A default situation for forensic analytics
of log data is to capture the whole log file (or stream) and process the components
of interest. EDDY allows an analyst to provide a filter to the upstream component
so it can limit the content of the incoming stream to only contain records
of interest. This delivers the possibility of substantially reducing the total
bandwidth and processing requirements of the system.
- Partial stream commonality – EDDY is built on the assumption of extended
utility of local data sets. If several analysts are interested in seeing the
same data set, duplicating feed processors can be an expensive, wasteful proposition.
Using EDDY, you could use a single probe and replicate the feed easily. In
addition, if different analysts are interested in slightly different, overlapping
slices of data or attributes, a stream of processors can be constructed to
minimize the data movement and translation required to produce the desired
result.
- Parallel striping – The two previous performance features can be
combined to enable striped processing of an input feed where that may be appropriate.
The write filter feature could be used to stripe a stream of events to a set
of event processors; if appropriate, the stream could then be rejoined
after the expensive filter service has been performed.
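Parallel striping can be illustrated with a short sketch. It assumes round-robin striping by sequence number and a thread pool standing in for a set of event processors; the function names `stripe` and `process_striped` are hypothetical, and a real deployment would stripe across separate agents or hosts rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def stripe(events, n):
    """Deal events round-robin into n stripes, tagging each with its sequence number."""
    stripes = [[] for _ in range(n)]
    for seq, event in enumerate(events):
        stripes[seq % n].append((seq, event))
    return stripes


def process_striped(events, expensive, n=4):
    """Run `expensive` over n stripes in parallel, then rejoin in original order."""
    def run(one_stripe):
        return [(seq, expensive(event)) for seq, event in one_stripe]

    with ThreadPoolExecutor(max_workers=n) as pool:
        parts = list(pool.map(run, stripe(events, n)))

    # Rejoin: merge the stripes and restore the original sequence.
    merged = [pair for part in parts for pair in part]
    merged.sort(key=lambda pair: pair[0])
    return [result for _, result in merged]


squared = process_striped(range(10), lambda x: x * x, n=3)
```

Carrying the sequence number with each record is what makes the rejoin step possible; if downstream consumers do not care about order, the sort can be dropped.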
- Format and Data Optimizations
- Compression – XML’s expressiveness tends to lead to rather long
variable names and attribute representations. If one considers an EDDY installation
as a closed system with defined boundaries, it is easy to presume that one
might choose to use shortcut representations within the closed system and
expand only on export – this has great potential to reduce the number
of ASCII bytes that have to be forwarded.
- Factoring – The EDDY event channel is created through construction
of individual flow streams of one or more EDDY sources to one or more EDDY
sinks. Each flow contains Event Records with a substantial amount of redundant
information. It is easy to imagine an optimization where at least some of
that redundant information is factored into state information maintained between
the source and sink in the flow. Again, this information would have to be
restored if/as the records are exported to elements without the shared state.
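Both format optimizations can be sketched briefly. The tag names, the alias table, and the functions `rename_tags`, `factor`, and `restore` are assumptions made for illustration; the sketch also assumes that short aliases never collide with real tag names and that every record in a flow carries all of the shared keys.

```python
import xml.etree.ElementTree as ET

# Hypothetical long-to-short tag aliases agreed on inside the closed system.
LONG_TO_SHORT = {"CommonEventRecord": "c", "timestamp": "t",
                 "observationPoint": "o", "eventType": "y"}
SHORT_TO_LONG = {short: long for long, short in LONG_TO_SHORT.items()}


def rename_tags(xml_text, mapping):
    """Rewrite element names through mapping; unknown names pass through."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = mapping.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


def factor(records, shared_keys):
    """Split a flow into one shared header plus per-record remainders."""
    header = {key: records[0][key] for key in shared_keys}
    bodies = [{k: v for k, v in rec.items()
               if k not in header or header[k] != v}
              for rec in records]
    return header, bodies


def restore(header, bodies):
    """Rebuild full records before export to elements without the shared state."""
    return [{**header, **body} for body in bodies]


# Compression: short names inside the system, expanded again on export.
long_xml = ("<CommonEventRecord><timestamp>2008-01-01T00:00:00Z</timestamp>"
            "<observationPoint>probe-1</observationPoint>"
            "<eventType>netflow</eventType></CommonEventRecord>")
short_xml = rename_tags(long_xml, LONG_TO_SHORT)

# Factoring: redundant fields are sent once per flow instead of once per record.
flow = [{"probe": "probe-1", "type": "netflow", "bytes": 4200},
        {"probe": "probe-1", "type": "netflow", "bytes": 120}]
header, bodies = factor(flow, ["probe", "type"])
```

In both cases the reduced form is only meaningful to parties that share the alias table or the flow header, which is why expansion or restoration has to happen at the system boundary.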
The security ramifications of collecting and correlating activity data cannot
be overstated. It is our opinion that we must address the issues up-front
to allow for open dialogue about the risks inherent in this style of activity,
but also to weigh the value of new methods against the risk of abuse, and
to openly encourage work to maximize value and minimize risk. Significant
valuable work has already been accomplished in research, but it needs to find
an outlet in deployed systems. By proposing the EDDY framework to enable this
analysis, we can also instrument that framework to maximize security and privacy
as we enable these new features. It is essential that any design or specific
implementation of this EDDY architecture describe the security
mechanisms it employs to address these serious concerns.
Base Functionality
The fundamental mechanism for processing a CER is an EDDY agent. Agents are built
using the EDDY Agent Framework, which provides the following basic set of functions
for all agents: transporting CERs between agents across the EDDY backplane;
converting the external CER representation to internal variables and vice
versa; and filter semantics for selection of CERs along the backplane. The combination
of transport and filtering amounts to a routing function for EDDY CERs. The
agent architecture was intended to be extremely simple as well as flexible,
to accommodate a wide variety of diagnostic orchestration scenarios. When
building a new agent, the developer should base the design on the following
axioms, which are the major design principles that drove the present agent
architecture:
- Agents should be as simple as possible so they may be combined
with others to form higher-order processing units, where the sum is greater
than the parts; this exemplifies their reuse.
- As a corollary to the first point, reuse of agents
should not be so hard an axiom that it impedes a single agent being built
rapidly to achieve its end goal.
Agent Classes
The advantage of an agent-based design is the ability to combine agents
to extract and inject new types of events that represent higher-order events.
When creating an EDDY backplane using agents, it becomes apparent that groups
of agents have very similar functions. The following is a description
of these agents, grouped by area of functionality.
- Normalizers accept external events (such
as net flow or a record in an HTTPD log) and inject them into the EDDY backplane
as rapidly as possible. This is done by normalizing the external event into
a CER (raw or cooked) pertaining to the type of event. Examples of event
types that normalizers inject are:
o Security domain - Snort, IDS, network flow, security
logs
o Network domain - network flow, SNMP, RMON
o System domain - RAID events, Unix dmesg records
o Application domain - HTTP, sendmail, spam engines,
DNS records, DHCP
o Environmental domain - control actuators (steppers,
pneumatic, etc.), sensors (video, temperature, motion, light, sound,
etc.)
- Transformation agents change the contents
of a CER. They can be very simple, changing one specific field within
a CER to another, or complex, creating an entirely new type of CER
or changing a raw CER to a cooked one. Examples of transformation agents are:
o Remove the User ID portion from an SSH CER.
o From a stream of NetFlow cooked CERs, create a new CER
that indicates the top 100 hosts within the last five minutes
that produce the most bytes, flows, and packets on the network.
o Cook a raw event from an Apache HTTP CER.
o Anonymize the source and destination fields of a
NetFlow cooked CER.
- Storage agents are the repositories for
CERs. They have the base functionality of all agents, plus a query channel
that provides an API for selecting
specific CERs. There are three sub-classes of storage agents:
o Basic - free-form schema best representing the needs
of the diagnostic application. The schema can include all the fields
of a CER or be limited to specific fields.
o High performance or archive - stores the complete CER
and can be designed to accommodate high event flow rates greater than
10K/second.
o Directory - acts as the location and/or discovery service
for resources within the EDDY backplane. Examples include finding
specific types of event repositories or the topology of the EDDY cluster.
- Application agents are the membrane between
the EDDY backplane and external systems. Application agents export events
(represented as CERs or another type of format) to external management or
analysis systems. They can also use the EDDY query channel to query storage
agents to gather information about events. Examples of application agents
are:
o Sending email to an alerting system
o Issuing an SNMP trap to a network system
o Posting a SOAP event to an external system
- Display agents massage CERs and prepare
them to be used by a visualization system. They are very similar to application
agents but they provide a service where a display client (such as a web
browser or a Java applet) can connect and retrieve data prepared for the
specific visualization application. Examples are:
o A real-time application that displays the “top
talker” hosts across both the commodity and Internet2 egress networks.
o Visualization of where most of the events with CER
warning levels greater than high are occurring within the enterprise.
- Analysis agents use both the EDDY event and query channels
to provide some diagnosis of events and inject a higher order of event back
into the backplane. Examples are,
o Observing BGP events, noticing that the routing fabric has changed and injecting
a new event that describes the new routing structure.
o Observing network flow and Shibboleth events and generating a new event
that suggests that the Shire has been misconfigured.
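Two of the transformation examples above can be sketched as small composable functions, in keeping with the axiom that agents stay simple so they can be combined. The dictionary layout of a cooked flow record (`src`, `dst`, `bytes`) and the function names are assumptions; the sketch ranks hosts by bytes only and leaves out the five-minute windowing, and it uses a keyed hash (HMAC-SHA256) as one possible anonymization technique rather than anything EDDY prescribes.

```python
import hashlib
import hmac
from collections import Counter


def anonymize(cer, secret, fields=("src", "dst")):
    """Replace address fields with a keyed hash so equal hosts stay correlatable."""
    out = dict(cer)
    for name in fields:
        digest = hmac.new(secret, str(cer[name]).encode(), hashlib.sha256)
        out[name] = digest.hexdigest()[:12]
    return out


def top_talkers(cers, n=100):
    """Compose a stream of flow records into one higher-order summary event."""
    totals = Counter()
    for cer in cers:
        totals[cer["src"]] += cer["bytes"]
    return {"type": "top_talkers", "hosts": totals.most_common(n)}


flows = [
    {"src": "10.0.0.1", "dst": "10.0.0.9", "bytes": 500},
    {"src": "10.0.0.2", "dst": "10.0.0.9", "bytes": 100},
    {"src": "10.0.0.1", "dst": "10.0.0.8", "bytes": 50},
]

summary = top_talkers(flows, n=1)

# Chaining the two agents: the summary is computed over anonymized records.
anonymous_summary = top_talkers((anonymize(f, b"site-secret") for f in flows), n=1)
```

Because the keyed hash maps the same host to the same token, the anonymized stream still supports the top-talker analysis; only someone holding the secret can test whether a token corresponds to a particular address.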
Agent Management
All agents are managed via an agent manager process, whose functionality
consists of starting, stopping and providing the status of each agent. Multiple
agents can exist on one host or they can be highly distributed throughout the
backplane. The EDDY agent framework provides the following functionality for
all agents.
- Configuration, management
- Transport: initializing the EDDY event channel to and from
other agents in the event pipeline
- Filtering: defining what CERs to accept or reject
- Routing: defining what CERs need to be routed to other
agents in the event pipeline
(c) 2003-2008 Carnegie Mellon. All Rights Reserved.