(EPSRC Grant EP/F035217/1, July 2008 to June 2011)
Project Motivation
Real-time stream data has begun to play an increasingly important role
on the Internet. One cause of this is the proliferation of
geographically-distributed stream data sources such as sensor networks,
scientific instruments, pervasive computing environments and web feeds
connected to the Internet. Potentially millions of users world-wide
want to take advantage of this data. They therefore
require a convenient way to process real-time stream data at global
scale through applications that perform Internet-scale stream processing (ISSP).
Just as a DBMS lets users pose relational queries with ease, stream-processing
systems allow users to access and manipulate distributed data streams
through declarative queries. However, the scale of an Internet-wide
system poses substantial challenges when it comes to providing a dependable
service. Any such system must gracefully handle the failure of network
links and processing hosts while managing a large pool of CPU and
network resources.
For example, astronomers want to detect
transient sky events, such as gamma-ray bursts, in real time. To detect
such events, they must correlate real-time image streams from
geographically-distributed radio telescopes. These events only last for
minutes and, after an event has been detected, instruments need to be
re-aligned to focus on on-going occurrences. The left figure below
shows an ISSP system executing a query that takes images from radio
telescopes, processes them in real time and delivers data about
transient anomalies to two astronomers. The processing is done by a
distributed set of hosts (or data centres). The logical structure of
the query implementing the application is shown on the right. The
system must ensure that image data is transported reliably between
processing sites. However, some data loss is acceptable, as long as no
transient sky events are missed as a consequence.
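[Figure, left: an ISSP system executing a query over image streams from radio telescopes. Figure, right: the logical structure of the query.]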
Project Overview
The DISSP project investigates how to build dependable
Internet-scale stream-processing systems for interconnecting tomorrow's
pervasive sensor systems and global scientific experiments. We argue
that Internet-scale stream processing needs new models for achieving
dependability. Achieving dependability in this context is a significant
challenge for several reasons: (1) failure will be the common case,
since the sheer size of the system means that a fraction of Internet
paths and hosts will be unavailable at any time; (2) the real-time
nature of the data means that there is little time for recovery from
failure; (3) a shared infrastructure, such as an ISSP system, will
experience high utilisation, so the additional resource demand during
recovery can overload the system. The traditional approach of
substantially over-provisioning a system to compensate for failure is
infeasible in such a shared, federated platform.
We therefore
believe that we need to depart from the hard dependability guarantees
of traditional DBMSs and today's stream-processing systems. Ensuring no
tuple loss at all times may be feasible within a single data centre,
but we cannot hope to achieve this at Internet scale. Instead, we
explore dependability guarantees that are driven by application
requirements. Many sensing applications can cope with a controlled
degradation of result quality. While result quality is reduced, the
system provides constant feedback
to users on the achieved level of service. Feedback is expressed in a
domain-specific way, e.g., by notifying a scientific user about the
reduction in detection confidence for events of interest. This feedback
also drives an adaptive fault-tolerance mechanism
that allows the DISSP system to reason about resource allocation in
order to minimise the reduction in service quality for as many
users as possible.
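The idea of controlled degradation with user feedback can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not part of the DISSP design: the `Query` class, the quality metric (fraction of resource demand served), the greedy allocation policy and the wording of the feedback are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """A running query (hypothetical model, not the DISSP query model)."""
    user: str
    demand: float           # resource units needed for full quality; assumed > 0
    allocated: float = 0.0  # resource units actually granted

    def quality(self) -> float:
        # Assumed quality metric: fraction of the demand that is served.
        return min(1.0, self.allocated / self.demand)


def allocate(queries: list[Query], capacity: float) -> None:
    """Greedy policy (one possible choice): serve the cheapest queries in
    full first, so that as many users as possible see no degradation.
    Whatever capacity is left goes to the next-cheapest query."""
    remaining = capacity
    for q in sorted(queries, key=lambda q: q.demand):
        q.allocated = min(q.demand, remaining)
        remaining -= q.allocated


def feedback(queries: list[Query]) -> dict[str, str]:
    """Build a per-user report on the achieved level of service."""
    report = {}
    for q in queries:
        pct = round(100 * q.quality())
        if pct == 100:
            report[q.user] = "full quality"
        else:
            report[q.user] = f"degraded to {pct}% of full quality"
    return report


# Three queries need 12 units in total; a failure has left only 9.
queries = [Query("alice", 4.0), Query("bob", 2.0), Query("carol", 6.0)]
allocate(queries, capacity=9.0)
print(feedback(queries))
```

In this example "alice" and "bob" keep full quality while "carol" is told that her results are degraded to 50%. A real system would replace the percentage with a domain-specific measure, such as the reduction in detection confidence mentioned above.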
Project Objectives
The aims of the DISSP project are to:
- Investigate and develop a novel reliability model for DISSP that includes user-perceived quality of query results and provides feedback to users on quality degradation due to unmaskable network and host failures.
- Provide adaptive fault-tolerance mechanisms for DISSP systems that select appropriate strategies for maximising result quality over the lifetime of potentially long-running queries (e.g., over several months).
- Design, implement and evaluate a scalable prototype system for DISSP using controlled experiments on the Emulab network testbed.
- Deploy an open, global and shared platform for DISSP as a public service on the PlanetLab research network, thereby facilitating and encouraging the use of DISSP across research communities.
The specific research issues addressed by the project include:
- How to choose a realistic failure model for a DISSP system in terms of correlated and uncorrelated host and network failures over long periods of time?
- What are appropriate data and query models for DISSP systems that may suffer from failure of the infrastructure? How to specify queries that can be dynamically adapted to failures of query operators and data sources?
- How to relate low-level failures to application-level degradation of result quality? How to design application-specific and -independent utility functions for expressing user-perceived benefit of long-running queries in DISSP?
- What is a suitable set of proactive and reactive techniques for a DISSP system to compensate for or mask short- and long-term host and network failures?
- What are light-weight and scalable mechanisms to reconcile inconsistent state of stream processing operators after failure recovery?
- How to leverage approaches from autonomic computing, control theory and machine learning to design scalable and adaptive feedback mechanisms for selecting between different fault-tolerance techniques?
- What level of functionality and which interfaces are needed when providing a shared DISSP system for a global, inter-disciplinary user base?

