Full Program »
Software bugs and vulnerabilities cause serious problems to both home users and the Internet infrastructure, limiting the availability of Internet services, causing loss of data, and reducing system integrity. Software self-healing using rescue points (RPs) is a known mechanism for recovering from unforeseen errors. However, applying RP-based self-healing on multitier architectures is problematic because certain actions, like transmitting data over the network, cannot be undone. We propose cascading rescue points (CRPs) to address the state inconsistency issues that can arise when using traditional RPs to recover from errors in interconnected applications. With CRPs, when an application executing within a RP transmits data, the remote peer is notified to also perform a checkpoint, so the communicating entities checkpoint in a coordinated, but loosely coupled way. Notifications are also sent when RPs successfully complete execution, and when recovery is initiated, so that the appropriate action is performed by remote party. We developed a tool that implements CRPs by dynamically instrumenting binaries and transparently injecting CRP-related notifications in already established TCP channels between applications. We tested our tool with various applications, including the MySQL and Apache servers, and show that it allows them to successfully recover from errors, while incurring moderate overhead between 4.54% and 71.56% with the tested applications.
Author(s):
Angeliki Zavou
Columbia University
United States
Georgios Portokalidis
Columbia University
United States
Angelos D. Keromytis
Columbia University
United States