Kakute: A Precise, Unified Information Flow Analysis System for Big-data Security
Big-data frameworks (e.g., Spark) enable computations on a tremendous number of data records generated by third parties, which introduces various security and reliability problems such as information leakage and programming bugs. Existing systems for big-data security (e.g., Titian) track data transformations at the record level, so they are imprecise and too coarse-grained for these problems. For instance, when we ran Titian to drill down to the input records that produced a buggy output record, Titian reported 3 to 9 orders of magnitude more input records than the actual ones. Information Flow Tracking (IFT) is a conventional approach for precise information control. However, extant IFT systems are neither efficient nor complete for big-data frameworks, because these frameworks are data-intensive, and data flowing across hosts is often ignored by IFT. This paper presents KAKUTE, the first precise, fine-grained information flow analysis system for big-data. Our insight on making IFT efficient is that most fields in a data record often have the same IFT tags, and we present two new efficient techniques called Reference Propagation and Tag Sharing. In addition, we design an efficient, complete cross-host information flow propagation approach. Evaluations on 7 diverse big-data programs (e.g., WordCount) show that KAKUTE has merely 32.3% overhead even when fine-grained information control is enabled. Compared with Titian, KAKUTE precisely drilled down to the actual bug-inducing input records, a huge reduction of 3 to 9 orders of magnitude. KAKUTE’s performance overhead is comparable to Titian’s. Furthermore, KAKUTE effectively detected 13 real-world security and reliability bugs in 4 diverse problems, including information leakage, data provenance, programming, and performance bugs. KAKUTE’s source code is available at https://github.com/acsac17-p78/kakute.