Course M3 – Code Transformation Techniques for Software Protection
Dr. Christian Collberg, University of Arizona
Dr. Jasvir Nagra, Google Inc.
Monday, December 5th, Full Day
In this professional development course we will describe techniques for software protection, i.e. techniques for protecting secrets contained in computer programs from being discovered, modified, or re-distributed. Important applications include protecting against software piracy, license check tampering, and cheating in on-line multi-player games. The attack model is very liberal: we assume that an adversary can study our program's code (maybe first disassembling or decompiling it), execute it to study its behavior (perhaps using a debugger), or alter it to make it do something different than what we intended (such as bypassing a license check). In a typical defense scenario we use code transformation techniques to add confusion to our code to make it more difficult to analyze (statically or dynamically), tamper-protection to prevent modification, and watermarking to assert our intellectual property rights (by embedding a hidden copyright notice or unique customer identifier).
Software protection is a fairly new branch of computer security. It's a field that borrows techniques not only from computer security, but also from many other areas of Computer Science such as cryptography, steganography, media watermarking, software metrics, reverse engineering, and compiler optimization. The problems we work on are different from other branches of computer security: we are concerned with protecting the secrets contained within computer programs. We use the word secrets loosely, but the techniques we present in this course (code obfuscation, software watermarking and fingerprinting, tamperproofing, and birthmarking) are typically used to prevent others from exploiting the intellectual effort invested in producing a piece of software. For example, software fingerprinting can be used to trace software pirates, code obfuscation can be used to make it more difficult to reverse engineer a program, and tamperproofing can make it harder for a hacker to remove a license check.
In the most general sense, to obfuscate a program means to transform it into a form that is more difficult for an adversary to understand or change than the original code. "Difficult," means that the obfuscated program requires more human time, more money, or more computing power to analyze than the original program. Under this definition, to distribute a program in a compiled form rather than as source is a form of obfuscation, since analyzing binary machine code is more demanding than reading source. Similarly, we would consider a program that has been optimized more obfuscated than one that has not, since many code optimizations make analysis (both by humans and tools such as disassemblers and decompilers) more onerous.
However, the tools and techniques we present is this course go further than compilation and optimization to make a program hard to understand. In contrast to an optimizer which rearranges code for the purposes of efficiency, a code obfuscator transforms code for the sole purpose of making it difficult to analyze. A negative by-product of obfuscating transformations is that the resulting code often becomes larger, slower, or both.
Code obfuscation can be both static and dynamic. Static techniques reorganize the program text itself to make it hard to analyze using static analysis techniques. Dynamic techniques use self-modifying code to attempt to thwart an adversary's dynamic analysis (such as using a debugger).
What's particularly interesting about code obfuscation is that it's a double-edged sword: bad guys use it to protect their malware from discovery, good guys use it to protect their programs from reverse engineering, and bad guys can also use it to destroy secrets (such as watermarks) stored in the good guys' programs.
The goal of obfuscation is to add so much confusion to our program that an adversary will give up trying to understand or modify it. In addition to obfuscating the code we can also tamperproof it. This means that if the adversary tries to make modifications to program he will be left with a program with unintended side effects: the cracked program may simply refuse to run, it could crash randomly, it could phone home and tell us about the attack, etc.
In general, a tamperproofing algorithm performs two duties. First it has to detect that the program has been modified. A common strategy is to compute a checksum over the code and compare it to an expected value. An alternative strategy is check that the program is in an acceptable execution state by examining the values of variables. Once tampering has been detected the second duty of a tamperproofing algorithm is to execute the tamper response mechanism, for example causing the program to fail. This is actually fairly tricky: we don't want to alert the attacker to the location of the tamperproofing code since that will make it easier for him to disable it.
A particularly interesting application of tamperproofing is remote tamperproofing, when the code we want to protect from attack resides on a remote machine and is in constant contact with our own server. For example, consider a multi-player online computer game where, for performance reasons, the client program caches information (such as local maps) that it won't let the player see. A player who can hack around such limitations will get an unfair advantage over his competitors. Another application of remote tamperproofing is protecting the Internet from rampant attacks on the TCP/IP code stack. An adversary who can hack the networking code to subvert the back-off protocol and distribute it widely could have a devastating effect on the network. Remote tamperproofing also has applications to grid computing (someone buying cycles on a supercomputer needs to protect the privacy of their inputs and outputs and the integrity of the code itself), wireless sensor networks (the nodes must collectively detect if one or more sensors have been compromised), and digital medical records (servers must be able to detect tampering of terminals in doctors' offices).
Watermarking code is a way to track programs by inserting unique identifiers into them. Specifically, given a program P, a watermark w (an integer or string), and a key k, a software watermark embedder produces a new program Pw. We want Pw to be semantically equivalent to P (have the same input/output behavior), be only marginally larger and slower than P, and, of course, contain the watermark w. The extractor takes Pw and the key k as input and returns w. The mark also has to be resilient to attacks. An adversary could, for example, subject Pw to semantics-preserving transformations (code optimization and obfuscation) in order to destroy the mark.
Watermarking algorithms are also based on code transformations. For example, algorithms have been designed based on permuting the code, inserting non-functional code, or as solutions to static analysis problems.
Designing obfuscation, watermarking and tamper-proofing algorithms requires analysis of programming languages to identify the security properties that must be maintained, understanding the limits of analysis and creating transformations which exploit these limits. As such, software protection provides a rich source of challenge problems to motivate theoretical as well as practical and experimental research.
Outline
- Introduction. What is software protection? What problems do we work on?
- Attack Models. Who is our adversary? What techniques are at his disposal?
- Code Obfuscation. Code transformation techniques for preventing malicious reverse engineering of programs. How do we defeat static analysis? How do we defeat dynamic analysis? How can adversaries use obfuscation to affect the results of electronic voting?
- Obfuscation Theory. Theoretical background to obfuscation. What can we hide in a program? What can't we hide in a program?
- Tamperproofing. Techniques for preventing modifications of programs. How can we stop the removal of licensing checks? How can we stop cheating in on-line games? How can we prevent attacks against the TCP stack that could potentially take down the Internet?
- Watermarking. Techniques for embedding unique identifiers in programs to prevent software piracy.
- Conclusion. Directions for future research.
Prerequisites
An understanding of basic compiler/program analyis techniques is helpful, but not necessary.
About the Instructors
Dr. Christian Collberg received a BSc in Computer Science and Numerical Analysis and a Ph.D. in Computer Science from Lund University, Sweden. He is currently an Associate Professor in the Department of Computer Science at the University of Arizona and has also worked at the University of Auckland, New Zealand, and the Chinese Academy of Sciences in Beijing. Prof. Collberg is a leading researcher in the intellectual property protection of software, and also maintains an interest in compiler and programming language research. In his spare time he writes songs, sings, and plays guitar for The Zax and hopes one day to finish up his Great Swedish Novel.
Dr. Jasvir Nagra received his B.Sc. in Mathematics and Computer Science and a Ph.D. in Computer Science from the University of Auckland, New Zealand. He's been a Post Doctoral scholar on the RE-TRUST project at the University of Trento where his focus was on applying obfuscation, tamperproofing and watermarking techniques to protect the integrity of software executing on a remote untrusted platform. His research interests also include the design of programming languages and its impact on the security of applications. He's currently with Google, Inc where he is building Caja, a open-sourced, secure-subset of javascript. In his spare time Jasvir dabbles with Lego and one day hopes to finish building his Turing machine made entirely out of Lego blocks.