Many software providers operate crash reporting services to automatically collect failures in deployed software from millions of customers. Triaging and debugging such failures is critical because they impact real users and customers. However, doing so is notoriously hard in practice because developers have to rely on limited information such as memory dumps. In this talk, I will present two systems we built at Microsoft Research to address the challenges in triaging and debugging failures in deployed software. Both systems were deployed inside Microsoft as major solutions for triaging and debugging software failures. I will also share our experiences in developing and deploying these solutions in practice.
First, I will present RETracer, the first system to triage software failures based on program semantics reconstructed from memory dumps. RETracer is designed to meet the requirements of large-scale crash reporting services. It performs binary-level backward taint analysis without a recorded execution trace to understand how functions on the stack contribute to the failure. When comparing it with the previous crash triaging tool used by Microsoft, we find that RETracer eliminates two thirds of triage errors based on a manual analysis of 140 bugs fixed in Microsoft Windows and Office.
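To give a feel for what backward taint analysis means here, the following is a toy sketch and not RETracer's implementation. It assumes a simplification that the real system does not have: a known, linear list of executed instructions, each reduced to one written location and the locations it read. The names `Instr` and `backward_taint` are hypothetical and introduced only for illustration.

```python
from typing import List, NamedTuple, Set, Tuple


class Instr(NamedTuple):
    func: str               # function the instruction belongs to
    dst: str                # location written (register or memory cell name)
    srcs: Tuple[str, ...]   # locations read; empty tuple means a constant


def backward_taint(trace: List[Instr], bad_location: str) -> List[str]:
    """Walk the trace backward from the faulting location and return the
    functions that wrote a value flowing into it, most recent writer first."""
    tainted: Set[str] = {bad_location}
    contributors: List[str] = []
    for instr in reversed(trace):
        if instr.dst in tainted:
            # The written value came from the sources, so taint moves to them.
            tainted.discard(instr.dst)
            tainted.update(instr.srcs)
            if instr.func not in contributors:
                contributors.append(instr.func)
    return contributors


# Hypothetical trace: parse() stores a null pointer that use() later loads
# and dereferences; helper() touches an unrelated register.
trace = [
    Instr("parse", "rax", ()),           # rax = NULL constant
    Instr("parse", "[buf]", ("rax",)),   # store rax into buf
    Instr("helper", "rbx", ()),          # unrelated write
    Instr("use", "rcx", ("[buf]",)),     # load buf into rcx, then crash on rcx
]
result = backward_taint(trace, "rcx")
```

In this toy trace the bad pointer in `rcx` is traced back through `use` to `parse`, while `helper` is never implicated because nothing it wrote reaches the faulting value.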
Second, I will present REPT, a practical system that enables reverse debugging of failures in deployed software. REPT reconstructs the execution history with high fidelity by combining online lightweight hardware tracing of a program’s control flow with offline binary analysis that recovers its data flow. It is seemingly impossible to recover data values thousands of instructions before the failure due to information loss and concurrent execution. REPT tackles these challenges by iteratively performing forward and backward execution with error correction and constructing a partial execution order with timestamps logged by hardware. When evaluating it on 16 real-world bugs, we find that REPT can recover data values accurately (93% on average) and efficiently (less than 20 seconds) for these bugs, and enables effective reverse debugging for 14 of them.
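The idea of alternating backward and forward passes can be illustrated with a toy sketch, which is not REPT's algorithm. It assumes a made-up three-instruction register machine (`movi`, `mov`, `addi`), a single thread, and a known instruction sequence standing in for the hardware control-flow trace; it omits error correction and memory entirely. The function names `learn` and `recover` are hypothetical.

```python
from typing import Dict, List, Tuple, Union

Instr = Tuple[str, str, Union[str, int]]  # (op, dst, source register or immediate)
State = Dict[str, int]                    # known register values at one point


def learn(state: State, reg: str, val: int) -> bool:
    """Record a newly inferred value; return True if it was not known before."""
    if reg in state:
        return False
    state[reg] = val
    return True


def recover(instrs: List[Instr], final: State) -> List[State]:
    """Infer register values before each instruction from the final state,
    repeating backward and forward passes until nothing new is learned."""
    n = len(instrs)
    states: List[State] = [dict() for _ in range(n)] + [dict(final)]
    changed = True
    while changed:
        changed = False
        for i in range(n - 1, -1, -1):           # backward pass
            op, dst, arg = instrs[i]
            before, after = states[i], states[i + 1]
            for reg, val in after.items():
                if reg != dst:                   # untouched registers carry back
                    changed |= learn(before, reg, val)
            if op == "addi" and dst in after:    # addition is invertible
                changed |= learn(before, dst, after[dst] - arg)
            if op == "mov" and dst in after:     # source held the copied value
                changed |= learn(before, arg, after[dst])
        for i in range(n):                       # forward pass
            op, dst, arg = instrs[i]
            before, after = states[i], states[i + 1]
            for reg, val in before.items():
                if reg != dst:                   # untouched registers carry forward
                    changed |= learn(after, reg, val)
            if op == "movi":
                changed |= learn(after, dst, arg)
            if op == "addi" and dst in before:
                changed |= learn(after, dst, before[dst] + arg)
            if op == "mov" and arg in before:
                changed |= learn(after, dst, before[arg])
    return states


# Hypothetical program; the last instruction overwrites rax, losing its old value.
program: List[Instr] = [("movi", "rax", 5), ("mov", "rbx", "rax"),
                        ("addi", "rbx", 3), ("movi", "rax", 0)]
states = recover(program, {"rax": 0, "rbx": 8})
```

Here the backward pass alone cannot tell what `rax` held just before it was overwritten, but it does recover `rax` earlier in the program by inverting the `addi` and `mov`; the forward pass then carries that value down to the point of the overwrite. Conflicting inferences, which the real system has to detect and correct, cannot arise in this toy because every rule is exact.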
Weidong Cui is a Principal Researcher managing the Systems Security and Privacy Research group in the Microsoft Research Redmond lab. Weidong enjoys building real-world systems to tackle hard problems. His current passion is ensuring that Microsoft Azure is the most secure cloud. Weidong and his team built REPT, the first lightweight record and replay solution that is widely deployed to enable reverse debugging of software failures. Weidong led the development of RETracer, a technology that significantly improves the triaging accuracy of access violations. Weidong and his collaborators introduced controlled-channel attacks that can steal rich information from secure enclaves. Weidong led the development of KOP, a Windows kernel rootkit detection system that still represents the state of the art after many years. Weidong is also known for his early work on automatic protocol reverse engineering. Weidong received his Ph.D. and M.S. degrees from UC Berkeley, and his M.E. and B.E. degrees from Tsinghua University.