Friday, April 03, 2015

Memory bugs and finding them

I am working on creating a JIT compiler for Lua and Ravi. Ravi is a Lua derived language with some enhancements to help improve performance. The JIT compilation is being implemented using LLVM.

As I implement more parts of the Lua language I am able to run more of the standard Lua test suites in JIT mode. A few days ago I encountered a nasty bug - one of the test cases failed on Windows with a run-time exception that seemed to imply an invalid or misaligned stack. Now this is a particularly hard bug to find as at runtime there is a mix of code compiled by the MSVC compiler (the Lua C code) and the code generated by LLVM. The LLVM generated code has no debugging information right now as I haven't yet implemented the required metadata. So the problem is how to investigate the issue if the fault is in LLVM generated code.

Confusingly the error only occurred on Windows - but not on Linux, so I was initially led to believe that the error may be due to some issue with how LLVM was using the stack on Windows.

The problem with memory errors is that they are Heisenbugs. Any change you do to investigate the issue such as adding a debug print may help the bug to hide. So investigation is particularly hard.

My initial attempt at finding the root cause was to run the tests under Valgrind. Valgrind reported some possible memory leaks but not in my code - the reported potential leaks were in LLVM code. There were no reports of buffer overflow or memory overwrites.

I guess that if the problem is some portion of the stack being overwritten then Valgrind cannot find it as it is more a tool for analyzing issues with heap allocation.

My next stop was to use Address Sanitizer. This is a tool created by Google engineers - it works by instrumenting the code using some compiler extensions. Fortunately this capability is available in GCC 4.8.2 which is the compiler I am using on Ubuntu. Just by adding a compiler option (-fsanitize=address) one can get instrumented C code. The tool not only checks heap allocations but also the stack for invalid reads/writes.

When I ran the test suite with the instrumented version asan reported an issue and aborted the program. Fortunately, you can just run the application under GDB, and break at the point where asan reports the error. And if the code has appropriate debug information, then GDB will tell you the line of code that caused the error.

To cut a long story short - I found that the memory error was being caused because I had not modified the Lua Debug API to work with JITed code. I had incorrectly assumed that the Debug API is only used if invoked explicitly. When an error is raised in Lua, the error routine uses the Debug API to obtain information about the code running at the time. This relies upon the 'savedpc' field which is not populated in JITed code - so any calls to the Debug API that relies on this can lead to unexpected memory access. The fix I implemented was to treat JITed functions as if they are C functions.

The reason for this post is however that I found Address Sanitizer to be an amazing tool. It is a life saver for C/C++ programmers.