Saturday, July 5, 2014

Could you be .... the most beautiful bug in the world ....


That's right. I've spent almost 3 weeks to fix a bug which has been haunting me all this time, but now it looks like the most "beautiful" I've ever encountered.

The context

We have an nUnit test for a medical streaming application that basically reads volumes (depicting body parts) from a disk, loads them in memory and them streams them to the rendering pipeline to be displayed on the screen.
The test consists of two parts:
1. initialization - where, among others, we force garbage collection, by calling GC.Collect(2)
2. the streaming itself, during which we perform some measurements; one of these is the number of generation 2 collections, which should stay on 0 for the whole duration of the streaming (we tolerate only generation 0 and 1 collections, but since generation 2 collections are expensive, we absolutely want to avoid them occurring during our streaming); let's call this measurement GC2; so we expect GC2 = 0

The issue I faced was that suddenly GC2 was 1, so the test was failing because of that.

The approach

First I used .NET Memory Profiler 4.5 to analyze my nUnit test. But before doing that, I needed to run nUnit in process, because otherwise it was starting the nunit-agent.exe process and then I was analyzing the wrong process. In order to run nUnit (version 2.6.2) in process, I changed the nunit-x86.exe.config like this:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <startup useLegacyV2RuntimeActivationPolicy="true">
    <supportedRuntime version="v4.0.30319" />
  </startup>

  <runtime>
    <legacyUnhandledExceptionPolicy enabled="1" />

    <loadFromRemoteSources enabled="true" />
    
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <probing privatePath="lib;addins" />
    </assemblyBinding>
  </runtime>
</configuration>


With this in place, I could analyze the nunit-x86 process and noticed the garbage collection pattern as in the figure below:


As I said, I knew that during the initialization phase we triggered several generation 2 collections and on top of those there was another one causing the test to fail. How to identify it ?
And then I've got this simple idea: I measure the number of collections of gen 2 at the beginning of the streaming (after all initialization has taken place) and then, for each sample streamed during the test, I count the number of gen 2 collections again and if it increased, I produce a beep:

g2 = GC.CollectionCount(2);
....
if (GC.CollectionCount(2) - g2 > 0) {
    Console.Beep(500, 200);
}

Yes, a simple beep would tell me when the extra gen 2 collection was produced.
With this simple approach, I analyzed the test by taking snapshots of memory immediately before and after the beep. Namely, I've used the following tools:
- .NET Memory Profiler 4.5
- CLRProfiler
- ANTS
- WinDbg
- PerfView (with this one I could inspect memory dumps that I generated from the code)

All these tools showed me the same thing: there was no suspicious garbage that could have triggered a new gen 2 collection. So what was causing this extra collection after all ?

The other approach - back to basics

After many hours of investigating where could the garbage come from and not being able to find anything, I decided to get back to basics: I started the good, old perfmon and then it struck me - I noticed the "# Induced GC" counter under ".NET CLR Memory" category and I though let's give it a try.
I ran nunit under perfmon and when I heard the "beep" (see above), I also saw the "Induced GC" growing by one. So that was it ! The generation 2 collection was induced, it wasn't due to any extra garbage.
The next step was to find out who induced it. For this I used JetBrains dotTrace Performance 5.3.1 and this is what I've discovered:


Thus, the induced gen 2 collection is called from Microsoft code, namely from a GDI Bitmap method.
This discovery finally led me to the solution for my bug, but I'll spare you the details ....

Conclusion

From the very beginning, I've been put on the wrong foot. I was trying to look for any suspicious garbage that would trigger my gen 2 collection, but I couldn't find any. I used many tools and I spent a lot of time to analyze this, without finding anything. Because there was nothing to find. My gen 2 collection was induced, i.e. forced by Microsoft code.
Getting back to basics proved to lead me to the solution: simple tools like perfmon can't be ignored. Actually, you should start with this, be it for the fact that you might find some hints. Armed with this knowledge, you can then use any other more complex tools to analyze it further.