How can I speed up my code coverage tool?
I wrote a little code coverage utility that logs which basic blocks in an x86 executable get executed. It works without source code or debug symbols for the target; it just takes a list of the basic blocks it monitors.
However, it has become a bottleneck in my application, which involves repeated coverage snapshots of the same executable image.
It has gone through a couple of stages as I tried to speed it up. I started by simply placing an INT3 at the beginning of each basic block, attaching to the process as a debugger, and logging the hits. I then tried to improve performance by patching any block larger than 5 bytes (the size of a JMP REL32) with a jump out to a counter: I write a little stub ("mov [blah], 1 / jmp backToTheBasicBlockWeCameFrom") into the process's memory space and patch in the JMP. This made things much faster, since there is no exception and no debugger round-trip, but I would like to speed things up further.
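Roughly, a patch like that amounts to the following (a simplified sketch in C, assuming a 32-bit target, that the displaced original bytes contain no position-dependent instructions, and that the stub and coverage memory have already been allocated in the target; patch_block and the other names are illustrative, and the VirtualProtectEx / FlushInstructionCache housekeeping is omitted):

#include <windows.h>
#include <stdint.h>
#include <string.h>

/* block_addr, stub_addr and covered_addr are addresses inside the target
   process; they are only used for address arithmetic here and as arguments
   to WriteProcessMemory. */
static BOOL patch_block(HANDLE process,
                        uint8_t *block_addr,   /* start of the basic block (>= 5 bytes) */
                        uint8_t *stub_addr,    /* where the stub lives in the target */
                        uint8_t *covered_addr, /* one coverage byte for this block */
                        const uint8_t *orig, size_t orig_len) /* displaced bytes, >= 5 */
{
    uint8_t stub[64];
    size_t  n = 0;

    /* mov byte ptr [covered_addr], 1  ->  C6 05 disp32 01 */
    stub[n++] = 0xC6; stub[n++] = 0x05;
    memcpy(&stub[n], &covered_addr, 4); n += 4;
    stub[n++] = 0x01;

    /* replay the original bytes displaced by the hook */
    memcpy(&stub[n], orig, orig_len); n += orig_len;

    /* jmp rel32 back to just after the hook */
    int32_t back = (int32_t)((uintptr_t)(block_addr + orig_len) - (uintptr_t)(stub_addr + n + 5));
    stub[n++] = 0xE9;
    memcpy(&stub[n], &back, 4); n += 4;

    /* the 5-byte hook itself: jmp rel32 from the block to the stub */
    uint8_t hook[5];
    int32_t fwd = (int32_t)((uintptr_t)stub_addr - (uintptr_t)(block_addr + 5));
    hook[0] = 0xE9;
    memcpy(&hook[1], &fwd, 4);

    return WriteProcessMemory(process, stub_addr, stub, n, NULL) &&
           WriteProcessMemory(process, block_addr, hook, sizeof hook, NULL);
}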
I am thinking of one of the following:
1) Pre-instrument the target binary with my patched-in counters (at the moment I do this at runtime). I could add a new section to the PE, put my counters in it, patch in all the hooks I need, and then just read the data back out of that section with my debugger after each run. This would gain me some speed (around 16% by my estimate), but I'd still have those annoying INT3s in the smaller blocks, which really hurt performance.
2) Have the target binary install its own UnhandledExceptionFilter and handle its own INT3s, in combination with the above. This would avoid a process switch from the debuggee to my coverage tool on every INT3, but it would still raise a breakpoint exception and make a round-trip through the kernel. Am I right in thinking this wouldn't buy me much performance? (I've sketched what I mean just after this list.)
3) Try to do something clever with Intel's hardware profiling facilities. That sounds pretty intimidating, and I don't understand how I would even go about it - is it possible from a Windows usermode application? I would go as far as writing a kernel-mode driver if it's reasonably straightforward, but I'm not a kernel coder (I've only dabbled) and it would probably cause me a lot of headaches. Are there other projects using this approach? I see the Linux kernel has support for this kind of profiling of the kernel itself, which makes me suspect that monitoring one particular usermode application would be difficult.
4) Use an off-the-shelf tool. It would need to work without any source or debug symbols, be scriptable (so I can run it in batches), and preferably be free (I'm pretty stingy). Paid tools aren't out of the question, though (if I can spend less on a tool and gain enough productivity to avoid buying new hardware, that would be a good excuse).
5) Something else. I'm running under VMWare on Windows XP, on fairly old hardware (Pentium 4-ish) - is there anything I've missed, or anything I should read up on? Can I get my JMP REL32 down to fewer than 5 bytes (and catch the smaller blocks without needing INT3)?
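To make option 2 concrete, the in-process handler I have in mind would be something like this rough sketch (g_covered, block_index_for and restore_original_byte are placeholder names for bookkeeping my tool would set up, and it assumes no debugger is attached, since otherwise the filter never runs):

#include <windows.h>

extern volatile char g_covered[];           /* one byte per instrumented block */
extern int  block_index_for(void *addr);    /* -1 if addr isn't one of our INT3s */
extern void restore_original_byte(void *addr);

static LONG WINAPI coverage_filter(EXCEPTION_POINTERS *info)
{
    if (info->ExceptionRecord->ExceptionCode == EXCEPTION_BREAKPOINT) {
        void *addr = info->ExceptionRecord->ExceptionAddress;
        int i = block_index_for(addr);
        if (i >= 0) {
            g_covered[i] = 1;
            restore_original_byte(addr);                       /* one-shot breakpoint */
            info->ContextRecord->Eip = (DWORD)(ULONG_PTR)addr; /* re-run the original instruction */
            return EXCEPTION_CONTINUE_EXECUTION;
        }
    }
    return EXCEPTION_CONTINUE_SEARCH;
}

void install_coverage_filter(void)
{
    SetUnhandledExceptionFilter(coverage_filter);
}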
Thanks.
If you insist on instrumenting binaries, pretty much your fastest coverage technique is the 5-byte jump-out/jump-back patch. (You're using the standard tricks of the binary instrumentation trade.)
The INT 3 solution will always involve a trap. Yes, you could handle the trap in your own space instead of in the debugger's, and that would speed it up, but it will never come close to being competitive with the jump-out/jump-back patch. You may need it as a fallback anyway, if the basic block you want to instrument is shorter than 5 bytes (e.g. "inc eax / ret"), because then you don't have 5 bytes to patch.
What you can do to optimize things a bit is to examine the code being patched. Without such examination, original code of the form:
instrn 1
instrn 2
instrn N
next:
is patched, in general, to look like this:
jmp patch
xxx
next:
with the patch itself being:
patch: pushf
inc count
popf
instrn1
instrn2
instrnN
jmp back
If all you want is coverage, you don't need to increment a counter, which means you don't need to preserve the flags:
patch: mov byte ptr covered,1
instrn1
instrn2
instrnN
jmp back
To keep the patch small, set a byte, not a word. And align the patch on a cache-line boundary so that the processor doesn't have to fetch two cache lines to execute it.
If you insist on counting, you can analyze instrn1/2/N to see whether they care about the flags that "inc" clobbers, and only pushf/popf when needed; or you can insert the increment between two instructions in the patch that don't care about the flags. You have to parse them to some extent anyway to handle complications such as instrnN being a ret, in which case you can generate a better patch (no "jmp back", for example).
You may find that "add count,1" is faster than "inc count" because it avoids partial condition-code updates and the pipeline interlocks that follow. This will slightly affect your flags-impact analysis, since "inc" doesn't set the carry flag but "add" does.
Another possibility is PC sampling. Don't instrument the code at all; just interrupt the thread periodically and sample the PC value. If you know where the basic blocks are, a PC sample anywhere inside a basic block tells you the whole block was executed. This won't necessarily give you precise coverage data (you can miss critical PC values), but the overhead is pretty low.
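A minimal sketch of that sampling loop, assuming a 32-bit target, a thread handle opened with THREAD_SUSPEND_RESUME and THREAD_GET_CONTEXT access, and a placeholder mark_covered_pc() that maps a PC to its basic block:

#include <windows.h>

extern void mark_covered_pc(DWORD pc);   /* PC -> basic-block lookup (placeholder) */

void sample_thread_pc(HANDLE thread, volatile LONG *stop, DWORD interval_ms)
{
    while (!*stop) {
        CONTEXT ctx;
        ctx.ContextFlags = CONTEXT_CONTROL;          /* we only need EIP */
        if (SuspendThread(thread) != (DWORD)-1) {    /* don't sample a running thread */
            if (GetThreadContext(thread, &ctx))
                mark_covered_pc(ctx.Eip);
            ResumeThread(thread);
        }
        Sleep(interval_ms);
    }
}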
If you are willing to instrument the source code, you can do better: just insert "cover[i] = true;" at the beginning of the i-th basic block and let the compiler take care of all the various optimizations. No patching required. The really interesting part is that if you have basic blocks inside nested loops and you insert source probes like this, the compiler will notice that the probe assignments are idempotent with respect to the loop and will hoist the probe out of the loop. Voila, zero probe overhead inside the loop. What more do you want?
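For example, a sketch of what such probes look like in C (the block numbering and the cover array are whatever your instrumenter assigns):

#include <stdbool.h>

#define NUM_BLOCKS 3                    /* assigned by whatever numbers the blocks */
static bool cover[NUM_BLOCKS];

/* One "cover[i] = true;" at the head of each basic block. The store in the
   loop body is loop-invariant, so an optimizing compiler can move it out of
   the loop - zero probe overhead per iteration. */
int sum_positive(const int *a, int n)
{
    int total = 0;
    cover[0] = true;                    /* block 0: function entry */
    for (int i = 0; i < n; i++) {
        cover[1] = true;                /* block 1: loop body */
        if (a[i] > 0) {
            cover[2] = true;            /* block 2: the taken branch */
            total += a[i];
        }
    }
    return total;
}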