c++ - How to demonstrate the impact of instruction cache limitations -
my orginial idea give elegant code example, demonstrate impact of instruction cache limitations. wrote following piece of code, creates large amount of identical functions, using template metaprogramming.
volatile int checksum; void (*funcs[max_funcs])(void); template <unsigned t> __attribute__ ((noinline)) static void work(void) { ++checksum; } template <unsigned t> static void create(void) { funcs[t - 1] = &work<t - 1>; create<t - 1>(); } template <> void create<0>(void) { } int main() { create<max_funcs>(); (unsigned range = 1; range <= max_funcs; range *= 2) { checksum = 0; (unsigned = 0; < workload; ++i) { funcs[i % range](); } } return 0; }
the outer loop varies amount of different functions called using jump table. each loop pass, time taken invoke workload
functions measured. results? following chart shows average run time per function call in relation used range. blue line shows data measured on core i7 machine. comparative measurement, depicted red line, carried out on pentium 4 machine. yet when comes interpreting these lines, seem somehow struggling...
the jumps of piecewise constant red curve occur total memory consumption functions within range exceed capacity of 1 cache level on tested machine, has no dedicated instruction cache. small ranges (below 4 in case) however, run time still increases amount of functions. may related branch prediction efficiency, since every function call reduces unconditional jump in case, i'm not sure if there should branching penalty @ all.
the blue curve behaves quite differently. run time constant small ranges , increases logarithmic thereafter. yet larger ranges, curve seems approaching constant asymptote again. how can qualitative differences of both curves explained?
i using gcc mingw win32 x86 v.4.8.1 g++ -std=c++11 -ftemplate-depth=65536
, no compiler optimization.
any appreciated. interested in idea on how improve experiment itself. in advance!
first, let me how you've approached problem, neat solution intentional code bloating. however, there might still several possible issues test -
you measure warmup time. didn't show you've placed time checks, if it's around internal loop - first time until reach range/2 you'd still enjoy warmup of previous outer iteration. instead, measure warm performance - run each internal iteration several times (add loop in middle), , take timestamp after 1-2 rounds.
you claim have measure several cache levels, l1 cache 32k, graph ends. assuming counts in terms of "range", each function ~21 bytes (at least on gcc 4.8.1), you'll reach @ 256kb, scratching size of l2.
you didn't specify cpu model (i7 has @ least 4 generations in market now, haswell, ivybridge, sandybridge , nehalem). differences quite large, example additional uop-cache since sandybrige complicated storage rules , conditions. baseline complicating things, if recall correctly p4 had trace cache might cause sorts of performance impacts. should check option disable them if possible.
don't forget tlb - though doesn't play role here in such tightly organized code, number of unique 4k pages should not exceed itlb (128 entries), , before may start having collisions if os did not spread physical code pages enough avoid itlb collisions.
Comments
Post a Comment