I wrote this initially in 2007 or so and then accidentally deleted it a few weeks ago. Thanks to Google cache, though, I now have the chance to see if anything changed in the past six years and update the post.
Back in 2007
Not many people seem to realize just how expensive memory sharing and cache-line bouncing are in parallel systems. The following code is intended as a simple demonstration/benchmark of the bouncing cost.
The main thread spawns NTHREADS worker threads. Each worker thread independently increments a distinct memory location NLOOPS times. In the first variant, the memory locations are in different cache lines. In the second variant (obtained by compiling with -DBOUNCE), the memory locations are adjacent and share a single cache line.
```c
#include <stdio.h>
#include <stdint.h>
#include <pthread.h>

#ifndef NTHREADS
#define NTHREADS 8
#endif

#define NLOOPS 1000000000

static void *
worker (void *_cnt)
{
    uint64_t i;
    volatile uint64_t *cnt = _cnt;

    *cnt = 0;
    for (i = 0; i < NLOOPS; ++i)
        (*cnt)++;
    return 0;
}

/* No more than 8 threads!  Each row is 128 * 8 = 1024 bytes, so the
   cnt [i][0] slots live in different cache lines, while the
   cnt [0][i] slots are adjacent within a single cache line. */
static uint64_t cnt [8][128];

int main ()
{
    int i;
    uint64_t sum;
    pthread_t t [NTHREADS];

    for (i = 0; i < NTHREADS; ++i) {
#ifdef BOUNCE
        pthread_create (&t [i], 0, worker, &cnt [0][i]);
#else
        pthread_create (&t [i], 0, worker, &cnt [i][0]);
#endif
    }

    sum = 0;
    for (i = 0; i < NTHREADS; ++i) {
        pthread_join (t [i], 0);
#ifdef BOUNCE
        sum += cnt [0][i];
#else
        sum += cnt [i][0];
#endif
    }

    printf ("%llu\n", (unsigned long long) sum);
    return 0;
}
```
I ran the benchmark on GNU/Linux (kernel 2.6.15), on a Sun Fire X4200 with two dual-core 2.2 GHz Opterons (four cores in total), and obtained the following results:
```
$ gcc -O3 cache-test.c -lpthread
$ time ./a.out
800000000

real    0m0.648s
user    0m2.396s
sys     0m0.000s
$ time ./a.out
800000000

real    0m0.615s
user    0m2.384s
sys     0m0.004s
$ time ./a.out
800000000

real    0m0.622s
user    0m2.436s
sys     0m0.004s
$ gcc -O3 -DBOUNCE cache-test.c -lpthread
$ time ./a.out
800000000

real    0m9.991s
user    0m29.374s
sys     0m0.008s
$ time ./a.out
800000000

real    0m9.957s
user    0m29.310s
sys     0m0.016s
$ time ./a.out
800000000

real    0m9.974s
user    0m29.330s
sys     0m0.012s
$
```
Taking the fastest run from each set gives 9.957/0.615 ≈ 16.2; in other words, the cache bouncing makes the program roughly 16 times slower.
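The non-bounce layout dodges the problem simply by spacing the counters 1024 bytes apart, so each one lands in its own cache line. In C11 the same intent can be stated explicitly with alignment; the following is a minimal sketch of that approach (my own, not from the original benchmark), assuming 64-byte cache lines:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdalign.h>
#include <pthread.h>

#define NTHREADS 8
#define NLOOPS   1000000000

/* One counter per thread, each padded and aligned to its own
   (assumed 64-byte) cache line so no two threads share a line. */
struct padded_counter {
    alignas (64) volatile uint64_t value;
};

static struct padded_counter cnt [NTHREADS];

static void *
worker (void *arg)
{
    struct padded_counter *c = arg;
    uint64_t i;

    for (i = 0; i < NLOOPS; ++i)
        c->value++;
    return 0;
}

int main ()
{
    pthread_t t [NTHREADS];
    uint64_t sum = 0;
    int i;

    for (i = 0; i < NTHREADS; ++i)
        pthread_create (&t [i], 0, worker, &cnt [i]);

    for (i = 0; i < NTHREADS; ++i) {
        pthread_join (t [i], 0);
        sum += cnt [i].value;
    }

    printf ("%llu\n", (unsigned long long) sum);
    return 0;
}
```

Built the same way (gcc -O3 ... -lpthread), this should behave like the non-BOUNCE variant above, since no two counters ever share a cache line.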
Fast forward to 2013
Again, the program was compiled with and without `-DBOUNCE`, but this time I varied the number of threads from 1 to 8. The processor is an Intel Core i7-3820, with 4 physical cores and 2 hyper-threads per core. The data (time in seconds) follows:

Threads | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
bounce | 1.970 | 1.992 | 2.256 | 2.453 | 3.767 | 4.394 | 5.499 | 2.794 |
no bounce | 1.963 | 1.993 | 2.023 | 2.027 | 2.071 | 2.076 | 2.101 | 2.177 |
The data is not really surprising: the cache bounces cripple the otherwise perfect linear scalability. Except the last data point! With all 8 hyper-threads working, the performance is back in line. What's going on here?
For now, I have only one explanation.
A wizard did it.