Consider the following C code:

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>

#define NUM_THREADS 4
#define NUM_ITERATIONS (1L << 32)

long *data;

void *thread_function( void *args )
{
    long i;
    long j = (long)args;    /* thread-specific index into data */

    for ( i = 0; i < NUM_ITERATIONS; i++ )
    {
        data[j] = i;
    }

    return NULL;
}

int main( int argc, char **argv )
{
    long i, j;
    pthread_t ids[NUM_THREADS-1];    /* the main thread handles the last index itself */
    struct timeval start, end;
    double d;

    data = (long*)malloc( sizeof(long)*NUM_THREADS );

    for ( j = 1; j <= NUM_THREADS; j++ )
    {
        for ( i = 0; i < NUM_THREADS; i++ )
        {
            data[i] = 0;
        }

        gettimeofday( &start, NULL );
        for ( i = 0; i < j-1; i++ )
        {
            pthread_create( ids+i, NULL, thread_function, (void*)i );
        }

        /* the main thread does the work for index j-1 */
        thread_function( (void*)i );

        for ( i = 0; i < j-1; i++ )
        {
            pthread_join( ids[i], NULL );
        }

        gettimeofday( &end, NULL );

        d = end.tv_sec - start.tv_sec + (1.0e-6)*end.tv_usec - (1.0e-6)*start.tv_usec;

        printf( "%ld) %g %g ips\n", j, d, ((double)j*NUM_ITERATIONS)/d );
    }

    free( data );

    return 0;
}

Basically this code

    for ( i = 0; i < NUM_ITERATIONS; i++ )
    {
        data[j] = i;
    }

runs in parallel on 1 to NUM_THREADS threads and prints how long it takes to run and how many iterations are executed per second. Here data is an array and j is a thread-specific index into that array, so each thread is given its own memory address. This is important to recognize: the threads are not sharing any data.

As we increase the number of threads (up to the number of cores), the running time should stay roughly the same, while the number of iterations per second should increase linearly (remember, each thread does NUM_ITERATIONS iterations, so adding a thread adds NUM_ITERATIONS of work). Let's see what happens when we actually run this code:

1) 15.758 2.72558e+08 ips
2) 28.8921 2.97311e+08 ips
3) 40.1948 3.20561e+08 ips
4) 46.6236 3.6848e+08 ips

Well, that is nothing like what we were expecting. The running time increases significantly, and the iterations per second increase only slightly. It would appear that the threads are interfering with each other, yet we saw above that they don't actually share any data. So what's happening?

This is an example of false sharing. False sharing occurs when multiple threads modify data that shares the same cache line. I'm not going to go into too much detail about caching; see the Wikipedia article on CPU caches for more details. Briefly, each core has a small private cache that stores copies of regions of main memory (accessing the local copy is faster than accessing main memory). When the corresponding region of main memory changes, the value stored in the cache is no longer correct, so if the core wants to use anything in that region it must fetch a new copy. When multiple threads share a single cache line, a modification by one thread forces the other threads to refresh their caches. This refresh makes the threads wait on memory transfers far more than they would if each thread were running exclusively. If there are enough threads issuing enough memory requests, it is also possible to saturate the memory bus, which delays memory transfers even more.
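As an aside, if you are curious what the cache line size actually is on your machine, a small program like the following will report it. This is my own addition rather than part of the original example, and it relies on _SC_LEVEL1_DCACHE_LINESIZE, a glibc extension that not every system provides:

/* A minimal sketch: query the L1 data cache line size.
 * _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension and may be
 * missing (or report 0) on some systems. */
#include <stdio.h>
#include <unistd.h>

int main( void )
{
    long line = sysconf( _SC_LEVEL1_DCACHE_LINESIZE );

    if ( line > 0 )
        printf( "L1 data cache line size: %ld bytes\n", line );
    else
        printf( "cache line size not reported; 64 bytes is a common value\n" );

    return 0;
}

On most current x86_64 processors this prints 64 bytes, so padding each element out to 128 bytes, as in the fixed version below, is more than enough.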

To fix this we need to make sure that the threads aren't sharing cache lines. The easiest way to do that is to add padding around the data so that each thread's element occupies its own cache line.

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>

#define NUM_THREADS 4
#define NUM_ITERATIONS (1L << 32)

/* padded element size: at least as large as any common cache line */
#define CACHE_SIZE 128

typedef struct {
    long d;
    char buffer[CACHE_SIZE-sizeof(long)];    /* padding so consecutive elements never share a cache line */
} Data;

Data *data;

void *thread_function( void *args )
{
    long i;
    long j = (long)args;

    for ( i = 0; i < NUM_ITERATIONS; i++ )
    {
        data[j].d = i;
    }

    return NULL;
}

int main( int argc, char **argv )
{
    long i, j;
    pthread_t ids[NUM_THREADS-1];
    struct timeval start, end;
    double d;

    data = (Data*)malloc( sizeof(Data)*NUM_THREADS );

    for ( j = 1; j <= NUM_THREADS; j++ )
    {
        for ( i = 0; i < NUM_THREADS; i++ )
        {
            data[i].d = 0;
        }

        gettimeofday( &start, NULL );
        for ( i = 0; i < j-1; i++ )
        {
            pthread_create( ids+i, NULL, thread_function, (void*)i );
        }

        thread_function( (void*)i );

        for ( i = 0; i < j-1; i++ )
        {
            pthread_join( ids[i], NULL );
        }

        gettimeofday( &end, NULL );

        d = end.tv_sec - start.tv_sec + (1.0e-6)*end.tv_usec - (1.0e-6)*start.tv_usec;

        printf( "%ld) %g %g ips\n", j, d, ((double)j*NUM_ITERATIONS)/d );
    }

    free( data );

    return 0;
}

We have changed our array from type long to type Data, a struct that contains padding. The padding guarantees that one struct's data won't share a cache line with another struct's, in particular with a struct used by another thread. Now let's run this code:

1) 15.6487 2.74462e+08 ips
2) 15.0241 5.71744e+08 ips
3) 13.5294 9.52364e+08 ips
4) 12.5411 1.36989e+09 ips

This is what we were expecting (perhaps even a little better).
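As a side note, if your compiler and C library support C11, you can get the same effect by requesting the alignment directly instead of padding by hand. The following is only a sketch of that variation (it is not the code used for the timings above):

/* A sketch using C11 alignment instead of manual padding (my own
 * variation, not the code from this post).  alignas(128) makes
 * sizeof(Data) a multiple of 128, so consecutive elements cannot
 * share a cache line; aligned_alloc keeps the whole array on a
 * 128-byte boundary. */
#include <stdalign.h>
#include <stdlib.h>

typedef struct {
    alignas(128) long d;    /* each element gets its own 128-byte slot */
} Data;

Data *data;

int allocate_data( int num_threads )
{
    data = (Data*)aligned_alloc( 128, sizeof(Data)*num_threads );
    return data != NULL;
}

Either way, the important property is the same: no two threads write into the same cache line.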

So how does this affect Maple? I originally tried to write these examples in Maple, but I didn't see a significant effect. This is because the overhead of Maple evaluating a statement probably clobbers the shared cache line anyway, so it needs to be reloaded in either case. However, as we make Maple faster, it may be possible for false sharing to become a problem in your Maple code.

If you want to download this code and try it for yourself, be careful of compiler optimizations. The function that we execute in the threads can easily be optimized into one that does not touch the shared memory on every iteration, which hides the effect entirely.
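One way to protect against that, if you do compile with optimization enabled, is to make the writes observable by declaring the data volatile. This is my own suggestion rather than something the original code does; as a sketch, you would change the declaration and the thread function in the original program like so:

/* Making the pointed-to longs volatile forces the compiler to
 * perform every write to data[j] instead of keeping i in a register
 * and storing only the final value. */
volatile long *data;

void *thread_function( void *args )
{
    long i;
    long j = (long)args;

    for ( i = 0; i < NUM_ITERATIONS; i++ )
    {
        data[j] = i;    /* cannot be optimized away: data[j] is volatile */
    }

    return NULL;
}

The build command below passes no -O flag, so gcc compiles without optimization and the loop survives as written either way.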

I built this example on Linux, using

gcc falsesharing.c -lpthread -o falsesharing
