Wednesday, April 16, 2014

A little fun with L1/L2/L3 cache RAM fetch speed

I know it has been a while since I've posted. And that's pretty much a reflection of being super busy as well as not having a little nugget at the tips of my fingers to post.

So I'm back.

I wrote the little program below to test the cost of fetching values from lookup tables of various sizes. Presumably this would show up in performance differences once the lookup table exceeds the size of the various cache it could easily fit within.

So, my little experiment went as followed:


Size of LUT:      Clock ticks: 
10k               100k 
100k              110k
1m                90k 
10m               760k
100m              1000k
1g                1800k

From end to end there's a roughly 20x performance difference for the same number of operations. Presumably this is the difference between the 10k, 100k, 1m size array fitting comfortably in the L3 cache, while the 1g fits in RAM. I'm running on a haswell 4770S which has 256kb of L1, 1m of L2, and 8m of L3 cache size, and since the lookup table contains a table of pointers (8 bytes). The LUT table size should be multipled by 8 to get the total memory requirements.

Interesting (but not surprising) with the 20x performance differential when abusing memory.


int main()
{
   uint32_t *tbl;

   //size of table below, i.e. 1G
   uint32_t  size = 1000 * 1000 * 1000;

   tbl = calloc(sizeof(uint32_t*) * size, 1);
   uint32_t i;
   for (i = 0; i < size; ++i) {
     if (rand() % 2)
       tbl[i] = 1;
   }

   uint32_t ct = 100000000;

   //start timing here
   int accum = 0;
   clock_t start, end;
   start = clock();

   while (ct) {
     accum += tbl[ct % size];
     --ct;
   }

   //done with timing here
   end =    clock();
   printf("done: %ld\n", end - start);
   printf("really done: %d\n", accum);
}

No comments:

Post a Comment