Monday, October 21, 2013

Non-recursive inorder and postorder tree traversal

Depth-first traversal using an explicit stack is a well-known technique for traversing large trees, where the process stack may not be deep enough for recursion. For reference, here is one simple implementation:

#include <stack>

struct tree {
    int value;
    tree *left;
    tree *right;
};

void preorder (tree *t, void (*visit) (tree *)) {
    std::stack<tree *> work;

    if (t == nullptr)
        return;

    work.push (t);
    while (! work.empty ()) {
        t = work.top ();
        work.pop ();

        visit (t);

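        /* Push the right child first, so that the left child is popped and visited first. */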
        if (t->right)
            work.push (t->right);
        if (t->left)
            work.push (t->left);
    }
}
However, this algorithm does a pre-order traversal, i.e. the root of each subtree is visited before its children. But what about a post-order or an in-order traversal? When we pop a node from the stack, we are in one of two situations: we may need to visit the node now, or we may need to push its children and visit it later.

The insight is to use a secondary stack, in which we keep track of the parent nodes that have to be visited later. When a node comes out of the primary stack for the first time, we put it back, together with its children, in the appropriate order, but we also put it on the secondary stack. After we have processed all of the node's children (or all of the left subtree, for an in-order traversal), the node pops off the primary stack again, but this time it is also at the top of the secondary stack! Now we know the time has come to visit it. The following two functions implement this idea for in-order and post-order traversals:
void inorder (tree *t, void (*visit) (tree *)) {
    std::stack<tree *> parents;
    std::stack<tree *> work;
    
    if (t == nullptr)
        return;

    work.push (t);

    while (! work.empty ()) {
        t = work.top ();
        work.pop ();

        if (! parents.empty () && t == parents.top()) {
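            /* Second time on top of both stacks: the left subtree is done, visit the node now. */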
            parents.pop ();
            visit (t);
        } else {
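            /* First time: remember the node and re-push it between its subtrees. */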
            parents.push (t);
            if (t->right)
                work.push (t->right);
            work.push (t);
            if (t->left)
                work.push (t->left);
        }
    }
}

void postorder (tree *t, void (*visit)(tree *)) {
    std::stack<tree *> parents;
    std::stack<tree *> work;
    
    if (t == nullptr)
        return;

    work.push (t);

    while (! work.empty ()) {
        t = work.top ();

        if (! parents.empty () && t == parents.top ()) {
            work.pop ();
            parents.pop ();
            visit (t);
        } else {
            parents.push (t);
            if (t->right)
                work.push (t->right);
            if (t->left)
                work.push (t->left);
        }
    }
}
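
Here is a minimal sketch (not part of the original functions, just an illustration) of how the three traversals might be exercised, assuming the tree struct and the three functions above are in the same translation unit:

#include <cstdio>

static void print_value (tree *t) {
    std::printf ("%d ", t->value);
}

int main () {
    /* Build a small tree by hand:
                 4
                / \
               2   6
              / \ / \
             1  3 5  7  */
    tree n1 = {1, nullptr, nullptr};
    tree n3 = {3, nullptr, nullptr};
    tree n5 = {5, nullptr, nullptr};
    tree n7 = {7, nullptr, nullptr};
    tree n2 = {2, &n1, &n3};
    tree n6 = {6, &n5, &n7};
    tree n4 = {4, &n2, &n6};

    preorder (&n4, print_value);   /* prints: 4 2 1 3 6 5 7 */
    std::printf ("\n");
    inorder (&n4, print_value);    /* prints: 1 2 3 4 5 6 7 */
    std::printf ("\n");
    postorder (&n4, print_value);  /* prints: 1 3 2 5 7 6 4 */
    std::printf ("\n");
    return 0;
}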

Sunday, October 6, 2013

Geometric meaning of quaternion addition

I was reading some posts on StackOverflow about representing a rotation from one vector to another. One of the answers suggested:

... calculate the quaternion which results in twice the required rotation (as detailed in the other solution), and find the quaternion half-way between that and zero degrees. ... Calculating the half-way quaternion is simply a matter of summing the quaternions and normalizing the result, just like with vectors.
OK, I didn't remember anything about the geometric interpretation of quaternion addition from my favourite references, Real-Time Rendering or Physically Based Rendering: From Theory to Implementation.

So I reached for the pencil ...

We have two unit quaternions, representing rotations by different angles around the same unit axis \(\vec{a}\):

$$  Q = <\vec{a}sin\frac{\theta}{2}, cos\frac{\theta}{2}> \\ P = <\vec{a}sin\frac{\phi}{2}, cos\frac{\phi}{2}> $$

The sum of the quaternions is

$$ P + Q = <\vec{a}(sin \frac{\theta}{2} + sin \frac{\phi}{2}), cos \frac{\theta}{2} + cos \frac{\phi}{2}> $$

Let's normalize this quaternion, using the following substitutions and trigonometric identities:

$$\begin{align}
    u = & \frac{\theta}{2} \\
    v = & \frac{\phi}{2} \\
    sin u + sin v = & 2sin\frac{u+v}{2}cos\frac{u-v}{2} \\
    cos u + cos v = & 2cos\frac{u+v}{2}cos\frac{u-v}{2}
\end{align}
 $$

For the norm of the quaternion, remembering that the rotation axis \(\vec{a}\) is a unit vector, i.e. \(|\vec{a}|^2 = 1\), we have

$$\begin{align}
|P+Q|^2 = & |\vec{a}|^2 4sin^2\frac{u+v}{2}cos^2\frac{u-v}{2} + 4cos^2\frac{u+v}{2}cos^2\frac{u-v}{2} = \\
 = & 4cos^2\frac{u-v}{2}(sin^2\frac{u+v}{2} + cos^2\frac{u+v}{2}) = \\
 = & 4cos^2\frac{u-v}{2} \\
 |P + Q| = & 2cos\frac{u-v}{2}
\end{align}
$$

Dividing each quaternion component by the norm gives us

$$
< \vec{a}\frac{2sin\frac{u+v}{2}cos\frac{u-v}{2}}{2cos\frac{u-v}{2}}, \frac{2cos\frac{u+v}{2}cos\frac{u-v}{2}}{2cos\frac{u-v}{2}}> = \\
=  <\vec{a}sin\frac{u+v}{2}, cos\frac{u+v}{2}> = \\
= <\vec{a}sin\frac{\theta+\phi}{4}, cos\frac{\theta+\phi}{4}>
$$

which is exactly the quaternion for the rotation around the axis \(\vec{a}\) by the angle \(\frac{\theta+\phi}{2}\), the angle halfway between \(\theta\) and \(\phi\).
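
As a quick numerical sanity check (a small sketch of mine, not from the original post; the axis and angles below are arbitrary choices), we can sum and normalize two quaternions sharing an axis and compare against the directly constructed half-way rotation:

#include <cmath>
#include <cstdio>

/* Quaternion stored as (x, y, z, w) = (axis * sin(angle/2), cos(angle/2)). */
struct quat { double x, y, z, w; };

static quat axis_angle (double ax, double ay, double az, double angle) {
    double s = std::sin (angle / 2), c = std::cos (angle / 2);
    return { ax * s, ay * s, az * s, c };
}

static quat normalized_sum (quat p, quat q) {
    double x = p.x + q.x, y = p.y + q.y, z = p.z + q.z, w = p.w + q.w;
    double n = std::sqrt (x * x + y * y + z * z + w * w);
    return { x / n, y / n, z / n, w / n };
}

int main () {
    double theta = 0.7, phi = 1.9;

    quat q = axis_angle (0, 0, 1, theta);
    quat p = axis_angle (0, 0, 1, phi);

    quat h = normalized_sum (p, q);                    /* P + Q, normalized */
    quat r = axis_angle (0, 0, 1, (theta + phi) / 2);  /* half-way rotation */

    /* The two lines should agree up to floating-point rounding. */
    std::printf ("%f %f %f %f\n", h.x, h.y, h.z, h.w);
    std::printf ("%f %f %f %f\n", r.x, r.y, r.z, r.w);
    return 0;
}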

The cost of false sharing/cache bouncing

I wrote this initially in 2007 or so and then accidentally deleted it a few weeks ago. Thanks to Google cache, though, I now have the chance to see if anything changed in the past six years and update the post.

Back in 2007


Not many people seem to realize just how expensive memory sharing and cache bouncing are in parallel systems. The following code is intended as a simple demonstration/benchmark of the bounce costs.

The main thread spawns NTHREADS worker threads. Each worker thread proceeds to independently increment a distinct memory location, NLOOPS times. In the first variant, the memory locations are in different cache lines. In the second variant (obtained by compiling with -DBOUNCE), the memory locations are adjacent and in the same cache line.
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <pthread.h>

#ifndef NTHREADS
#define NTHREADS 8
#endif
#define NLOOPS 1000000000

static void *
worker (void *_cnt)
{
  uint64_t i;
  volatile uint64_t *cnt = _cnt;

  *cnt = 0;

  for (i = 0; i < NLOOPS; ++i)
    (*cnt)++;

  return 0;
}

// No more than 8 threads !!!
static uint64_t cnt [8][128];

int
main ()
{
  int i;
  uint64_t sum;
  pthread_t t [NTHREADS];

  for (i = 0; i < NTHREADS; ++i)
    {
#ifdef BOUNCE
        pthread_create (&t [i], 0, worker, &cnt [0][i]);
#else
        pthread_create (&t [i], 0, worker, &cnt [i][0]);
#endif
    }

  sum = 0;
  for (i = 0; i < NTHREADS; ++i)
    {
      pthread_join (t [i], 0);
#ifdef BOUNCE
      sum += cnt [0][i];
#else
      sum += cnt [i][0];
#endif
    }

  printf ("%llu\n", sum);
  return 0;
}
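
As an aside (a sketch of mine, not part of the original benchmark), the per-thread padding that cnt[8][128] achieves by hand can also be expressed with C++11 alignment, assuming a 64-byte cache line:

#include <cstdint>

/* Each counter gets its own (assumed 64-byte) cache line, so writes by
   different threads never land in the same line. */
struct alignas (64) padded_counter {
    std::uint64_t value;
};

static padded_counter counters [8];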

I ran the benchmark on GNU/Linux, kernel 2.6.15, 4x (2x dual-core) Opteron 2.2 GHz (Sun Fire X4200) and obtained the following results:
$ gcc -O3 cache-test.c -lpthread
$ time ./a.out
800000000

real    0m0.648s
user    0m2.396s
sys     0m0.000s
$ time ./a.out
800000000

real    0m0.615s
user    0m2.384s
sys     0m0.004s
$ time ./a.out
800000000

real    0m0.622s
user    0m2.436s
sys     0m0.004s
$ gcc -O3 -DBOUNCE cache-test.c -lpthread
$ time ./a.out
800000000

real    0m9.991s
user    0m29.374s
sys     0m0.008s
$ time ./a.out
800000000

real    0m9.957s
user    0m29.310s
sys     0m0.016s
$ time ./a.out
800000000

real    0m9.974s
user    0m29.330s
sys     0m0.012s
$

Taking the minimum of each measurement gives 9.957/0.615 ≈ 16.2; in other words, the cache bounces cause roughly a 16x (over 1500%) slowdown.

Fast forward to 2013

Again, the program was compiled with and without `-DBOUNCE`, but this time I varied the number of threads, from 1 to 8.  The processor is Intel Core i7-3820, with 4 physical cores and 2 hyper-threads per core. The data follows:

Threads       1      2      3      4      5      6      7      8
no bounce     1.963  1.993  2.023  2.027  2.071  2.076  2.101  2.177
bounce        1.970  1.992  2.256  2.453  3.767  4.394  5.499  2.794

(real time in seconds)


The numbers are not really surprising: the cache bounces cripple the otherwise perfect linear scalability.

Except for the last data point! With all 8 hyper-threads working, the performance is back in line. What's going on here?

For now, I have only one explanation.

A wizard did it.

Fast SSE quaternion multiplication

The other day I needed some quaternion multiplication code for a certain pet project of mine. A quick Google search later, I came to this one at StackOverflow: How to multiply two quaternions with minimal instructions?
#include <pmmintrin.h> /* SSE3 intrinsics */

/* multiplication of two quaternions (x, y, z, w) x (a, b, c, d) */

__m128 _mm_cross4_ps(__m128 xyzw, __m128 abcd)
{
    /* The product of two quaternions is:                                 */
    /* (X,Y,Z,W) = (xd+yc-zb+wa, -xc+yd+za+wb, xb-ya+zd+wc, -xa-yb-zc+wd) */

    __m128 wzyx = _mm_shuffle_ps(xyzw, xyzw, _MM_SHUFFLE(0,1,2,3));
    __m128 baba = _mm_shuffle_ps(abcd, abcd, _MM_SHUFFLE(0,1,0,1));
    __m128 dcdc = _mm_shuffle_ps(abcd, abcd, _MM_SHUFFLE(2,3,2,3));

    /* variable names below are for parts of components of result (X,Y,Z,W) */
    /* nX stands for -X and similarly for the other components             */

    /* znxwy  = (xb - ya, zb - wa, wd - zc, yd - xc) */
    __m128 ZnXWY = _mm_hsub_ps(_mm_mul_ps(xyzw, baba), _mm_mul_ps(wzyx, dcdc));

    /* xzynw  = (xd + yc, zd + wc, wb + za, yb + xa) */
    __m128 XZYnW = _mm_hadd_ps(_mm_mul_ps(xyzw, dcdc), _mm_mul_ps(wzyx, baba));

    /* _mm_shuffle_ps(XZYnW, ZnXWY, _MM_SHUFFLE(3,2,1,0)) */
    /*      = (xd + yc, zd + wc, wd - zc, yd - xc)        */
    /* _mm_shuffle_ps(ZnXWY, XZYnW, _MM_SHUFFLE(2,3,0,1)) */
    /*      = (zb - wa, xb - ya, yb + xa, wb + za)        */

    /* _mm_addsub_ps adds elements 1 and 3 and subtracts elements 0 and 2, so we get: */
    /* _mm_addsub_ps(*, *) = (xd+yc-zb+wa, xb-ya+zd+wc, wd-zc+yb+xa, yd-xc+wb+za)     */

    __m128 XZWY = _mm_addsub_ps(_mm_shuffle_ps(XZYnW, ZnXWY, _MM_SHUFFLE(3,2,1,0)),
                                _mm_shuffle_ps(ZnXWY, XZYnW, _MM_SHUFFLE(2,3,0,1)));

    /* now we only need to shuffle the components in place and return the result      */
    return _mm_shuffle_ps(XZWY, XZWY, _MM_SHUFFLE(2,1,3,0));

    /* operations: 6 shuffles, 4 multiplications, 3 compound additions/subtractions   */ 
} 

Which is pretty nice, compiling to only 18 instructions. There are some things of note, though:
  • the `_mm_shuffle_ps` intrinsic generates the `shufps` insn, which on recent Intel processors has 1 cycle latency and 1 cycle throughput. When swizzling a single register, though, `_mm_shuffle_epi32` (the `pshufd` insn) seems preferable, because it has 1 cycle latency but 0.5 cycles reciprocal throughput, which, in my understanding, means that two of these insns can be issued in a single cycle;
  • the `_mm_hadd_ps`/`_mm_hsub_ps` intrinsics are a bit on the heavy side for addition/subtraction, with 5 cycles of latency and 2 cycles throughput, i.e. such an insn can be issued only every other cycle and it takes 5 cycles until its result is available for subsequent insns. Other addition/subtraction insns (`addps`, `addsubps`, etc.) take 3 cycles with 1 cycle of throughput.
And I got busy coding, having a nice ballpark figure (18 insns) to aim for. In the end, I came up with the following code:

#include <immintrin.h>

#define _mm_pshufd(r,i) _mm_castsi128_ps (_mm_shuffle_epi32 (_mm_castps_si128 (r), i))

/* Nehalem/Westmere/SandyBridge/IvyBridge insn timings.  */
__m128
qmul (__m128 abcd, __m128 xyzw)
{
  __m128 t0 = _mm_pshufd (abcd, _MM_SHUFFLE (3, 3, 3, 3)); /* 1, 0.5 */
  __m128 t1 = _mm_pshufd (xyzw, _MM_SHUFFLE (2, 3, 0, 1)); /* 1, 0.5 */

  __m128 t3 = _mm_pshufd (abcd, _MM_SHUFFLE (0, 0, 0, 0)); /* 1, 0.5 */
  __m128 t4 = _mm_pshufd (xyzw, _MM_SHUFFLE (1, 0, 3, 2)); /* 1, 0.5 */

  __m128 t5 = _mm_pshufd (abcd, _MM_SHUFFLE (1, 1, 1, 1)); /* 1, 0.5 */
  __m128 t6 = _mm_pshufd (xyzw, _MM_SHUFFLE (2, 0, 3, 1)); /* 1, 0.5 */

  /* [d,d,d,d]*[z,w,x,y] = [dz,dw,dx,dy] */
  __m128 m0 = _mm_mul_ps (t0, t1); /* 5/4, 1 */

  /* [a,a,a,a]*[y,x,w,z] = [ay,ax,aw,az]*/
  __m128 m1 = _mm_mul_ps (t3, t4); /* 5/4, 1 */

  /* [b,b,b,b]*[z,x,w,y] = [bz,bx,bw,by]*/
  __m128 m2 = _mm_mul_ps (t5, t6); /* 5/4, 1 */

  /* [c,c,c,c]*[w,z,x,y] = [cw,cz,cx,cy] */
  __m128 t7 = _mm_pshufd (abcd, _MM_SHUFFLE (2, 2, 2, 2)); /* 1, 0.5 */
  __m128 t8 = _mm_pshufd (xyzw, _MM_SHUFFLE (3, 2, 0, 1)); /* 1, 0.5 */

  __m128 m3 = _mm_mul_ps (t7, t8); /* 5/4, 1 */

  /* 1 */
  /* [dz,dw,dx,dy]+-[ay,ax,aw,az] = [dz+ay,dw-ax,dx+aw,dy-az] */
  __m128 e = _mm_addsub_ps (m0, m1); /* 3, 1 */

  /* 2 */
  /* [dx+aw,dz+ay,dy-az,dw-ax] */
  e = _mm_pshufd (e, _MM_SHUFFLE (1, 3, 0, 2)); /* 1, 0.5 */

  /* [dx+aw,dz+ay,dy-az,dw-ax]+-[bz,bx,bw,by] = [dx+aw+bz,dz+ay-bx,dy-az+bw,dw-ax-by]*/
  e = _mm_addsub_ps (e, m2); /* 3, 1 */

  /* 2 */
  /* [dz+ay-bx,dw-ax-by,dy-az+bw,dx+aw+bz] */
  e = _mm_pshufd (e, _MM_SHUFFLE (2, 0, 1, 3)); /* 1, 0.5 */

  /* [dz+ay-bx,dw-ax-by,dy-az+bw,dx+aw+bz]+-[cw,cz,cx,cy]
     = [dz+ay-bx+cw,dw-ax-by-cz,dy-az+bw+cx,dx+aw+bz-cy] */
  e = _mm_addsub_ps (e, m3); /* 3, 1 */

  /* 2 */
  /* [dw-ax-by-cz,dz+ay-bx+cw,dy-az+bw+cx,dx+aw+bz-cy] */
  e = _mm_pshufd (e, _MM_SHUFFLE (2, 3, 1, 0)); /* 1, 0.5 */
  return e;
} 

Also 18 insns and, according to my rough calculations and timing with `rdtsc`, about 18 cycles per quaternion multiplication. In the meantime, I also tweaked the other code a bit, replacing `shufps` with `pshufd` where possible, reducing it to 16 insns. I'm not entirely sure which version is faster, but the difference seems insignificant for any application of this code.
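
For reference (a sketch of mine, not from either StackOverflow answer), a scalar implementation of the (x, y, z, w) x (a, b, c, d) formula quoted in the comments above is handy for spot-checking either SSE version; it takes its arguments in the same order as `_mm_cross4_ps`:

#include <immintrin.h>

static __m128
qmul_scalar (__m128 xyzw, __m128 abcd)
{
  float q [4], r [4];
  _mm_storeu_ps (q, xyzw);
  _mm_storeu_ps (r, abcd);

  float x = q [0], y = q [1], z = q [2], w = q [3];
  float a = r [0], b = r [1], c = r [2], d = r [3];

  /* (X,Y,Z,W) = (xd+yc-zb+wa, -xc+yd+za+wb, xb-ya+zd+wc, -xa-yb-zc+wd) */
  return _mm_set_ps (-x*a - y*b - z*c + w*d,   /* W */
                      x*b - y*a + z*d + w*c,   /* Z */
                     -x*c + y*d + z*a + w*b,   /* Y */
                      x*d + y*c - z*b + w*a);  /* X */
}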

Cheers.