Wednesday, November 6, 2013

A Dynamic Programming Primer

Consider the problem of printing a paragraph neatly on a printer with a fixed-width font. The input text is a sequence of \(n\) words of lengths \(l_1, l_2, \dots, l_n\). The line length is \(M\) (the maximum number of characters per line). We expect the paragraph to be left-justified: the first word on each line starts at the leftmost position and there is exactly one space between any two words on the same line. We want the uneven right ends of all the lines to be, taken together, as neat as possible. Our criterion of neatness is that we wish to minimize the sum, over all lines except the last, of the cubes of the numbers of extra space characters at the ends of the lines. Note that if a printed line contains words \(i\) through \(j\), then the number of spaces at the end of the line is $$ f(i, j) = M - j + i - \sum_{k=i}^{j}{l_k} $$ This problem was sourced from http://www.cs.uiuc.edu/~jeffe/teaching/algorithms/ , thanks, Mr. Erickson!
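For example, if \(M = 12\) and a line holds the two words "the" and "quick" (\(l_1 = 3\), \(l_2 = 5\)), then \(f(1, 2) = 12 - 2 + 1 - 8 = 3\): the text occupies \(3 + 1 + 5 = 9\) characters, leaving three extra spaces at the end of the line.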

How do we position the line breaks? A recursive approach immediately springs to mind: we put a line break at some, as yet undecided, place and then optimally break the rest of the words into lines:
place_breaks (words[1..n]) 
   t = choose_break_position()
   place_breaks (words[1..t])
In a sequence of \(n\) words we may put a break before the first word, before the second word, before the third, ... and so on, for a total of \(n\) possible positions, as long as the words after the break fit on a single line, which is equivalent to the condition \(f(i,j) \ge 0\). Of all the allowed positions we choose the one that minimizes the cost of placing a break there. If we denote by \(C(k)\) the minimum cost of splitting words \(1\) through \(k\), then, since a break after word \(t\) leaves words \(t+1\) through \(k\) on the following line, the cost function satisfies $$ C(k) = \min_{0 \le t < k,\; f(t+1,k) \ge 0} \lbrace C(t) + f(t+1, k)^3 \rbrace $$ However, note that the spaces at the end of our last line do not participate in evaluating neatness. Therefore, for the last line only, our cost function \(T(n)\) would look like $$ T(n) = \min_{0 \le t < n,\; f(t+1,n) \ge 0} \lbrace C(t) \rbrace $$ The boundary case for the cost function is \(C(0) = 0\) and now we have all the ingredients for implementing an efficient and correct algorithm. The cost function can be evaluated bottom-up (\(C(0), C(1), \dots, C(n)\)) using \(O(n)\) additional space for storing intermediate results, and each time we choose a break position, we record it.
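As a tiny example, take \(M = 8\) and three words of lengths \(4, 3, 3\). Then \(f(1,1) = 4\), \(f(1,2) = 0\), \(f(2,2) = 5\), \(f(2,3) = 1\) and \(f(3,3) = 5\), while \(f(1,3) = -4\), i.e. all three words do not fit on one line. The table builds up as \(C(0) = 0\), \(C(1) = C(0) + f(1,1)^3 = 64\), \(C(2) = \min\lbrace C(0) + f(1,2)^3, C(1) + f(2,2)^3 \rbrace = 0\) and finally \(T(3) = \min\lbrace C(1), C(2) \rbrace = 0\), achieved at \(t = 2\): the first two words share a line with no trailing spaces and the third word goes alone on the (free) last line.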
class line_breaker {
public:
    void break_lines (size_t, const std::vector<std::string> &);

protected:
    long f (size_t, size_t);
    void calculate_C (size_t);
    size_t calculate_T (size_t);

    std::vector<size_t> S;
    struct solution {
        long cost;
        size_t pos;
    };
    std::vector<solution> C;
    size_t M;
};
This is what the implementation looks like: we encapsulate the functions and the additional space required in a class to avoid passing lots of parameters around. The member function f is our penalty function, according to the definition above (in the code words are indexed from zero, and f returns the penalty already cubed, or a negative value when the words do not fit on a line), and the member functions calculate_T and calculate_C, well, calculate the cost function for the last line break and for all the other line breaks, respectively. To speed up the calculation of the penalty function, we use the well-known trick of keeping in the vector S the accumulated lengths of the words, so \(S_i = \sum_{k=0}^i{l_k}\). The implementation of the helper functions comes directly from the formulae above:
long
line_breaker::f (size_t i, size_t j) {
    if (i >= j + 1)
        return -1;
    else {
        /* total length of words i..j, from the accumulated lengths in S */
        long y = i == 0 ? S[j] : S[j] - S[i-1];
        /* extra spaces at the end of a line holding words i..j; may be negative */
        long x = (long) M - (long) j + (long) i - y;
        /* the cubed penalty; a negative result means the words do not fit */
        return x * x * x;
    }
}

void
line_breaker::calculate_C (size_t n) {
    C[0].cost = 0;
    C[0].pos = 0;

    for (size_t i = 1; i <= n; ++i) {
        size_t t, min_t = 0;
        long min = LONG_MAX;
        
        for (t = 0; t < i; ++t) {
            if (f (t, i-1) >= 0 && C[t].cost + f (t, i-1) < min) {
                min = C[t].cost + f(t, i-1);
                min_t = t;
            }
        }
        
        C[i].cost = min;
        C[i].pos = min_t;
    }
}

size_t
line_breaker::calculate_T (size_t n) {
    size_t t, min_t = 0;
    long min = LONG_MAX;

    calculate_C (n);
    for (t = 0; t < n; ++t) {
        if (f(t, n-1) >= 0 && C[t].cost < min) {
            min = C[t].cost;
            min_t = t;
        }
    }

    return min_t;
}
And now for the function that performs the actual splitting of the text:
void
line_breaker::break_lines (size_t m, const std::vector<std::string> &text) {
We pass the line length in the parameter m and the text to format in the parameter text.
    S.clear();
    C.clear();
Clean up whatever is left from a previous invocation.
    size_t s = 0, n;
    for (const auto &str : text) {
        n = str.size();
        if (n > m)
            return;
        s += n;
        S.push_back (s);
    }

    n = text.size();
    C.resize (n+1);
    M = m;
Prepare the various working areas. Note that we check that each word length is less than or equal to the line length - this will guarantee that a solution to the problem in fact exists.
    size_t t = calculate_T (n);
... and invoke the function to place the line breaks, starting from the last one.
    std::vector<size_t> div;
    div.push_back (n);
    while (t > 0) {
        div.push_back (t);
        t = C[t].pos;
    }
    std::reverse (div.begin(), div.end());
The member function calculate_T will return the index of the last break placed. The corresponding element of the vector C will give us the position of the previous break and so on, until we reach position zero. Now we have collected all the line breaks in vector div and have only to reverse it ...
    size_t i = 0;
    for (auto j : div) {
        while (i < j) {
            std::cout << text[i] << " ";
            ++i;
        }
        std::cout << "\n";
    }
        
    std::cout << std::endl;
}
... and output the formatted lines. And that's it. The source follows in its entirety.
#include <iostream>
#include <vector>
#include <climits>
#include <algorithm>
#include <string>

class line_breaker {
public:
    void break_lines (size_t, const std::vector<std::string> &);

protected:
    long f (size_t, size_t);
    void calculate_C (size_t);
    size_t calculate_T (size_t);

    std::vector<size_t> S;
    struct solution {
        long cost;
        size_t pos;
    };
    std::vector<solution> C;
    size_t M;
};

long
line_breaker::f (size_t i, size_t j) {
    if (i >= j + 1)
        return -1;
    else {
        /* total length of words i..j, from the accumulated lengths in S */
        long y = i == 0 ? S[j] : S[j] - S[i-1];
        /* extra spaces at the end of a line holding words i..j; may be negative */
        long x = (long) M - (long) j + (long) i - y;
        /* the cubed penalty; a negative result means the words do not fit */
        return x * x * x;
    }
}

void
line_breaker::calculate_C (size_t n) {
    C[0].cost = 0;
    C[0].pos = 0;

    for (size_t i = 1; i <= n; ++i) {
        size_t t, min_t = 0;
        long min = LONG_MAX;
        
        for (t = 0; t < i; ++t) {
            if (f (t, i-1) >= 0 && C[t].cost + f (t, i-1) < min) {
                min = C[t].cost + f(t, i-1);
                min_t = t;
            }
        }
        
        C[i].cost = min;
        C[i].pos = min_t;
    }
}

size_t
line_breaker::calculate_T (size_t n) {
    size_t t, min_t = 0;
    long min = LONG_MAX;

    calculate_C (n);
    for (t = 0; t < n; ++t) {
        if (f(t, n-1) >= 0 && C[t].cost < min) {
            min = C[t].cost;
            min_t = t;
        }
    }

    return min_t;
}

void
line_breaker::break_lines (size_t m, const std::vector<std::string> &text) {
    S.clear();
    C.clear();
    
    size_t s = 0, n;
    for (const auto &str : text) {
        n = str.size();
        if (n > m)
            return;
        s += n;
        S.push_back (s);
    }

    n = text.size();
    C.resize (n+1);
    M = m;

    size_t t = calculate_T (n);

    std::vector<size_t> div;
    div.push_back (n);
    while (t > 0) {
        div.push_back (t);
        t = C[t].pos;
    }
    std::reverse (div.begin(), div.end());

    size_t i = 0;
    for (auto j : div) {
        while (i < j) {
            std::cout << text[i] << " ";
            ++i;
        }
        std::cout << "\n";
    }
        
    std::cout << std::endl;
}

std::vector<std::string> text = {
    "Lorem", "Ipsum", "is", "simply", "dummy", "text", "of", "the",
    "printing", "and", "typesetting", "industry.", "Lorem", "Ipsum",
    "has", "been", "the", "industry's", "standard", "dummy", "text",
    "ever", "since", "the", "1500s", "when", "an", "unknown", "printer",
    "took", "a", "galley", "of", "type", "and", "scrambled", "it", "to",
    "make", "a", "type", "specimen", "book.", "It", "has", "survived",
    "not", "only", "five", "centuries,", "but", "also", "the", "leap",
    "into", "electronic", "typesetting,", "remaining", "essentially",
    "unchanged.", "It", "was", "popularised", "in", "the", "1960s",
    "with", "the", "release", "of", "Letraset", "sheets", "containing",
    "Lorem", "Ipsum", "passages,", "and", "more", "recently", "with",
    "desktop", "publishing", "software", "like", "Aldus", "PageMaker",
    "including", "versions", "of", "Lorem", "Ipsum" };

int
main () {
    line_breaker br;
    br.break_lines (30, text);
}
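Note that the code uses C++11 features (range-based for loops, list initialization of the test vector), so it needs a C++11-capable compiler, e.g. g++ -std=c++11 -O2.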

Monday, October 21, 2013

Non-recursive inorder and postorder tree traversal

Depth-first traversal using a stack is a well-known technique for traversing large trees, when the process stack may not be large enough for the recursive version. For reference, here is one simple implementation:

#include <stack>

struct tree {
    int value;
    tree *left;
    tree *right;
};

void preorder (tree *t, void (*visit) (tree *)) {
    std::stack<tree *> work;

    if (t == nullptr)
        return;

    work.push (t);
    while (! work.empty ()) {
        t = work.top ();
        work.pop ();

        visit (t);

        /* push the right child first, so that the left subtree is processed first */
        if (t->right)
            work.push (t->right);
        if (t->left)
            work.push (t->left);
    }
}
However, this algorithm does a pre-order traversal, i.e. the root of each subtree is visited before its children. But how about a post-order or an in-order traversal? When we pop a node from the stack, we are in one of two situations: either we need to visit the node now, or we need to push its children and visit it later. The insight is to use a secondary stack that keeps track of the parent nodes which have to be visited later. When a node comes out of the primary stack for the first time, we put it back, together with its children, in the appropriate order, but we also put it on the secondary stack. After we have processed all of the node's children (or all of the left subtree, for an in-order traversal), the node pops off the primary stack again, but this time it is also at the top of the secondary stack! Now we know the time has come to process it. The following two functions implement this idea for in-order and post-order traversals:
void inorder (tree *t, void (*visit) (tree *)) {
    std::stack<tree *> parents;
    std::stack<tree *> work;
    
    if (t == nullptr)
        return;

    work.push (t);

    while (! work.empty ()) {
        t = work.top ();
        work.pop ();

        if (! parents.empty () && t == parents.top()) {
            parents.pop ();
            visit (t);
        } else {
            parents.push (t);
            /* push right child, the node itself, then the left child: the left
               subtree comes out first, then the node, then the right subtree */
            if (t->right)
                work.push (t->right);
            work.push (t);
            if (t->left)
                work.push (t->left);
        }
    }
}

void postorder (tree *t, void (*visit)(tree *)) {
    std::stack<tree *> parents;
    std::stack<tree *> work;
    
    if (t == nullptr)
        return;

    work.push (t);

    while (! work.empty ()) {
        t = work.top ();

        if (! parents.empty () && t == parents.top ()) {
            work.pop ();
            parents.pop ();
            visit (t);
        } else {
            /* leave the node on the work stack beneath its children; it will
               surface again only after both subtrees have been processed */
            parents.push (t);
            if (t->right)
                work.push (t->right);
            if (t->left)
                work.push (t->left);
        }
    }
}
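For completeness, here is a minimal usage sketch (an illustration added here, not part of the original code): it builds a small tree out of local variables and prints it with the two traversals above, assuming the tree struct and the functions are in scope.
#include <iostream>

static void print_node (tree *t) { std::cout << t->value << ' '; }

int main () {
    /* the tree:    2
                   / \
                  1   4
                     / \
                    3   5   */
    tree n1 {1, nullptr, nullptr};
    tree n3 {3, nullptr, nullptr};
    tree n5 {5, nullptr, nullptr};
    tree n4 {4, &n3, &n5};
    tree root {2, &n1, &n4};

    inorder (&root, print_node);   /* prints 1 2 3 4 5 */
    std::cout << "\n";
    postorder (&root, print_node); /* prints 1 3 5 4 2 */
    std::cout << std::endl;
}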

Sunday, October 6, 2013

Geometric meaning of quaternion addition

I was reading some posts on StackOverflow about representing a rotation from one vector to another, and came across the following advice:
... calculate the quaternion which results, in twice the required rotation (as detailed in the other solution), and find the quaternion half-way between that and zero degrees. ... Calculating the half-way quaternion is simply a matter of summing the quaternions and normalizing the result, just like with vectors.
OK, I didn't remember anything about the geometrical interpretation of quaternion addition from my favourite references, Real-Time Rendering and Physically Based Rendering: From Theory To Implementation.

So I reached for the pencil...

We have two unit quaternions, representing rotations by the angles \(\theta\) and \(\phi\) around the same unit axis \(\vec{a}\):

$$ Q = \langle\vec{a}\sin\frac{\theta}{2},\ \cos\frac{\theta}{2}\rangle \\ P = \langle\vec{a}\sin\frac{\phi}{2},\ \cos\frac{\phi}{2}\rangle $$

The sum of the quaternions is

$$ P + Q = \langle\vec{a}\left(\sin\frac{\theta}{2} + \sin\frac{\phi}{2}\right),\ \cos\frac{\theta}{2} + \cos\frac{\phi}{2}\rangle $$

Let's normalize this quaternion, using the trigonometric identities

$$\begin{align}
    u = & \frac{\theta}{2} \\
    v = & \frac{\phi}{2} \\
    \sin u + \sin v = & 2\sin\frac{u+v}{2}\cos\frac{u-v}{2} \\
    \cos u + \cos v = & 2\cos\frac{u+v}{2}\cos\frac{u-v}{2}
\end{align}
$$

For the norm of the quaternion we have

$$\begin{align}
|P+Q|^2 = & 4|\vec{a}|^2\sin^2\frac{u+v}{2}\cos^2\frac{u-v}{2} + 4\cos^2\frac{u+v}{2}\cos^2\frac{u-v}{2} = \\
 = & 4\cos^2\frac{u-v}{2}\left(|\vec{a}|^2\sin^2\frac{u+v}{2} + \cos^2\frac{u+v}{2}\right) = \\
 = & 4\cos^2\frac{u-v}{2}, \quad\text{since } |\vec{a}| = 1 \\
 |P + Q| = & 2\cos\frac{u-v}{2}
\end{align}
$$

Dividing each quaternion component by the norm gives us

$$
\langle \vec{a}\frac{2\sin\frac{u+v}{2}\cos\frac{u-v}{2}}{2\cos\frac{u-v}{2}},\ \frac{2\cos\frac{u+v}{2}\cos\frac{u-v}{2}}{2\cos\frac{u-v}{2}}\rangle = \\
= \langle\vec{a}\sin\frac{u+v}{2},\ \cos\frac{u+v}{2}\rangle = \\
= \langle\vec{a}\sin\frac{\theta+\phi}{4},\ \cos\frac{\theta+\phi}{4}\rangle
$$

which is exactly the quaternion for the rotation around the axis \(\vec{a}\) by the angle \(\frac{\theta+\phi}{2}\), i.e. the angle halfway between \(\theta\) and \(\phi\).
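Here is a tiny numerical check of the result (a sketch added for illustration, not part of the original derivation): it sums two unit quaternions about the same axis, normalizes the sum, and compares it with the quaternion for the halfway angle.
#include <cmath>
#include <cstdio>

struct quat { double x, y, z, w; };   /* (vector part, scalar part) */

static quat axis_angle (double ax, double ay, double az, double angle) {
    double s = std::sin (angle / 2), c = std::cos (angle / 2);
    return { ax * s, ay * s, az * s, c };
}

int main () {
    const double pi = 3.14159265358979323846;

    /* rotations by 30 and 90 degrees around the z axis */
    quat q = axis_angle (0, 0, 1, pi / 6);
    quat p = axis_angle (0, 0, 1, pi / 2);

    /* sum the quaternions and normalize the result */
    quat s = { q.x + p.x, q.y + p.y, q.z + p.z, q.w + p.w };
    double n = std::sqrt (s.x*s.x + s.y*s.y + s.z*s.z + s.w*s.w);
    s = { s.x / n, s.y / n, s.z / n, s.w / n };

    /* the quaternion for the halfway angle, 60 degrees */
    quat h = axis_angle (0, 0, 1, (pi / 6 + pi / 2) / 2);

    std::printf ("%f %f %f %f\n", s.x, s.y, s.z, s.w);   /* the two lines should match */
    std::printf ("%f %f %f %f\n", h.x, h.y, h.z, h.w);
}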

The cost of the false sharing/cache bounce

I wrote this initially in 2007 or so and then accidentally deleted it a few weeks ago. Thanks to Google cache, though, I now have the chance to see if anything changed in the past six years and update the post.

Back in 2007


Not many people seem to realize just how expensive memory sharing and cache bouncing are in parallel systems. The following code is intended as a simple demonstration/benchmark of the bounce costs.

The main thread spawns NTHREADS worker threads. Each worker thread proceeds to independently increment a distinct memory location, NLOOPS times. In the first variant, the memory locations are in different cache lines. In the second variant (obtained by compiling with -DBOUNCE), the memory locations are adjacent and in the same cache line.
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <pthread.h>

#ifndef NTHREADS
#define NTHREADS 8
#endif
#define NLOOPS 1000000000

static void *
worker (void *_cnt)
{
  uint64_t i;
  volatile uint64_t *cnt = _cnt;

  *cnt = 0;

  for (i = 0; i < NLOOPS; ++i)
    (*cnt)++;

  return 0;
}

// No more than 8 threads !!!
// Each row is 128 * 8 = 1024 bytes, so in the non-BOUNCE variant the
// per-thread counters cnt[i][0] can never share a cache line.
static uint64_t cnt [8][128];

int
main ()
{
  int i;
  uint64_t sum;
  pthread_t t [NTHREADS];

  for (i = 0; i < NTHREADS; ++i)
    {
#ifdef BOUNCE
        pthread_create (&t [i], 0, worker, &cnt [0][i]);
#else
        pthread_create (&t [i], 0, worker, &cnt [i][0]);
#endif
    }

  sum = 0;
  for (i = 0; i < NTHREADS; ++i)
    {
      pthread_join (t [i], 0);
#ifdef BOUNCE
      sum += cnt [0][i];
#else
      sum += cnt [i][0];
#endif
    }

  printf ("%llu\n", sum);
  return 0;
}

I ran the benchmark on GNU/Linux, kernel 2.6.15, on a 4-way (2x dual-core) 2.2GHz Opteron machine (Sun Fire X4200) and obtained the following results:
$ gcc -O3 cache-test.c -lpthread
$ time ./a.out
800000000

real    0m0.648s
user    0m2.396s
sys     0m0.000s
$ time ./a.out
800000000

real    0m0.615s
user    0m2.384s
sys     0m0.004s
$ time ./a.out
800000000

real    0m0.622s
user    0m2.436s
sys     0m0.004s
$ gcc -O3 -DBOUNCE cache-test.c -lpthread
$ time ./a.out
800000000

real    0m9.991s
user    0m29.374s
sys     0m0.008s
$ time ./a.out
800000000

real    0m9.957s
user    0m29.310s
sys     0m0.016s
$ time ./a.out
800000000

real    0m9.974s
user    0m29.330s
sys     0m0.012s
$

Taking the minimum of each set of measurements gives 9.957/0.615 ≈ 16.2; in other words, the cache bounces cause a slowdown of roughly 16x.

Fast forward to 2013

Again, the program was compiled with and without `-DBOUNCE`, but this time I varied the number of threads from 1 to 8. The processor is an Intel Core i7-3820, with 4 physical cores and 2 hyper-threads per core. The data follows:

Threads        1      2      3      4      5      6      7      8
bounce     1.970  1.992  2.256  2.453  3.767  4.394  5.499  2.794
no bounce  1.963  1.993  2.023  2.027  2.071  2.076  2.101  2.177


The numbers are not really surprising: the cache bounces cripple the otherwise perfect linear scalability.

Except for the last data point! With all 8 hyper-threads working, the performance is back in line. What's going on here?

For now, I have only one explanation.

A wizard did it.

Fast SSE quaternion multiplication

The other day I needed some quaternion multiplication code for a certain pet project of mine. A quick Google search later, I came upon this one at StackOverflow: How to multiply two quaternions with minimal instructions?
#include <pmmintrin.h> /* SSE3 intrinsics */

/* multiplication of two quaternions (x, y, z, w) x (a, b, c, d) */

__m128 _mm_cross4_ps(__m128 xyzw, __m128 abcd)
{
    /* The product of two quaternions is:                                 */
    /* (X,Y,Z,W) = (xd+yc-zb+wa, -xc+yd+za+wb, xb-ya+zd+wc, -xa-yb-zc+wd) */

    __m128 wzyx = _mm_shuffle_ps(xyzw, xyzw, _MM_SHUFFLE(0,1,2,3));
    __m128 baba = _mm_shuffle_ps(abcd, abcd, _MM_SHUFFLE(0,1,0,1));
    __m128 dcdc = _mm_shuffle_ps(abcd, abcd, _MM_SHUFFLE(2,3,2,3));

    /* variable names below are for parts of components of result (X,Y,Z,W) */
    /* nX stands for -X and similarly for the other components              */

    /* znxwy  = (xb - ya, zb - wa, wd - zc, yd - xc) */
    __m128 ZnXWY = _mm_hsub_ps(_mm_mul_ps(xyzw, baba), _mm_mul_ps(wzyx, dcdc));

    /* xzynw  = (xd + yc, zd + wc, wb + za, yb + xa) */
    __m128 XZYnW = _mm_hadd_ps(_mm_mul_ps(xyzw, dcdc), _mm_mul_ps(wzyx, baba));

    /* _mm_shuffle_ps(XZYnW, ZnXWY, _MM_SHUFFLE(3,2,1,0)) */
    /*      = (xd + yc, zd + wc, wd - zc, yd - xc)        */
    /* _mm_shuffle_ps(ZnXWY, XZYnW, _MM_SHUFFLE(2,3,0,1)) */
    /*      = (zb - wa, xb - ya, yb + xa, wb + za)        */

    /* _mm_addsub_ps adds elements 1 and 3 and subtracts elements 0 and 2, so we get: */
    /* _mm_addsub_ps(*, *) = (xd+yc-zb+wa, xb-ya+zd+wc, wd-zc+yb+xa, yd-xc+wb+za)     */

    __m128 XZWY = _mm_addsub_ps(_mm_shuffle_ps(XZYnW, ZnXWY, _MM_SHUFFLE(3,2,1,0)),
                                _mm_shuffle_ps(ZnXWY, XZYnW, _MM_SHUFFLE(2,3,0,1)));

    /* now we only need to shuffle the components in place and return the result      */
    return _mm_shuffle_ps(XZWY, XZWY, _MM_SHUFFLE(2,1,3,0));

    /* operations: 6 shuffles, 4 multiplications, 3 compound additions/subtractions   */ 
} 

Which is pretty nice, compiling to only 18 instructions. There are some things of note, though:
  • the `_mm_shuffle_ps` intrinsic generates the `shufps` insn, which on recent Intel processors has 1 cycle latency and 1 cycle throughput. When swizzling a single register, though, `_mm_shuffle_epi32` (the `pshufd` insn) seems preferable, because it has one cycle latency but 0.5 cycles reciprocal throughput, which, in my understanding, means that two of these insns can be issued in a single cycle;
  • the `_mm_hadd_ps`/`_mm_hsub_ps` intrinsics are a bit on the heavy side for addition/subtraction, with 5 cycles of latency and 2 cycles of reciprocal throughput, i.e. such an insn can be issued only every other cycle and it takes 5 cycles until its result is available to subsequent insns. Other addition/subtraction insns (`addps`, `addsubps`, etc.) take 3 cycles with 1 cycle of throughput.
And I got busy coding, having a nice ballpark figure (18 insns) to aim for. In the end, I came up with the following code:

#include <immintrin.h>

#define _mm_pshufd(r,i) _mm_castsi128_ps (_mm_shuffle_epi32 (_mm_castps_si128 (r), i))

/* Nehalem/Westmere/SandyBridge/IvyBridge insn timings (latency, reciprocal throughput) in the comments.  */
__m128
qmul (__m128 abcd, __m128 xyzw)
{
  __m128 t0 = _mm_pshufd (abcd, _MM_SHUFFLE (3, 3, 3, 3)); /* 1, 0.5 */
  __m128 t1 = _mm_pshufd (xyzw, _MM_SHUFFLE (2, 3, 0, 1)); /* 1, 0.5 */

  __m128 t3 = _mm_pshufd (abcd, _MM_SHUFFLE (0, 0, 0, 0)); /* 1, 0.5 */
  __m128 t4 = _mm_pshufd (xyzw, _MM_SHUFFLE (1, 0, 3, 2)); /* 1, 0.5 */

  __m128 t5 = _mm_pshufd (abcd, _MM_SHUFFLE (1, 1, 1, 1)); /* 1, 0.5 */
  __m128 t6 = _mm_pshufd (xyzw, _MM_SHUFFLE (2, 0, 3, 1)); /* 1, 0.5 */

  /* [d,d,d,d]*[z,w,x,y] = [dz,dw,dx,dy] */
  __m128 m0 = _mm_mul_ps (t0, t1); /* 5/4, 1 */

  /* [a,a,a,a]*[y,x,w,z] = [ay,ax,aw,az]*/
  __m128 m1 = _mm_mul_ps (t3, t4); /* 5/4, 1 */

  /* [b,b,b,b]*[z,x,w,y] = [bz,bx,bw,by]*/
  __m128 m2 = _mm_mul_ps (t5, t6); /* 5/4, 1 */

  /* [c,c,c,c]*[w,z,x,y] = [cw,cz,cx,cy] */
  __m128 t7 = _mm_pshufd (abcd, _MM_SHUFFLE (2, 2, 2, 2)); /* 1, 0.5 */
  __m128 t8 = _mm_pshufd (xyzw, _MM_SHUFFLE (3, 2, 0, 1)); /* 1, 0.5 */

  __m128 m3 = _mm_mul_ps (t7, t8); /* 5/4, 1 */

  /* 1 */
  /* [dz,dw,dx,dy]+-[ay,ax,aw,az] = [dz+ay,dw-ax,dx+aw,dy-az] */
  __m128 e = _mm_addsub_ps (m0, m1); /* 3, 1 */

  /* 2 */
  /* [dx+aw,dz+ay,dy-az,dw-ax] */
  e = _mm_pshufd (e, _MM_SHUFFLE (1, 3, 0, 2)); /* 1, 0.5 */

  /* [dx+aw,dz+ay,dy-az,dw-ax]+-[bz,bx,bw,by] = [dx+aw+bz,dz+ay-bx,dy-az+bw,dw-ax-by]*/
  e = _mm_addsub_ps (e, m2); /* 3, 1 */

  /* 2 */
  /* [dz+ay-bx,dw-ax-by,dy-az+bw,dx+aw+bz] */
  e = _mm_pshufd (e, _MM_SHUFFLE (2, 0, 1, 3)); /* 1, 0.5 */

  /* [dz+ay-bx,dw-ax-by,dy-az+bw,dx+aw+bz]+-[cw,cz,cx,cy]
     = [dz+ay-bx+cw,dw-ax-by-cz,dy-az+bw+cx,dx+aw+bz-cy] */
  e = _mm_addsub_ps (e, m3); /* 3, 1 */

  /* 2 */
  /* [dw-ax-by-cz,dz+ay-bx+cw,dy-az+bw+cx,dx+aw+bz-cy] */
  e = _mm_pshufd (e, _MM_SHUFFLE (2, 3, 1, 0)); /* 1, 0.5 */
  return e;
} 

Also 18 insns, and, according to my rough calculations and timing with `rdtsc`, about 18 cycles per quaternion multiplication. In the meantime, I also tweaked the other code a bit, replacing `shufps` with `pshufd` where possible, reducing it to 16 insns. I'm not entirely sure which version is faster, but the difference seems insignificant for any application of this code.
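For reference, the timing was along these lines (a rough sketch, assuming GCC or clang on x86-64, with `__rdtsc()` coming from `<x86intrin.h>`; no serialization or frequency-scaling precautions, so the number is only a ballpark):
#include <x86intrin.h>
#include <cstdio>

/* qmul() from the listing above is assumed to be in scope */

int main ()
{
  const int N = 1000000;
  __m128 q = _mm_set_ps (1.0f, 0.0f, 0.0f, 0.0f);
  __m128 r = _mm_set1_ps (0.5f);   /* a unit quaternion, so q stays bounded */

  unsigned long long t0 = __rdtsc ();
  for (int i = 0; i < N; ++i)
    q = qmul (q, r);               /* loop-carried dependency keeps the work alive */
  unsigned long long t1 = __rdtsc ();

  float out[4];
  _mm_storeu_ps (out, q);
  std::printf ("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
  std::printf ("~%.1f cycles per multiplication\n", (double) (t1 - t0) / N);
}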

Cheers.