Sunday, July 19, 2009

Another two weeks' worth of stuff included in this post.

I implemented support for the int8 and int16 types for add/multiply. The framework had to be extended somewhat. Sounds like no big deal, but the 8-bit stuff uncovered some bugs that took a while to fix -- mainly because I'm unrolling 16/32 elements' worth of operations in the main loop, which is more than the 4/8 I had been doing until now.
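For the curious, here's the shape of that loop in plain Python terms. The real code emits SIMD instructions via CorePy; this sketch is just my illustration (the name add_arrays and the unroll parameter are made up), but the boundary math where this kind of bug hides looks like this:

```python
# Sketch of the unrolled-loop structure in plain Python. The inner
# for-loop stands in for one unrolled batch of SIMD operations (16
# elements for int8, etc.), and the tail loop handles the leftovers.
def add_arrays(a, b, out, unroll=16):
    n = len(a)
    main = n - n % unroll            # largest multiple of `unroll` <= n
    i = 0
    while i < main:
        for j in range(unroll):      # one unrolled batch
            out[i + j] = a[i + j] + b[i + j]
        i += unroll
    while i < n:                     # element-wise tail
        out[i] = a[i] + b[i]
        i += 1
```

The bigger the unroll factor, the bigger the tail can be, which is why the 8/16-bit types stressed this code more than the 4/8-element unrolls did.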

Finally went and reworked the multicore support to do things properly. I split the work into double-quadword (16-byte) blocks. This is the size of a single SIMD vector on x86, and it is important to work on 16-byte-aligned data for better performance. So what I wanted to do was make sure each thread's sub-range was 16-byte aligned, regardless of datatype and total array length. Getting all the math right was pretty tricky! I think (hope :)) I have it all correct now. The last thread tends to get extra work when the array length doesn't divide evenly into a whole number of 16-byte blocks per thread -- I think this is OK though; the extra work is never more than num_threads - 1 16-byte blocks. How many elements that is depends on the size of the data type -- a 16-byte block contains only two int64/float64s, but 16 int8s.
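A rough sketch of the partitioning math, assuming the array itself starts on a 16-byte boundary. The function and variable names here are mine, not the actual code:

```python
# Split `length` elements of `itemsize` bytes across `num_threads`
# threads so that every range boundary falls on a 16-byte multiple;
# the last thread absorbs the non-block-aligned tail.
def split_ranges(length, itemsize, num_threads, block=16):
    elems_per_block = block // itemsize        # 16 for int8, 2 for float64
    total_blocks = length // elems_per_block   # whole SIMD blocks available
    blocks_per_thread = total_blocks // num_threads

    ranges = []
    start = 0
    for t in range(num_threads - 1):
        stop = start + blocks_per_thread * elems_per_block
        ranges.append((start, stop))
        start = stop
    ranges.append((start, length))             # last thread: blocks + tail
    return ranges
```

For example, 100 float64s over 4 threads gives the first three threads 24 elements (12 blocks) each, and the last thread 28 elements -- its extra 4 elements are the 2 leftover blocks plus the tail.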

Something I've also wanted to do for a while was set up persistent threads and farm work out to them, instead of creating/joining threads on every ufunc call. This reduces the overhead of using multiple threads in a ufunc, and the difference is visible at smaller array sizes (< ~10-100,000 elements). However, I'm not sure this is worthwhile, as multi-thread performance at these smaller array lengths is still worse than just using a single core. I think I need to investigate the overheads more.
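In spirit, the persistent-thread setup is just a pool of workers blocking on a task queue. This sketch is my own illustration of the idea, not the project's code:

```python
# Persistent worker threads: created once, then fed work through a
# queue, avoiding per-call thread create/join overhead.
import threading
import queue

class WorkerPool:
    def __init__(self, num_threads):
        self.tasks = queue.Queue()
        self.done = threading.Semaphore(0)
        for _ in range(num_threads):
            t = threading.Thread(target=self._run, daemon=True)
            t.start()

    def _run(self):
        while True:
            fn, args = self.tasks.get()   # block until work arrives
            fn(*args)
            self.done.release()           # signal one sub-range finished

    def map(self, fn, arg_list):
        """Farm one task per sub-range out to the pool, then wait."""
        for args in arg_list:
            self.tasks.put((fn, args))
        for _ in arg_list:
            self.done.acquire()
```

Even with the threads pre-created, each call still pays for the queue hand-off and wake-up latency, which may be where the remaining small-array overhead lives.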

I have also generalized the framework to support single-input ufuncs. This was actually easier than I expected, and in the process I ended up doing most of the work needed to support more than 2 inputs as well. All that remains for >2-input support is register assignment (easy) and sorting out whether reduction makes sense, and how it should be done (probably not too hard).

I revived the cosine ufunc, which had bit-rotted somewhat, and updated it to use the new single-input support in the framework. My testing harness reports the cosine ufunc as failing, but hand inspection shows the results are correct if you tolerate some rounding error. More investigation needed here.

Side note -- I discovered AMD's Software Optimization Guide for AMD64 Processors -- looks like it has some good information in it, including actual instruction latencies! Find it here:

Sunday, July 5, 2009

Okay, two weeks' worth of updates here. I've been mostly (still) ironing out various bugs and corner cases.

First off, I ran into some problems with add/multiply reduction over integer types -- in this case only, NumPy converts the input array to the native integer word size (64bit on x86_64) for the reduction. Rather than redescribing the whole issue (including its solution), I'll just link to the numpy-discuss thread:

Last time I mentioned having problems with single-element arrays. I finally just sat down and thought through the exact condition under which a reduction occurs, and wrote a code sequence to detect that case (and only that case). These sorts of problems should be put to rest now.

However, something I'm going to have to deal with is the possibility of array elements not being contiguous -- for example, when performing operations over non-contiguous dimensions of a multi-dimensional array. Detecting this should be fairly simple -- compare the stride (step) to the size of the basic datatype. If they match, we can go crazy with SIMD; otherwise we need to fall back to an element-wise case, or perhaps some code to shuffle disjoint elements together into a single SIMD register and back out again. I think the shuffling will only make sense for operations that are computationally expensive and aren't bandwidth-limited.
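The detection itself is the easy part -- in NumPy terms it's a one-line stride check (is_simd_friendly is a made-up name for illustration):

```python
# If the stride along the axis being operated on equals the item size,
# elements are adjacent in memory and can be loaded straight into SIMD
# registers; otherwise they are spread out and need element-wise code.
import numpy as np

def is_simd_friendly(arr, axis=-1):
    return arr.strides[axis] == arr.itemsize

a = np.zeros((4, 4), dtype=np.float32)
print(is_simd_friendly(a, axis=1))   # -> True  (rows are contiguous)
print(is_simd_friendly(a, axis=0))   # -> False (stride is 16 bytes, not 4)
```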

I've added a multiply ufunc, supporting int32/int64/float32/float64. There don't seem to be any suitable SIMD integer multiply instructions, so I had to fall back to single-element general-purpose multiplication there. The float types are SIMD-ized though, and were actually an easy translation from the add ufuncs. On the integer side, I hit a well-hidden bug in CorePy. It turns out that the operands for the two-register imul instruction are encoded backwards compared to the normal operand encoding order on x86. This took a while to find and fix, as there was no error, just registers with unexpected values.

Back to floats: I did have to deal with initializing an identity value into my 'accumulator' register for multiplication reductions. Before, I had just hardcoded the identity to zero (using the shorthand xor reg, reg). Clearly this won't work for multiply though :) For the floating point types, I used a little hack, since there is no direct way to load immediate values into SSE registers. I figured out the integer binary/hex representation of 1.0 as a float32 and as a float64, and loaded that as an immediate into a general-purpose (GP) register. Then I used movd/movq to copy the value from the GP register to the XMM (SSE) register. From there I just treat it like a float, and everything works. Tricks like this are really nice on architectures that use the same registers for integer and float types. Anyway, I parameterized the loading of the identity value by requiring that a function be passed to the ufunc code generator to generate the identity-loading code.
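Those bit patterns are easy to recover from Python with the struct module:

```python
# Reinterpret the bytes of 1.0 as an unsigned integer to get the
# immediate value to load into the GP register ('<f'/'<I' are
# little-endian float32/uint32, '<d'/'<Q' are float64/uint64).
import struct

bits32 = struct.unpack('<I', struct.pack('<f', 1.0))[0]
bits64 = struct.unpack('<Q', struct.pack('<d', 1.0))[0]
print(hex(bits32))   # -> 0x3f800000
print(hex(bits64))   # -> 0x3ff0000000000000
```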

The general ufunc code generation framework is coming along. The general element-wise operation case is in pretty good shape, I think; I've simplified/generalized it a little since last post. The reduction code is complicated, I think, but necessarily so. In particular the final atomic update is kind of ugly, but I don't have anything better right now. At this point any two-operand ufunc should be very easy to implement using the framework.
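For reference, the combining step at the end of a multi-threaded reduction looks like this in spirit -- each thread folds its own sub-range into a partial result, then merges that into the shared accumulator. This plain-Python sketch of mine uses a lock where the generated x86 code does its atomic update:

```python
# Per-thread reduction followed by a guarded update of the shared
# accumulator -- a stand-in for the atomic update at the end of the
# generated reduction code.
import threading

lock = threading.Lock()
total = [1]                          # multiplicative identity

def reduce_range(data, start, stop):
    partial = 1
    for x in data[start:stop]:       # thread-local, no sharing needed
        partial *= x
    with lock:                       # the 'atomic update' step
        total[0] *= partial
```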

Next along the lines of the framework is to revisit single-operand functions like cos(). I need to take a stab at generalizing all single-operand functions, then look at what that has in common with the generic two-operand code and again see what I can generalize out into a framework.

The address alignment problem with multiple threads/cores remains; I'll look at that depending on how things go with the framework progress.