Okay, two weeks worth of updates here. I've been mostly (still) ironing out various bugs and corner cases.
First off, I ran into some problems with add/multiply reduction over integer types -- in this case only, NumPy converts the input array to the native integer word size (64bit on x86_64) for the reduction. Rather than redescribing the whole issue (including its solution), I'll just link to the numpy-discuss thread:
Last time I mentioned having problems with single-element arrays. I finally just sat down and re-thought out the exact condition in which a reduction occurs, and wrote a code sequence to detect that case (and only that case). These sorts of problems should be put to rest now.
However something I'm going to have to deal with is the possibility of array elements not being contiguous -- for example when performing operations over non-contiguous dimensions of a multi-dimensional array. This should be fairly simple -- compare the step to the size of the basic datatype. If they match, we can go crazy with SIMD, otherwise we need to fall back to an element-wise case or perhaps some code to shuffle disjoint elements together into a single SIMD, and back out again. I think the shuffling will only make sense for operations that are computationally exepensive, and aren't bandwidth-limited.
I've added a multiply ufunc, supporting int32/int64/float32/float64. There doesn't seem to be any suitable SIMD integer multiply instructions, so I had to fall back to using single-element general-purpose multiplication there. The float types are SIMD-ized though, and were actually an easy translation from the add ufuncs. On the integer side, I hit a well-hidden bug in CorePy. Turns out that the operands for the two-register imul instruction are encoded backwards compared to the normal operand encoding order on x86. This took a while to find and fix, as there was no error, just registers with unexpected values.
Back to floats, I did have to deal with initializing an identity value into my 'accumulator' register for multiplication reductions. Before I had just hardcoded the identity to zero (using a shorthand: xor reg, reg). Clearly this won't work for multiply though :) For the floating point types, I used a little hack, since there is no direct way to load immediate values into SSE registers. I figured out the integer binary/hex representation of a 1.0 as a float32 and float64 type, and loaded that as an immediate into a general purpose (GP) register. Then I used movd/movq to copy an value from the GP register to the XMM (SSE) register. From there I just treat it like a float, and everything works. Tricks like this are really nice on architectures that use the same registers for integer and float types. Anyway, I parameterized the loading of the identity value by requiring a function be passed in to the ufunc code generator that generates the identity loading code.
The general ufunc code generation framework is coming along. The general element-wise operation case is pretty good I think; I've simplified/generalized it a little since last post. The reduction code I think is complicated, but necessarily so. In particular the final atomic update is kind of ugly, but I don't have anything better right now. At this point any two operand ufunc should be very easy to implement using the framework.
Next along the lines of the framework is to revisit single-operand functions like cos(). I need to take a stab at generalizing all single-operand functions, then look at what that has in common with the generic two-operand code and again see what I can generalize out into a framework.
The address alignment problem with multiple threads/cores remains; I'll look at that depending on how things go with the framework progress.