Closing in on the finish here. I think I've achieved all the goals described in my original proposal, and ended up going a little further using the framework to write custom ufuncs. To refresh, the SVN is here:
I cleaned up the framework a little, and extended it to support multiple output arrays. In fact, any combination of inputs and outputs is supported as long as the total is <= 5. 5 is a limitation of the x86_64 register file -- 2 registers are required for each input or output (a pointer and a element 'step' or increment value), and I've managed to clear up 10 registers for this purpose. I could support 10 inputs/outputs if I pushed the step values onto the stack (or chose not to use them); I wouldn't expect anything more than a minor performance hit from this, if any.
I've experimented with a couple more ufuncs. Wanting to try conditionals, I implemented a 'greater' ufunc and ran into an issue -- I don't know of a good way to return a boolean array (Python boolean objects) from a CorePy-based ufuncs; all I have is the native integer/float types. If 1/0 integer values were suitable as truth values, the conditional ufuncs would be no problem.
Moving along, I tested the multiple-output support by writing an 'addsub' ufunc. This ufunc takes two arrays as inputs, and outputs two arrays. The first output array contains the sum of each pair of elements, while the second output array contains the difference. Nothing fancy there, just a simple test to make sure my code did what I wanted.
Mark Jeffree suggested vector normalization as a good example for a custom ufunc:
L = 1.0/sqrt(x**2 + y**2)
X = x * L; Y = y * L
This ended up pretty fun to work with, as it is simple, but just involved enough to allow playing around with various optimization tricks. What I ended up doing was using the AMD optimization manual's suggestion of using the reciprical-squareroot instruction combined with a single iteration of newton-raphson to get accuracy slightly less than IEEE, but with better performance than the sqrt instruction. It's kind of funny how a sequence of ~10 instructions is faster than one single instruction! Another advantage is that the longer instruction sequence pipelines much better when unrolling loop iterations. The sqrt instruction has a latency of something like 18 cycles (might be wrong, that's off the top of my head), and a new sqrt instruction can only be issued every 15 cycles. On the other hand, reciprocal-squareroot with an iteration of newton-raphson uses a series of instructions with 4 cycles or less of latency and all can be issued every cycle.
The approach the AMD optimization manual described combined recprocal-squareroot with newton-raphson to get the squareroot as the result (they describe a similar approach for division). I actually wanted the reciprocal of the square root to avoid the division of each component by the length, and was able to remove one multiplication from the newton-raphson. For some reason SSE only has the reciprocal-squareroot and reciprocal instructions for 32-bit floats, not 64-bit floats. So the 64-bit float implementation was pretty boring; I just used sqrt and division instructions.
I just spent a while writing a README and fairly detailed documentation of the gen_ufunc() and create_ufunc() functions that make up my framework. Hopefully things are clear enough that others can use this work. This more or less wraps up my project! Over the next week I'm going to try my hand at implementing a custom ufunc to do the entire computation for a particle simulation demo Chris Mueller (my mentor) did. I'm also attending SciPy next week; I will be giving a talk on recent CorePy work on thursday I believe. Since the talks got cut down to 10 minutes instead of 35, I'll barely have time to highly various things we have going on, spending the most time on what's most relevant to this audience -- my GSoC project.
Expect one more post next week with an update on the particle system and SciPy.