Monday, August 31, 2009

This will probably be my last post for the project. I'd like to thank my mentor Chris Mueller and backup mentor Stéfan van der Walt again; I think they did a lot to help make this project happen in the first place.

I've written a bunch of documentation on the CorePy wiki (a lot of it came from the code/README):

http://corepy.org/wiki/index.php?title=CoreFunc

When any updates to this work occur, the wiki page will be updated.

The SciPy talk went pretty well, I think, though I wish I had more than 10 minutes to talk. A lot of important things got left out, and I felt there were places where I didn't explain things very well. Even so, it was great to be able to go to SciPy and give a talk -- thanks to the people who make SciPy happen! There seemed to be quite a few people interested in CorePy; I got a number of questions and had some nice discussions with various people.
Slides:
http://www.osl.iu.edu/~afriedle/presentations/scipy-corepy-afriedle.pdf

Video recording of my actual talk:
http://www.archive.org/details/scipy09_day1_08-Andrew_Friedley

As promised I used the CoreFunc framework to update Chris' particle simulation demo. Pretty cool stuff -- I was able to move the entire particle position/velocity update calculation into a ufunc. At 200,000 particles I was seeing ~100x speedup over the pure NumPy version, though my measurement wasn't particularly scientific. Check out the glparticle-cf.py file in the project repository/branch (listed in an earlier post).

Sunday, August 16, 2009

Closing in on the finish here. I think I've achieved all the goals described in my original proposal, and ended up going a little further using the framework to write custom ufuncs. To refresh, the SVN is here:

https://svn.osl.iu.edu/trac/corepy/browser/branches/afriedle-gsoc-ufunc

I cleaned up the framework a little, and extended it to support multiple output arrays. In fact, any combination of inputs and outputs is supported as long as the total is <= 5. That limit comes from the x86_64 register file -- 2 registers are required for each input or output (a pointer and an element 'step', or increment, value), and I've managed to free up 10 registers for this purpose. I could support 10 inputs/outputs if I pushed the step values onto the stack (or chose not to use them); I wouldn't expect anything more than a minor performance hit from this, if any.

I've experimented with a couple more ufuncs. Wanting to try conditionals, I implemented a 'greater' ufunc and ran into an issue -- I don't know of a good way to return a boolean array (of Python boolean objects) from a CorePy-based ufunc; all I have are the native integer/float types. If 1/0 integer values were suitable as truth values, the conditional ufuncs would be no problem.
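To illustrate the gap: NumPy's own comparison ufuncs hand back arrays of dtype bool, while a machine-level kernel naturally produces 1/0 integers. A small pure-NumPy sketch of the two result types (not the CorePy code):

```python
import numpy as np

a = np.array([3, 1, 4], dtype=np.int32)
b = np.array([2, 2, 4], dtype=np.int32)

# NumPy's built-in 'greater' ufunc returns a bool-dtype array...
mask = np.greater(a, b)

# ...while a raw machine kernel would naturally emit 1/0 integers:
ints = (a > b).astype(np.int32)

print(mask, mask.dtype)  # [ True False False] bool
print(ints)              # [1 0 0]
```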

Moving along, I tested the multiple-output support by writing an 'addsub' ufunc. This ufunc takes two arrays as inputs, and outputs two arrays. The first output array contains the sum of each pair of elements, while the second output array contains the difference. Nothing fancy there, just a simple test to make sure my code did what I wanted.
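In plain NumPy the same semantics look like this -- a reference sketch of the test, with my own naming, not the framework's generated code:

```python
import numpy as np

def addsub(a, b):
    # Two inputs, two outputs: elementwise sum and elementwise difference.
    return a + b, a - b

s, d = addsub(np.array([1.0, 2.0]), np.array([0.5, 4.0]))
# s holds the sums, d holds the differences
```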

Mark Jeffree suggested vector normalization as a good example for a custom ufunc:

L = 1.0/sqrt(x**2 + y**2)
X = x * L; Y = y * L
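As a plain-NumPy reference for those two lines (my naming, not the framework's API), the whole ufunc is just:

```python
import numpy as np

def normalize2d(x, y):
    # Two inputs (vector components), two outputs (normalized components).
    L = 1.0 / np.sqrt(x**2 + y**2)
    return x * L, y * L

X, Y = normalize2d(np.array([3.0]), np.array([4.0]))
# X*X + Y*Y is ~1.0 for each element
```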

This ended up being pretty fun to work with, as it is simple, but just involved enough to allow playing around with various optimization tricks. What I ended up doing was following the AMD optimization manual's suggestion of using the reciprocal square root instruction combined with a single iteration of Newton-Raphson to get accuracy slightly less than IEEE, but with better performance than the sqrt instruction. It's kind of funny how a sequence of ~10 instructions is faster than one single instruction! Another advantage is that the longer instruction sequence pipelines much better when unrolling loop iterations. The sqrt instruction has a latency of something like 18 cycles (might be wrong, that's off the top of my head), and a new sqrt instruction can only be issued every 15 cycles. On the other hand, reciprocal square root with an iteration of Newton-Raphson uses a series of instructions that each have 4 cycles or less of latency and can all be issued every cycle.
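The refinement step itself is easy to show in NumPy. This sketch fakes the low-precision estimate by perturbing the exact value (an assumption for illustration; the real code uses the SSE reciprocal-square-root instruction), and one Newton-Raphson step then recovers nearly full float32 accuracy:

```python
import numpy as np

x = np.linspace(0.5, 4.0, 8, dtype=np.float32)

# Stand-in for the hardware estimate: a deliberately imprecise 1/sqrt(x).
y0 = (np.float32(1.0) / np.sqrt(x)) * np.float32(1.0005)

# One Newton-Raphson step for f(y) = 1/y**2 - x:
#   y1 = y0 * (1.5 - 0.5 * x * y0**2)
y1 = y0 * (np.float32(1.5) - np.float32(0.5) * x * y0 * y0)

exact = 1.0 / np.sqrt(x.astype(np.float64))
err0 = np.max(np.abs(y0 / exact - 1.0))  # error before refinement
err1 = np.max(np.abs(y1 / exact - 1.0))  # error after one NR step
```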

The approach the AMD optimization manual described combined reciprocal square root with Newton-Raphson to get the square root as the result (they describe a similar approach for division). I actually wanted the reciprocal of the square root, to avoid dividing each component by the length, and was able to remove one multiplication from the Newton-Raphson step. For some reason SSE only has the reciprocal square root and reciprocal instructions for 32-bit floats, not 64-bit floats. So the 64-bit float implementation was pretty boring; I just used the sqrt and division instructions.

I just spent a while writing a README and fairly detailed documentation of the gen_ufunc() and create_ufunc() functions that make up my framework. Hopefully things are clear enough that others can use this work. This more or less wraps up my project! Over the next week I'm going to try my hand at implementing a custom ufunc that does the entire computation for a particle simulation demo Chris Mueller (my mentor) did. I'm also attending SciPy next week; I will be giving a talk on recent CorePy work on Thursday, I believe. Since the talks got cut down to 10 minutes from 35, I'll barely have time to highlight the various things we have going on, spending the most time on what's most relevant to this audience -- my GSoC project.

Expect one more post next week with an update on the particle system and SciPy.

Monday, August 3, 2009

While messing around with my cosine ufunc, I realized that NumPy's performance for cosine over float32 arrays is horrible, while float64 performance is more normal:


cfcos type 'numpy.float32'
inp sizes       1024    10240    102400    1024000    3072000
numpy         0.7241   9.3472  115.6120   995.1050  3027.2000
corepy 1      0.0191   0.1521    1.6742    16.3748    50.1912
corepy 2      0.0160   0.0811    0.8180     8.2700    25.5480
corepy 4      0.0219   0.0889    0.8180     4.1308    12.9910

cfcos type 'numpy.float64'
inp sizes       1024    10240    102400    1024000    3072000
numpy         0.0970   0.9248    9.3892    93.8919   277.4410
corepy 1      0.0370   0.3450    3.4950    35.4860   106.4019
corepy 2      0.0319   0.1729    1.7591    17.8859    53.5469
corepy 4      0.0288   0.0958    0.8941     8.9951    26.8750


I started a thread on numpy-discussion, and it's gotten a bit of attention. :) I'm not sure what causes this right now; I still have some things to try out. See the thread for details:

http://mail.scipy.org/pipermail/numpy-discussion/2009-August/044274.html

The main thing is that my float32 cosine wasn't so fast after all -- NumPy is just really slow, at least on some installations (including mine). Some other users weren't able to reproduce the slowness.

I've implemented a few different ufuncs now, so I figure it's worth summarizing them and the types they work over:

add: int{8,16,32,64}, float{32,64}
multiply: int{8,16,32,64}, float{32,64}
cosine (faster, less accurate version): float{32,64}
cosine (using x87, slower but accurate): float{32,64}
maximum: int{32,64}, float{32,64}

All of these were written using the framework I've developed, of course. As I ran my test code for this post, I found that the x87-based cosine is segfaulting at 4 threads/cores. It wasn't doing this earlier... I'll have to investigate.

Maximum was fun to do, particularly for int32. I spent time iterating on the code, and came up with something pretty creative. I use SSE compare instructions to create a sort of selection mask, then use bit-wise operations to choose the maximum values without a branch. All SIMD, too. Unfortunately I don't have the instructions to do this for int64 (no quadword compares), but I can/will do the same thing for int8/int16. For floats, SSE has a specific instruction for selecting the maximum values from two vectors, so I just used that there. Very easy to take advantage of the hardware. :)
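The trick is easy to mimic in NumPy: the compare produces an all-ones/all-zeros mask per element (like SSE's packed compare instructions do), and AND/OR with that mask picks elements without any branching. A sketch of the idea, not the actual CorePy kernel:

```python
import numpy as np

a = np.array([1, 5, 3], dtype=np.int32)
b = np.array([4, 2, 3], dtype=np.int32)

# -True == -1 == all bits set, mimicking the 0 / 0xFFFFFFFF compare mask.
mask = -(a > b).astype(np.int32)

# Branchless select: keep a where the mask is set, b elsewhere.
result = (a & mask) | (b & ~mask)
print(result)  # [4 5 3]
```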

Future work: I'm planning on doing a couple more ufuncs, particularly one logical comparison and one bitwise (AND, OR) operation. The idea is to build up a collection of different types of ufuncs to show that the framework works and can be used to develop more.

After that I'm going to look at using the framework to build more complex, custom ufuncs. Initially just something like fused multiply-add, and then maybe some entire algorithms (particle simulation?) in a single ufunc. The idea here being that we can use the framework to do any sort of computation, and use it to reduce memory traffic, leverage the CPU functionality, and go faster without much trouble.
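To sketch what I mean by a fused ufunc (semantics only, my naming): a single kernel computing a*b + c touches each array once, instead of materializing the a*b temporary that the equivalent NumPy expression creates:

```python
import numpy as np

def muladd(a, b, c):
    # Desired semantics for a fused multiply-add ufunc. A single custom
    # kernel would compute this in one pass, with no a*b temporary array.
    return a * b + c

r = muladd(np.array([2.0, 3.0]), np.array([4.0, 5.0]), np.array([1.0, 1.0]))
```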