Monday, August 31, 2009

This will probably be my last post for the project. I'd like to thank my mentor Chris Mueller and backup mentor Stéfan van der Walt once more; I think they did a lot to help make this project happen in the first place.

I've written a bunch of documentation on the CorePy wiki (a lot of it came from the code/README):

http://corepy.org/wiki/index.php?title=CoreFunc

Any future updates to this work will be reflected on that wiki page.

The SciPy talk went pretty well, I think, though I wish I had more than 10 minutes to talk. A lot of important things got left out, and I felt there were places I didn't explain things very well. Even so, it was great to be able to go to SciPy and give a talk -- thanks to the people who make SciPy happen! There seemed to be quite a few people interested in CorePy; I fielded a number of questions and had some nice discussions with various people.
Slides:
http://www.osl.iu.edu/~afriedle/presentations/scipy-corepy-afriedle.pdf

Video recording of my actual talk:
http://www.archive.org/details/scipy09_day1_08-Andrew_Friedley

As promised, I used the CoreFunc framework to update Chris' particle simulation demo. Pretty cool stuff -- I was able to move the entire particle position/velocity update calculation into a ufunc. At 200,000 particles I was seeing a ~100x speedup over the pure NumPy version, though my measurement wasn't particularly scientific. Check out the glparticle-cf.py file in the project repository/branch (linked in an earlier post).

Sunday, August 16, 2009

Closing in on the finish here. I think I've achieved all the goals described in my original proposal, and I ended up going a little further by using the framework to write custom ufuncs. To recap, the SVN branch is here:

https://svn.osl.iu.edu/trac/corepy/browser/branches/afriedle-gsoc-ufunc

I cleaned up the framework a little and extended it to support multiple output arrays. In fact, any combination of inputs and outputs is supported, as long as the total is <= 5. The limit of 5 comes from the x86_64 register file -- each input or output requires 2 registers (a pointer and an element 'step', or increment, value), and I've managed to free up 10 registers for this purpose. I could support 10 inputs/outputs if I pushed the step values onto the stack (or chose not to use them); I wouldn't expect more than a minor performance hit from this, if any.

I've experimented with a couple more ufuncs. Wanting to try conditionals, I implemented a 'greater' ufunc and ran into an issue -- I don't know of a good way to return a boolean array (Python boolean objects) from a CorePy-based ufunc; all I have are the native integer/float types. If 1/0 integer values were suitable as truth values, the conditional ufuncs would be no problem.

Moving along, I tested the multiple-output support by writing an 'addsub' ufunc. It takes two arrays as inputs and produces two output arrays: the first contains the sum of each pair of elements, the second the difference. Nothing fancy there, just a simple test to make sure my code did what I wanted.
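
For reference, here's the same computation in plain NumPy (the function name is just illustrative):

import numpy as np

def addsub_reference(a, b):
    # Reference behavior for the 'addsub' ufunc: two inputs, two outputs.
    return a + b, a - b

a = np.arange(8, dtype=np.float32)
b = np.ones(8, dtype=np.float32)
sums, diffs = addsub_reference(a, b)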

Mark Jeffree suggested vector normalization as a good example for a custom ufunc:

L = 1.0/sqrt(x**2 + y**2)
X = x * L; Y = y * L

This ended up being pretty fun to work with: it's simple, but just involved enough to allow playing around with various optimization tricks. What I ended up doing was following the AMD optimization manual's suggestion of using the reciprocal square root instruction combined with a single iteration of Newton-Raphson, which gives accuracy slightly below IEEE but better performance than the sqrt instruction. It's kind of funny how a sequence of ~10 instructions can be faster than one single instruction! Another advantage is that the longer instruction sequence pipelines much better when unrolling loop iterations. The sqrt instruction has a latency of something like 18 cycles (off the top of my head, so I might be wrong), and a new sqrt instruction can only be issued every 15 cycles. On the other hand, reciprocal square root with an iteration of Newton-Raphson uses a series of instructions that each have 4 cycles or less of latency, and all of them can be issued every cycle.
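
For the curious, here's a rough sketch of the math in pure NumPy. The hardware rsqrtps seed can't be reproduced exactly in Python, so the well-known integer bit-hack stands in for the instruction's ~12-bit approximation; the Newton-Raphson step is the same refinement the SSE code performs:

import numpy as np

def rsqrt_nr(x):
    x = np.asarray(x, dtype=np.float32)
    i = x.view(np.int32)
    # Crude initial guess for 1/sqrt(x) (stand-in for the rsqrtps seed)
    y = (np.int32(0x5f3759df) - (i >> 1)).view(np.float32)
    # One Newton-Raphson iteration: y <- y * (1.5 - 0.5 * x * y*y)
    return y * (np.float32(1.5) - np.float32(0.5) * x * y * y)

x = np.array([1.0, 2.0, 4.0, 9.0], dtype=np.float32)
print(rsqrt_nr(x))        # close to 1/sqrt(x), ~0.2% error
print(1.0 / np.sqrt(x))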

The approach the AMD optimization manual describes combines reciprocal square root with Newton-Raphson to produce the square root as the result (they describe a similar approach for division). I actually wanted the reciprocal of the square root, to avoid dividing each component by the length, and was able to remove one multiplication from the Newton-Raphson iteration. For some reason SSE only has the reciprocal square root and reciprocal instructions for 32-bit floats, not 64-bit floats. So the 64-bit float implementation was pretty boring; I just used the sqrt and division instructions.

I just spent a while writing a README and fairly detailed documentation of the gen_ufunc() and create_ufunc() functions that make up my framework. Hopefully things are clear enough that others can use this work. This more or less wraps up my project! Over the next week I'm going to try my hand at implementing a custom ufunc that does the entire computation for a particle simulation demo Chris Mueller (my mentor) wrote. I'm also attending SciPy next week; I'll be giving a talk on recent CorePy work on Thursday, I believe. Since the talks got cut from 35 minutes down to 10, I'll barely have time to highlight the various things we have going on, so I'll spend the most time on what's most relevant to this audience -- my GSoC project.

Expect one more post next week with an update on the particle system and SciPy.

Monday, August 3, 2009

While messing around with my cosine ufunc, I realized that NumPy's performance for cosine over float32 is horrible, while float64 performance is more reasonable (the corepy rows below use 1, 2, and 4 threads):


cfcos <type 'numpy.float32'>
input size      1024     10240    102400   1024000    3072000
numpy         0.7241    9.3472  115.6120  995.1050  3027.2000
corepy 1      0.0191    0.1521    1.6742   16.3748    50.1912
corepy 2      0.0160    0.0811    0.8180    8.2700    25.5480
corepy 4      0.0219    0.0889    0.8180    4.1308    12.9910

cfcos <type 'numpy.float64'>
input size      1024     10240    102400   1024000    3072000
numpy         0.0970    0.9248    9.3892   93.8919   277.4410
corepy 1      0.0370    0.3450    3.4950   35.4860   106.4019
corepy 2      0.0319    0.1729    1.7591   17.8859    53.5469
corepy 4      0.0288    0.0958    0.8941    8.9951    26.8750


I started a thread on numpy-discussion, and it's gotten a bit of attention. :) I'm not sure what causes this yet; I still have some things to try out. See the thread for details:

http://mail.scipy.org/pipermail/numpy-discussion/2009-August/044274.html

The main thing is that my float32 cosine wasn't so fast after all -- NumPy is just really slow, at least on some installations (including mine). Some other users weren't able to reproduce the slowness.

I've implemented a few different ufuncs now, so I figure it's worth summarizing them and the types they work over:

add: int{8,16,32,64}, float{32,64}
multiply: int{8,16,32,64}, float{32,64}
cosine (faster, less accurate version): float{32,64}
cosine (using x87; slower but accurate): float{32,64}
maximum: int{32,64}, float{32,64}

All of these were written using the framework I've developed, of course. As I ran my test code for this post, I found that the x87-based cosine is segfaulting at 4 threads/cores. It wasn't doing this earlier... I'll have to investigate.

Maximum was fun to do, particularly for int32. I spent some time iterating on the code and came up with something pretty creative: I use SSE compare instructions to build a selection mask, then use bit-wise operations to choose the maximum values without a branch -- all SIMD, too. Unfortunately I don't have the instructions to do this for int64 (no quadword compares), but I can (and will) do the same thing for int8/int16. For the float types, SSE has a specific instruction for selecting the maximum values from two vectors, so I just used that. Very easy to take advantage of the hardware. :)
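
Here's the trick emulated with NumPy integer arrays (the real implementation is SSE assembly; pcmpgtd produces all-ones lanes where the comparison holds, and pand/pandn/por do the blending):

import numpy as np

def branchless_max_i32(a, b):
    mask = np.where(a > b, np.int32(-1), np.int32(0))  # stand-in for pcmpgtd
    return (a & mask) | (b & ~mask)                    # pand / pandn / por

a = np.array([3, -7, 42, 0], dtype=np.int32)
b = np.array([5, -9, 41, 0], dtype=np.int32)
print(branchless_max_i32(a, b))   # [ 5 -7 42  0]
print(np.maximum(a, b))           # same result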

Future work: I'm planning on doing a couple more ufuncs, particularly one logical comparison and one bitwise (AND, OR) operation. The idea is to build up a collection of different types of ufuncs to show that the framework works and can be used to develop more.

After that I'm going to look at using the framework to build more complex, custom ufuncs -- initially just something like fused multiply-add, and then maybe some entire algorithms (particle simulation?) in a single ufunc. The idea is that we can use the framework to do any sort of computation, reducing memory traffic and leveraging the CPU's functionality to go faster without much trouble.

Sunday, July 19, 2009

Another two weeks' worth of stuff in this post.

I implemented support for the int8 and int16 types for add/multiply. The framework had to be extended somewhat. That sounds like no big deal, but the 8-bit support uncovered some bugs that took a while to fix -- mainly because I'm now unrolling 16/32 elements worth of operations in the main loop, which is more than the 4/8 I had been doing until now.

I finally went and reworked the multicore support to do things properly. I split the work into double-quadword (16-byte) sized blocks. This is the size of a single SIMD vector on x86, and working on 16-byte aligned data is important for performance. So what I wanted was to make sure each thread's sub-range is 16-byte aligned regardless of the datatype and total array length. Getting all the math right was pretty tricky! I think (hope :)) I've got it all correct now. The last thread tends to get extra work when the number of threads doesn't evenly divide the array length (rounded down to whole 16-byte blocks) -- I think this is OK though; the extra work is never more than num_threads - 1 16-byte blocks. How many elements that is depends on the size of the data type: a 16-byte block contains only two int64/float64s, but 16 int8s.
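
Here's a pure-Python sketch of the splitting math (illustrative only, not the framework's actual code):

def split_aligned(n_elems, itemsize, n_threads, block=16):
    per_block = block // itemsize            # elements per 16-byte block
    n_blocks = n_elems // per_block          # whole blocks available
    blocks_per_thread = n_blocks // n_threads
    ranges = []
    for t in range(n_threads):
        start = t * blocks_per_thread * per_block
        # The last thread absorbs the remainder: at most n_threads - 1
        # whole blocks, plus any partial block at the end of the array.
        end = n_elems if t == n_threads - 1 else (t + 1) * blocks_per_thread * per_block
        ranges.append((start, end))
    return ranges

print(split_aligned(1000, 4, 4))   # int32: [(0, 248), (248, 496), (496, 744), (744, 1000)]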

Something else I've wanted to do for a while was to set up persistent threads and farm work out to them, instead of creating/joining threads on every ufunc call. This reduces the overhead of using multiple threads in a ufunc, which is visible for smaller array sizes (below roughly 10,000-100,000 elements). However, I'm not sure this is worthwhile yet, as the multi-thread performance at these smaller array lengths is still worse than just using a single core. I need to investigate the overheads more.
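
A minimal sketch of the idea in pure Python (the project's actual threading happens at the C level, so names here are purely illustrative). Workers are created once and pull (function, args) items off a queue:

import threading, queue   # the module is named 'Queue' in Python 2

class WorkerPool(object):
    def __init__(self, n_threads):
        self.tasks = queue.Queue()
        self.done = threading.Semaphore(0)
        for _ in range(n_threads):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:                  # persistent: loop forever pulling work
            fn, args = self.tasks.get()
            fn(*args)
            self.done.release()

    def map(self, fn, arg_list):
        for args in arg_list:
            self.tasks.put((fn, args))
        for _ in arg_list:           # block until every item completes
            self.done.acquire()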

I have also generalized the framework to support single-input ufuncs. This was actually easier than I expected, and it ended up covering most of the work needed to support more than two inputs as well. All that remains for >2 input support is register assignment (easy) and sorting out whether reduction makes sense there, and how it should be done (probably not too hard).

I revived the cosine ufunc, which had bit-rotted a little, and updated it to use the new single-input support in the framework. My testing harness fails the cosine ufunc, but hand inspection shows the results are correct if you tolerate some rounding error. More investigation is needed here.
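
One obvious fix is to have the harness compare against NumPy with a tolerance rather than exact equality, along these lines (the call syntax for the generated ufunc is assumed):

import numpy as np

x = np.linspace(0.0, 2 * np.pi, 1000).astype(np.float32)
expected = np.cos(x)
result = cfcos(x)   # the CorePy-generated cosine ufunc
# Exact comparison fails on rounding differences; use a tolerance instead.
assert np.allclose(result, expected, rtol=1e-5, atol=1e-7)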

Side note -- I discovered AMD's Software Optimization Guide for AMD64 Processors -- looks like it has some good information in it, including actual instruction latencies! Find it here:

http://developer.amd.com/documentation/guides/Pages/default.aspx

Sunday, July 5, 2009

Okay, two weeks' worth of updates here. I've been mostly (still) ironing out various bugs and corner cases.

First off, I ran into some problems with add/multiply reduction over integer types -- in this case only, NumPy converts the input array to the native integer word size (64-bit on x86_64) for the reduction. Rather than re-describing the whole issue (including its solution), I'll just link to the numpy-discussion threads:

http://mail.scipy.org/pipermail/numpy-discussion/2009-June/043659.html
http://mail.scipy.org/pipermail/numpy-discussion/2009-July/043749.html

Last time I mentioned having problems with single-element arrays. I finally just sat down, rethought exactly the conditions under which a reduction occurs, and wrote a code sequence to detect that case (and only that case). These sorts of problems should be put to rest now.

However, something I'm going to have to deal with is the possibility of array elements not being contiguous -- for example, when performing operations over non-contiguous dimensions of a multi-dimensional array. This should be fairly simple: compare the step to the size of the basic datatype. If they match, we can go crazy with SIMD; otherwise we need to fall back to an element-wise path, or perhaps some code that shuffles disjoint elements together into a single SIMD vector and back out again. I think the shuffling will only make sense for operations that are computationally expensive and aren't bandwidth-limited.
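
The check itself is trivial; in Python terms it amounts to this (a sketch of the dispatch idea, not the framework's code):

import numpy as np

def choose_path(arr, step_bytes):
    if step_bytes == arr.dtype.itemsize:
        return 'simd'      # elements are packed: full-speed SIMD path
    return 'scalar'        # strided: element-wise fallback

a = np.zeros((4, 4), dtype=np.float32)
print(choose_path(a, a.strides[1]))   # along rows: 'simd'
print(choose_path(a, a.strides[0]))   # along columns: 'scalar'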

I've added a multiply ufunc supporting int32/int64/float32/float64. There don't seem to be any suitable SIMD integer multiply instructions, so I had to fall back to single-element general-purpose multiplication there. The float types are SIMD-ized though, and were actually an easy translation from the add ufuncs. On the integer side, I hit a well-hidden bug in CorePy: it turns out the operands of the two-register imul instruction are encoded backwards compared to the normal operand encoding order on x86. This took a while to find and fix, as there was no error, just registers with unexpected values.

Back to floats: I did have to deal with initializing an identity value in my 'accumulator' register for multiplication reductions. Before, I had just hardcoded the identity to zero (using the shorthand xor reg, reg). Clearly that won't work for multiply :) For the floating point types I used a little hack, since there is no direct way to load immediate values into SSE registers. I figured out the integer binary/hex representation of 1.0 as a float32 and as a float64, and loaded that as an immediate into a general purpose (GP) register. Then I used movd/movq to copy the value from the GP register to the XMM (SSE) register. From there I just treat it like a float, and everything works. Tricks like this are really nice on architectures that use the same registers for integer and float types. Anyway, I parameterized the loading of the identity value: the ufunc code generator now takes a function that generates the identity-loading code.
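
The two constants are easy to verify with the struct module:

import struct

# Integer bit patterns of 1.0, loaded as immediates into a GP register:
print(hex(struct.unpack('<I', struct.pack('<f', 1.0))[0]))  # 0x3f800000 (float32)
print(hex(struct.unpack('<Q', struct.pack('<d', 1.0))[0]))  # 0x3ff0000000000000 (float64)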

The general ufunc code generation framework is coming along. The general element-wise operation case is pretty good, I think; I've simplified/generalized it a little since the last post. The reduction code is complicated, but necessarily so -- in particular, the final atomic update is kind of ugly, but I don't have anything better right now. At this point any two-operand ufunc should be very easy to implement using the framework.

Next up for the framework is to revisit single-operand functions like cos(). I need to take a stab at generalizing all single-operand functions, then look at what that has in common with the generic two-operand code and see what more I can pull out into the framework.

The address alignment problem with multiple threads/cores remains; I'll look at that depending on how things go with the framework progress.

Monday, June 22, 2009

This past week I wrote code for the int32 and float64 types, giving me an addition ufunc that supports 32/64-bit ints and floats. The code for each case is pretty long; each datatype duplicates code that loads the function arguments, checks for cases like reduction and lengths that don't evenly divide the SIMD parallelization/loop unrolling, etc. So I started factoring this out and developing the beginnings of a framework for writing CorePy-based ufuncs with minimal effort and code duplication.

I succeeded in isolating most of the duplicated code, though the 'generic' code generation is a bit more complicated than before -- especially the specialized reduction code. This still needs some polish; I used two different approaches for factoring out the operation-specific code in the general element-wise case and in the reduction case. I will probably settle on the approach I used for reduction: the generic code takes a function (I can actually pass a single instruction) that accepts two registers (accumulator and source element) and generates the reduction code for a single element.
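
Abstractly, the pattern looks like this (names and the instruction-stream interface are illustrative, not the framework's actual API):

def gen_add(code, acc_reg, elem_reg):
    # A real version would emit an add instruction (e.g. addps) into the
    # CorePy instruction stream; a tuple stands in for that here.
    code.append(('add', acc_reg, elem_reg))

def gen_reduction(code, op_emitter, acc_reg, elem_reg, n_unroll):
    for _ in range(n_unroll):            # stand-in for the unrolled loop body
        code.append(('load_next', elem_reg))
        op_emitter(code, acc_reg, elem_reg)

code = []
gen_reduction(code, gen_add, 'xmm0', 'xmm1', 4)
print(code)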

I ran into a few problems while working on this stuff; I have no idea how I managed to miss these bugs until now. First, applying my ufunc to a single-element array gave bad results -- the test I was using to detect reduction was also catching the single-element case, and not handling it properly.

The other issue I've run into is due to the way I'm splitting work among multiple threads. I just did the easiest thing possible (halfway expecting to do something more elaborate later): divide the length evenly among the threads, giving the last thread additional work if the number of threads doesn't divide the length evenly. The problem is that individual threads get work segments starting at unaligned addresses whenever the segment length is not a multiple of 16 bytes. So this is the next thing I need to tackle -- I've been thinking about ways to split the work evenly while also rounding work segments to a multiple of 16 bytes.
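
A quick worked example of why the naive split breaks alignment:

# float32 elements, 1000 of them, 4 threads -> 250 elements per thread
n, itemsize, threads = 1000, 4, 4
chunk_bytes = (n // threads) * itemsize
for t in range(threads):
    print(t, (t * chunk_bytes) % 16)   # thread start offsets: 0, 8, 0, 8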

Sunday, June 14, 2009

I've been busy lately... big paper deadline (a CorePy paper!) and moving. When I can, I've been working on general code correctness/cleanliness/robustness. Management and movement of various data inside the 'corefunc' C module has been cleaned up. I now have a fairly generic routine for creating a ufunc from some type information and CorePy InstructionStreams. Multi-core is supported at the C level in a generic way; nothing there is ufunc-specific.

I didn't mention this before, but my code previously only worked correctly for input sizes that were a multiple of the number of threads used (1, 2, 4) and the number of unrolled iterations (1, 4, 8). Now my code works for any input size. At the thread level, I just have the last thread (if there's more than one) check whether the number of threads evenly divides the input length; if not, that thread increases the number of elements it processes to compensate. At the CorePy loop level, a special code path was needed to handle both the case where the input length is less than the number of unrolled loop iterations, and the case where the length is not a multiple of the unrolled iterations. Here I do some fancy (but somewhat messy) checks, and added code to handle one element at a time (as opposed to 4/8 at a time using SIMD).
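
In pure-Python terms, the loop structure now looks something like this (an illustration of the control flow, not the generated assembly):

import numpy as np

def add_unrolled(a, b, out, unroll=8):
    n = len(a)
    main = n - (n % unroll)
    for i in range(0, main, unroll):    # stand-in for the unrolled SIMD loop
        out[i:i+unroll] = a[i:i+unroll] + b[i:i+unroll]
    for i in range(main, n):            # one-element-at-a-time cleanup path
        out[i] = a[i] + b[i]

a = np.arange(13.0); b = np.ones(13); out = np.empty(13)
add_unrolled(a, b, out)
assert (out == a + b).all()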

It turns out reduction wasn't working right when multiple threads were used, but this is fixed now (it took a while -- evil bugs!). Each thread reduces (e.g. adds, subtracts) all the elements in its assigned input sub-range into a local register. When it finishes, the local value is atomically reduced into the single shared output value residing in memory. CorePy's support for the LOCK prefix (which makes some instructions atomic) wasn't complete, so I had to go fix that. I think I've now covered most/all of the instructions that support the LOCK prefix on x86_64.
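
The scheme, sketched with Python threads (the real code does the final combine with a LOCK-prefixed instruction rather than a mutex):

import threading
import numpy as np

def threaded_sum(arr, n_threads=4):
    result = [0.0]
    lock = threading.Lock()
    def worker(lo, hi):
        local = float(arr[lo:hi].sum())   # reduce sub-range into a local value
        with lock:                        # atomic combine into shared output
            result[0] += local
    chunk = len(arr) // n_threads
    threads = []
    for i in range(n_threads):
        hi = len(arr) if i == n_threads - 1 else (i + 1) * chunk
        threads.append(threading.Thread(target=worker, args=(i * chunk, hi)))
    for t in threads: t.start()
    for t in threads: t.join()
    return result[0]

print(threaded_sum(np.arange(1000.0)))   # 499500.0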

This brings up a side issue: because floating point arithmetic is not associative, having multiple threads accumulate into their own local values independently changes the order of operations for reductions. This produces a result that differs from NumPy's. I'm not sure what to do about this, if anything -- I don't think there's a way to maintain thread-level parallelism while performing the operations in the same order NumPy does. I think all I can do is provide a warning about this issue when multiple threads are used for reductions.
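
A tiny demonstration of the issue:

import numpy as np

a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0 -- float32 can't represent 1e8 + 1, so b + c == b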

Somewhat related to this GSoC work, I implemented ExtBuffer, an object implementing the Python 2.6 buffer interface (mainly for use with NumPy arrays). ExtBuffer provides guaranteed page-aligned memory, supports huge pages, and can optionally take a pointer to pre-allocated memory and use that (useful for overlaying special memory-mapped hardware regions, e.g. Cell SPU local store). Documentation here:

http://corepy.org/wiki/index.php?title=Extended_Array

I was actually anticipating that I'd have to write ExtBuffer as part of this GSoC project. The issue is that SIMD code typically requires 16-byte memory alignment for best performance. To my knowledge, malloc() (which I understand NumPy arrays use to allocate memory) only guarantees 8-byte alignment. However, I've just been using NumPy arrays as-is and have not (yet) run into any alignment issues. So I need to look into whether I can really rely on NumPy array data always being at least 16-byte aligned.
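
Checking is at least easy (a.ctypes.data is the array's base address):

import numpy as np

a = np.zeros(1024, dtype=np.float32)
print(a.ctypes.data % 16 == 0)   # True if this buffer happens to be 16-byte aligned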

Where to next?

I really just have an addition ufunc with support for the int64 and float32 types right now. This has sufficed for building the initial codebase, but now I need to expand it to support more data types and more operations. With that done, it'll be time to start refactoring duplicate code and designing/building more of a framework for creating new ufuncs quickly/easily (my halfway milestone!)

Implementing NumPy's ufuncs using CorePy