Monday, August 31, 2009

This will probably be my last post for the project. I'd like to thank again my mentor Chris Mueller and backup mentor Stéfan van der Walt; I think they did a lot to help make this project happen in the first place.

I've written a bunch of documentation on the CorePy wiki (a lot of it came from the code/README):

When any updates to this work occur, the wiki page will be updated.

The SciPy talk went pretty well, I think, though I wish I had more than 10 minutes to talk. A lot of important things got left out, and I felt there were places where I didn't explain things very well. Even so, it was great to be able to go to SciPy and give a talk -- thanks to the people who make SciPy happen! There seemed to be quite a few people interested in CorePy; I fielded a number of questions and had some nice discussions with various people.

Video recording of my actual talk:

As promised I used the CoreFunc framework to update Chris' particle simulation demo. Pretty cool stuff -- I was able to move the entire particle position/velocity update calculation into a ufunc. At 200,000 particles I was seeing ~100x speedup over the pure NumPy version, though my measurement wasn't particularly scientific. Check out the file in the project repository/branch (listed in an earlier post).

Sunday, August 16, 2009

Closing in on the finish here. I think I've achieved all the goals described in my original proposal, and ended up going a little further using the framework to write custom ufuncs. To refresh, the SVN is here:

I cleaned up the framework a little, and extended it to support multiple output arrays. In fact, any combination of inputs and outputs is supported as long as the total is <= 5. That limit comes from the x86_64 register file -- 2 registers are required for each input or output (a pointer and an element 'step', or increment, value), and I've managed to clear up 10 registers for this purpose. I could support 10 inputs/outputs if I pushed the step values onto the stack (or chose not to use them); I wouldn't expect anything more than a minor performance hit from this, if any.

I've experimented with a couple more ufuncs. Wanting to try conditionals, I implemented a 'greater' ufunc and ran into an issue -- I don't know of a good way to return a boolean array (of Python boolean objects) from a CorePy-based ufunc; all I have are the native integer/float types. If 1/0 integer values were suitable as truth values, the conditional ufuncs would be no problem.

Moving along, I tested the multiple-output support by writing an 'addsub' ufunc. This ufunc takes two arrays as inputs, and outputs two arrays. The first output array contains the sum of each pair of elements, while the second output array contains the difference. Nothing fancy there, just a simple test to make sure my code did what I wanted.

Mark Jeffree suggested vector normalization as a good example for a custom ufunc:

L = 1.0/sqrt(x**2 + y**2)
X = x * L; Y = y * L

This ended up being pretty fun to work with, as it is simple, but just involved enough to allow playing around with various optimization tricks. What I ended up doing was following the AMD optimization manual's suggestion of using the reciprocal square root instruction combined with a single iteration of Newton-Raphson, getting accuracy slightly below IEEE but with better performance than the sqrt instruction. It's kind of funny how a sequence of ~10 instructions is faster than one single instruction! Another advantage is that the longer instruction sequence pipelines much better when unrolling loop iterations. The sqrt instruction has a latency of something like 18 cycles (might be wrong, that's off the top of my head), and a new sqrt instruction can only be issued every 15 cycles. On the other hand, reciprocal square root with an iteration of Newton-Raphson uses a series of instructions each with 4 cycles or less of latency, all of which can be issued every cycle.

The approach the AMD optimization manual described combines reciprocal square root with Newton-Raphson to produce the square root as the result (they describe a similar approach for division). I actually wanted the reciprocal of the square root, to avoid dividing each component by the length, and was able to remove one multiplication from the Newton-Raphson step. For some reason SSE only has the reciprocal-square-root and reciprocal instructions for 32-bit floats, not 64-bit floats. So the 64-bit float implementation was pretty boring; I just used sqrt and division instructions.
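To illustrate the refinement step with a toy Python model (my own sketch of the idea, not the generated SSE code -- the simulated estimate error below is made up to stand in for the hardware's ~12-bit rsqrt approximation):

```python
import math

def refined_rsqrt(x):
    """One Newton-Raphson step refining a low-precision 1/sqrt(x) estimate.

    Models the rsqrtps-then-refine approach: start from a rough estimate,
    then apply y' = y * (1.5 - 0.5 * x * y * y), which roughly squares
    the relative error.
    """
    y = 1.0 / math.sqrt(x)
    y = y * (1.0 + 1e-3)                # simulate a rough hardware estimate
    return y * (1.5 - 0.5 * x * y * y)  # Newton-Raphson step for 1/sqrt(x)
```

Note that this directly yields the reciprocal square root, so normalizing a vector needs only multiplications afterward; producing sqrt(x) instead would take one extra multiply by x.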

I just spent a while writing a README and fairly detailed documentation of the gen_ufunc() and create_ufunc() functions that make up my framework. Hopefully things are clear enough that others can use this work. This more or less wraps up my project! Over the next week I'm going to try my hand at implementing a custom ufunc to do the entire computation for a particle simulation demo Chris Mueller (my mentor) wrote. I'm also attending SciPy next week; I will be giving a talk on recent CorePy work on Thursday, I believe. Since the talks got cut down to 10 minutes from 35, I'll barely have time to highlight the various things we have going on, spending the most time on what's most relevant to this audience -- my GSoC project.

Expect one more post next week with an update on the particle system and SciPy.

Monday, August 3, 2009

While messing around with my cosine ufunc I realized that NumPy's performance for cosine over float32's is horrible, while float64 performance is more normal:

cfcos type 'numpy.float32'
inp sizes 1024 10240 102400 1024000 3072000
numpy 0.7241 9.3472 115.6120 995.1050 3027.2000
corepy 1 0.0191 0.1521 1.6742 16.3748 50.1912
corepy 2 0.0160 0.0811 0.8180 8.2700 25.5480
corepy 4 0.0219 0.0889 0.8180 4.1308 12.9910

cfcos type 'numpy.float64'
inp sizes 1024 10240 102400 1024000 3072000
numpy 0.0970 0.9248 9.3892 93.8919 277.4410
corepy 1 0.0370 0.3450 3.4950 35.4860 106.4019
corepy 2 0.0319 0.1729 1.7591 17.8859 53.5469
corepy 4 0.0288 0.0958 0.8941 8.9951 26.8750

I started a thread on numpy-discussion, and it's gotten a bit of attention. :) I'm not sure what causes this right now, still have some things to try out. See the thread for details:

The main thing is that my float32 cosine wasn't so fast after all -- NumPy is just really slow, at least on some installations (including mine). Some other users weren't able to reproduce the slowness.

I've implemented a few different ufuncs now, so figure it's worth summarizing them and the types they work over:

add: int{8,16,32,64}, float{32,64}
multiply: int{8,16,32,64}, float{32,64}
cosine (faster, less accurate version): float{32,64}
cosine (using x87, slower but accurate): float{32,64}
maximum: int{32,64}, float{32,64}

All of these were written using the framework I've developed, of course. As I ran my test code for writing this post, I found that the x87-based cosine is segfaulting at 4 threads/cores. It wasn't doing this earlier... will have to investigate.

Maximum was fun to do, particularly for int32. I spent time iterating on the code, and came up with something pretty creative. I use SSE compare instructions to create a sort of selection mask, then use bit-wise operations to choose the maximum values without a branch. All SIMD, too. Unfortunately I don't have the instructions to do this for int64 (no quadword compares), but I can/will do the same thing for int8/int16. For floats, SSE has a specific instruction for selecting the maximum values from two vectors, so I just used that. Very easy to take advantage of the hardware. :)
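The compare-and-mask trick can be sketched in plain Python (a simplified unsigned model of the idea -- the real SSE pcmpgtd compare is signed, and operates on four lanes at once):

```python
def branchless_max_u32(a, b):
    """Branch-free max of two unsigned 32-bit values, mimicking the SSE
    compare-and-mask pattern: the compare produces an all-ones/all-zeros
    mask, and bitwise ops select the winner without any branch."""
    mask = -int(a > b) & 0xFFFFFFFF        # all ones if a > b, else zero
    return (a & mask) | (b & ~mask & 0xFFFFFFFF)
```

In the SIMD version the same three operations (compare, AND/ANDN, OR) select the maximum in every vector lane simultaneously.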

Future work: I'm planning on doing a couple more ufuncs, particularly one logical comparison and one bitwise (AND, OR) operation. The idea is to build up a collection of different types of ufuncs to show that the framework works and can be used to develop more.

After that I'm going to look at using the framework to build more complex, custom ufuncs. Initially just something like fused multiply-add, and then maybe some entire algorithms (particle simulation?) in a single ufunc. The idea here being that we can use the framework to do any sort of computation, and use it to reduce memory traffic, leverage the CPU functionality, and go faster without much trouble.

Sunday, July 19, 2009

Another two weeks worth of stuff included in this post.

I implemented support for the int8 and int16 types for add/multiply. The framework had to be extended somewhat. Sounds like no big deal, but the 8bit stuff uncovered some bugs that took a while to fix -- mainly due to the fact that I'm unrolling 16/32 elements worth of operations in the main loop, which is more than the 4/8 I had been doing until now.

Finally went and reworked the multicore support to do things properly. I split the work into double-quadword (16 byte) sized blocks. This is the size of a single SIMD vector on x86, and it is important to work on 16-byte aligned data for better performance. So what I wanted to do was make sure each thread's sub-range was 16-byte aligned regardless of datatype and total array length. Getting all the math right was pretty tricky! I think (hope :)) I have it all correct now. The last thread tends to get extra work when the array length is not a multiple of a 16-byte block times the number of threads -- I think this is OK though; the extra work is never more than (num_threads - 1) 16-byte blocks. How many elements that is depends on the size of the data type -- a 16-byte block contains only two int64/float64s, but 16 int8s.
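The splitting scheme described above can be modeled like this (a hypothetical sketch, not the actual CoreFunc code; the function name and signature are made up):

```python
def split_aligned(length, itemsize, nthreads, block=16):
    """Split `length` elements among threads so every thread's sub-range
    starts on a 16-byte boundary (assuming the array itself is aligned).
    The last thread absorbs any leftover tail elements."""
    per_block = block // itemsize          # elements per 16-byte block
    nblocks = length // per_block          # whole blocks available
    per_thread = nblocks // nthreads       # whole blocks per thread
    ranges, start = [], 0
    for t in range(nthreads):
        count = per_thread * per_block
        if t == nthreads - 1:              # last thread takes the remainder
            count = length - start
        ranges.append((start, start + count))
        start += count
    return ranges
```

Every non-final sub-range is a whole number of 16-byte blocks, so each thread's starting byte offset stays a multiple of 16 for any datatype and length.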

Something I've also wanted to do for a while was to set up persistent threads and farm out work to them, instead of creating/joining threads in every ufunc call. This reduces the overhead incurred by using multiple threads in a ufunc, which is visible for smaller array sizes (< ~10,000-100,000 elements). However I'm not sure this is worthwhile, as the multi-thread performance on these smaller array lengths is still worse than just using a single core. I think I need to investigate the overheads more.
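The persistent-thread idea looks roughly like this in pure Python (a minimal sketch of the pattern, not the actual C-level implementation; the class and method names are made up):

```python
import threading, queue

class PersistentPool:
    """Worker threads created once and fed through a queue, so each call
    only pays for queue traffic instead of thread creation/joining."""
    def __init__(self, nthreads):
        self.tasks = queue.Queue()
        self.results = queue.Queue()
        for _ in range(nthreads):
            t = threading.Thread(target=self._worker, daemon=True)
            t.start()

    def _worker(self):
        while True:
            fn, args = self.tasks.get()    # block until work arrives
            self.results.put(fn(*args))

    def run(self, fn, arg_sets):
        for args in arg_sets:
            self.tasks.put((fn, args))
        return [self.results.get() for _ in arg_sets]
```

Each ufunc call would enqueue one sub-range per worker and collect the completions, amortizing thread startup across the life of the module.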

I have also generalized the framework to support single-input ufuncs. This was actually easier than I expected, and ended up doing most of the work needed to support > 2 inputs as well. All that remains for >2 input support is register assignment (easy) and sorting out whether reduction makes sense, and how it should be done (probably not too hard).

I revived the cosine ufunc, which had bit-rotted some, and updated it to use the new single-input support in the framework. My testing harness fails the cosine ufunc, but hand inspection shows the results are correct if you tolerate some rounding error. More investigation needed here.

Side note -- I discovered AMD's Software Optimization Guide for AMD64 Processors -- looks like it has some good information in it, including actual instruction latencies! Find it here:

Sunday, July 5, 2009

Okay, two weeks worth of updates here. I've been mostly (still) ironing out various bugs and corner cases.

First off, I ran into some problems with add/multiply reduction over integer types -- in this case only, NumPy converts the input array to the native integer word size (64-bit on x86_64) for the reduction. Rather than re-describing the whole issue (including its solution), I'll just link to the numpy-discussion thread:

Last time I mentioned having problems with single-element arrays. I finally just sat down and re-thought out the exact condition in which a reduction occurs, and wrote a code sequence to detect that case (and only that case). These sorts of problems should be put to rest now.

However, something I'm going to have to deal with is the possibility of array elements not being contiguous -- for example, when performing operations over non-contiguous dimensions of a multi-dimensional array. This should be fairly simple -- compare the step to the size of the basic datatype. If they match, we can go crazy with SIMD; otherwise we need to fall back to an element-wise case, or perhaps some code to shuffle disjoint elements together into a single SIMD register, and back out again. I think the shuffling will only make sense for operations that are computationally expensive, and aren't bandwidth-limited.
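The shuffle idea amounts to a gather/scatter pair around the SIMD kernel -- a toy Python model (element offsets rather than byte steps, purely illustrative):

```python
def gather_strided(buf, start, step, count):
    """Pull `count` strided elements into a contiguous list, so the dense
    SIMD path could process them."""
    return [buf[start + i * step] for i in range(count)]

def scatter_strided(buf, start, step, values):
    """Write processed results back out to the same strided locations."""
    for i, v in enumerate(values):
        buf[start + i * step] = v
```

The extra copies are only worth it when the per-element computation costs more than the gather/scatter traffic, which matches the bandwidth-limited caveat above.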

I've added a multiply ufunc, supporting int32/int64/float32/float64. There don't seem to be any suitable SIMD integer multiply instructions, so I had to fall back to single-element general-purpose multiplication there. The float types are SIMD-ized though, and were actually an easy translation from the add ufuncs. On the integer side, I hit a well-hidden bug in CorePy. Turns out that the operands for the two-register imul instruction are encoded backwards compared to the normal operand encoding order on x86. This took a while to find and fix, as there was no error, just registers with unexpected values.

Back to floats, I did have to deal with initializing an identity value into my 'accumulator' register for multiplication reductions. Before, I had just hardcoded the identity to zero (using a shorthand: xor reg, reg). Clearly this won't work for multiply though :) For the floating point types, I used a little hack, since there is no direct way to load immediate values into SSE registers. I figured out the integer binary/hex representation of a 1.0 as a float32 and float64 type, and loaded that as an immediate into a general purpose (GP) register. Then I used movd/movq to copy the value from the GP register to the XMM (SSE) register. From there I just treat it like a float, and everything works. Tricks like this are really nice on architectures that use the same registers for integer and float types. Anyway, I parameterized the loading of the identity value by requiring that a function be passed in to the ufunc code generator that generates the identity-loading code.
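The bit patterns themselves are easy to derive in Python with the struct module -- these are the immediates that get loaded into the GP register:

```python
import struct

# IEEE 754 bit patterns of 1.0, suitable for loading into a GP register
# and then moving to an XMM register with movd (32-bit) or movq (64-bit).
one_f32 = struct.unpack('<I', struct.pack('<f', 1.0))[0]
one_f64 = struct.unpack('<Q', struct.pack('<d', 1.0))[0]

print(hex(one_f32))   # 0x3f800000
print(hex(one_f64))   # 0x3ff0000000000000
```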

The general ufunc code generation framework is coming along. The general element-wise operation case is pretty good I think; I've simplified/generalized it a little since last post. The reduction code I think is complicated, but necessarily so. In particular the final atomic update is kind of ugly, but I don't have anything better right now. At this point any two operand ufunc should be very easy to implement using the framework.

Next along the lines of the framework is to revisit single-operand functions like cos(). I need to take a stab at generalizing all single-operand functions, then look at what that has in common with the generic two-operand code and again see what I can generalize out into a framework.

The address alignment problem with multiple threads/cores remains; I'll look at that depending on how things go with the framework progress.

Monday, June 22, 2009

This past week I wrote code for int32 and float64 types, giving me an addition ufunc that supports 32/64bit ints and floats. The code for each case is pretty long; each datatype duplicates code that loads the function arguments, checks for cases like reduction and lengths that don't evenly divide the SIMD parallelization/loop unrolling, etc. So I started factoring this out and developing the beginnings of a framework for writing CorePy-based ufuncs with minimal effort and code duplication.

I succeeded in isolating most of the duplicate code, though the 'generic' code generation is a bit more complicated than before -- especially the specialized reduction code. This needs some polish still; I used two different approaches for factoring out the operation-specific code for general element-wise operations and reductions. I will probably use the approach I did for reduction -- the generic code requires a function (I can actually pass a single instruction) that takes two registers as arguments (accumulator and source element) and generates reduction code for a single element.

I ran into a few problems while working on this stuff; I have no idea how I managed to miss these bugs until now. First, applying my ufunc to a single-element array gave bad results -- the test I was using to detect reduction was also catching this single-element case, and not working properly.

The other issue I've run into is due to the way I'm splitting work among multiple threads. I just did the easiest thing possible (halfway expecting to do something more elaborate later) -- divide the length evenly among each thread, giving the last thread additional work if the number of threads does not divide the length evenly. The problem with this is that individual threads will get work segments with unaligned memory if the work segment length is not a multiple of 16 bytes. So this problem is the next thing I need to tackle -- I've been thinking about ways to split the work evenly, while also rounding work segments to a multiple of 16 bytes.

Sunday, June 14, 2009

Have been busy lately -- big paper deadline (the CorePy paper!) and moving. When I can, I've been working on general code correctness/cleanliness/robustness. Management and movement of various data inside the 'corefunc' C module has been cleaned up. I now have a fairly generic routine for creating a ufunc from some type information and CorePy InstructionStreams. Multi-core is supported at the C level in a generic way; nothing there is ufunc-specific.

I didn't mention this before, but my code previously only worked correctly for input sizes that were a multiple of the number of threads used (1, 2, 4) and the number of iterations that were unrolled (1, 4, 8). Now my code works for any size input. At the thread level, I just had the last thread (if more than one) check if the number of threads evenly divided the input length. If not, that last thread increased the number of elements it processed to compensate. At the CorePy loop level, a special code path was needed to handle both the case where the input length is less than the number of unrolled loop iterations, and when the length was not a multiple of the unrolled iterations. Here, I do some fancy (but somewhat messy) checks, and added unrolled code to handle one element at a time (as opposed to 4/8 at a time using SIMD).
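The loop structure described above -- an unrolled main body plus a scalar cleanup path -- looks like this as a toy Python model (the unrolled body stands in for the SIMD code; unroll factor fixed at 4 for clarity):

```python
def unrolled_sum(data):
    """Sum with a 4-way unrolled main loop plus a one-element-at-a-time
    tail loop, handling lengths that don't divide evenly (including
    lengths shorter than the unroll factor)."""
    n = len(data)
    main = n - (n % 4)       # elements covered by the unrolled body
    total, i = 0, 0
    while i < main:          # unrolled/SIMD-style body: 4 per iteration
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3]
        i += 4
    while i < n:             # scalar tail for the leftover elements
        total += data[i]
        i += 1
    return total
```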

Turns out reduction wasn't working right when multiple threads are used, but this is fixed now (took a while! evil bugs). Each thread reduces (e.g. adds, subtracts) all the elements in its assigned input sub-range into a local register. When complete, the local value is atomically reduced into the single shared output value residing in memory. CorePy's support for the LOCK prefix (which makes some instructions atomic) wasn't complete, so I had to go do that. I think I've covered most/all of the instructions supporting the LOCK prefix on x86_64.

This brings up sort of a side issue -- because floating point is not associative, the use of multiple threads accumulating to their own local values independently means the order of operations for reductions is changed. This causes a result that is different from NumPy's. I'm not sure what to do about this, if anything -- I don't think there is a way to maintain thread-level parallelism and perform the operations in the same order as NumPy. I think all I can do is provide a warning about this issue when using multiple threads to do reductions.
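A quick demonstration of why the order matters -- with per-thread partial sums, the additions can effectively happen in a different grouping than NumPy's sequential loop:

```python
# Floating point addition is not associative: regrouping changes results.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # a and b cancel exactly first, leaving 1.0
right = a + (b + c)   # the 1.0 is absorbed into -1e16 and lost, giving 0.0

print(left, right)
```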

Sort of related to this GSOC work, I implemented ExtBuffer, an object implementing the Python 2.6 buffer interface (mainly for use with NumPy arrays). ExtBuffer provides guaranteed page-aligned memory, supports huge pages, and can optionally just take a pointer to some pre-allocated memory and use that (useful for overlaying special memory-mapped hardware regions, e.g. Cell SPU local store). Documentation here:

I was actually anticipating that I would have to write ExtBuffer as part of this GSoC project. The issue is that SIMD code typically requires 16-byte memory alignment for best performance. To my knowledge, malloc() (which I understand NumPy arrays use to allocate memory) only guarantees 8-byte alignment. However, I've just been using NumPy arrays as-is and have not (yet) run into any alignment issues. So I need to look into this and see whether I can really rely on NumPy array data always being at least 16-byte aligned.

Where to next?

I really just have an addition ufunc with support for int64 and float32 types right now. This has sufficed for building the initial codebase, but now I need to expand it to support more data types and more operations. With that done, it'll be time to start refactoring duplicate code and designing/building more of a framework for building new ufuncs quickly/easily (my halfway milestone!)

Sunday, May 24, 2009

I've been working in two areas. First, filling out the 'corefunc' approach using the ufunc C API to build my own ufuncs. I've added support for multiple cores (1, 2, 4 threads), and have an add ufunc that supports 64-bit integers and 32-bit floats. Performance with the code I'm currently using looks like this:

cfadd numpy.int64
inp sizes 1024 10240 102400 1024000 3072000 4096000 5120000 6144000
numpy 0.0115 0.0443 1.3969 14.4416 42.4222 55.5169 101.4473 118.8185
corepy 1 0.0103 0.0608 0.9395 9.7391 29.0794 37.4023 109.7189 132.7189
corepy 2 0.0342 0.0810 0.6061 8.7033 25.1321 33.2698 85.4791 103.1184
corepy 4 0.0584 0.0903 0.6499 8.8539 26.3412 34.7378 72.2230 86.9815

numpy 0.0094 0.0421 0.4188 5.5379 16.5510 22.0293 27.4446 32.9556
corepy 1 0.0080 0.0180 0.1657 4.0605 12.1317 16.0499 20.0882 24.0584
corepy 2 0.0304 0.0363 0.1621 2.7398 7.0745 9.3564 11.6919 13.9961
corepy 4 0.0566 0.0601 0.1775 3.0038 7.5964 9.3019 11.8332 13.2176

cfadd numpy.float32
inp sizes 1024 10240 102400 1024000 3072000 4096000 5120000 6144000
numpy 0.0089 0.0404 0.5459 8.8722 26.5132 24.5601 31.0621 36.2360
corepy 1 0.0085 0.0330 0.3583 4.7290 14.0452 17.5809 21.9987 24.8356
corepy 2 0.0327 0.0436 0.2639 4.4774 12.9543 16.7708 21.0962 24.7302
corepy 4 0.0550 0.0664 0.3122 4.4057 13.4197 17.8114 22.0150 25.7672

numpy 0.0120 0.0695 0.6778 6.9267 20.7328 27.6145 34.3973 41.3703
corepy 1 0.0080 0.0263 0.2130 2.2117 6.5741 8.7050 10.8356 12.9950
corepy 2 0.0334 0.0403 0.1589 1.9634 5.3451 6.5962 7.1621 8.7798
corepy 4 0.0550 0.0659 0.1835 1.8873 4.3171 5.9614 7.2393 7.9173

(I rediscovered the 'pre' HTML tag, hooray!)

Nice speedups in some places; some other places aren't so great (small and large arrays). On the small side, my guess is that my overhead is too high; on the large side, I think I'm hitting some sort of (non)cache effect. I do better on floats too, for some reason. Maybe this is due to 32-bit elements instead of 64-bit elements? Need to do more testing.

Per a little discussion on the numpy-discussion list, I tried a cos ufunc. The idea being that cos is more computationally intensive than something like addition, and memory bandwidth isn't necessarily the performance limit. I used the SSE code from here:

And I added a little code to wrap numbers outside the range of 0-2PI. After some debugging I believe my code is at least somewhat correct, and I can benchmark it:

inp sizes 1024 10240 102400 1024000 3072000
numpy 0.7228 9.5713 117.2879 1019.7153 3099.1759
corepy 1 0.0144 0.0898 0.9272 9.6835 29.0826
corepy 2 0.0363 0.0980 0.6339 5.6012 15.2627
corepy 4 0.0615 0.1243 0.6605 4.7141 14.1573

100x (200x at 2 threads) speedup, nice :) This is almost too good to be true though; something can't be right -- cos performance is approaching that of my addition ufunc with multiple cores on larger arrays. This could sort of make sense with a fast enough CPU, though? I'm going to spend some time understanding my code better and making sure it's correct, and figure out what exactly NumPy is doing that takes so long. I wonder if the cos is computed in Python instead of using a C routine or the x87 cos instruction?

Sunday, May 17, 2009

I spent some time reading the NumPy book and reference manual. Initially I was trying to find information on how the existing ufuncs are implemented, but I ended up just reading up on NumPy arrays and ufuncs in general.

Looking at the C API, I found a couple interesting functions. PyUFunc_FromFuncAndData is used to create new ufunc objects, while PyUFunc_RegisterLoopForType modifies an existing ufunc to use a provided function for operating on a particular data type. The latter is certainly useful; I can override portions of existing ufuncs with my own implementation as I progress. The functions that must be provided for each data type take one or two (or more) 1-D arrays as input, and output one 1-D array; the ufunc framework takes care of extending this to more complicated array shapes.

Finding this made me realize another approach to implementing ufuncs -- rather than building new ufunc objects from the ground up (like I have before), I can use this API. The trick is connecting CorePy code generated in Python to some C code that calls these functions to set up some ufuncs. Internally, CorePy creates an array containing the machine code representing the generated code. One of the CorePy runtime modules (written in C) executes the machine code by casting the array to a function pointer and calling it. How convenient -- the ufunc C API requires a function pointer implementing the ufunc's operation over a 1-D array. So I made a proof-of-concept C module containing one method. This method takes the address of an array of machine code as input, and uses it as a function pointer to create a new ufunc supporting addition over integers. Sure enough, this works pretty well; my quick implementation performs roughly equivalently to the existing add ufunc.

Figuring out how reduce is supported was a little tricky -- the secret is that a 0-length step is used for the output, and one input. If the 1-D array function is written properly, no extra work is needed to support reduce. I took advantage of this though, and wrote a separate loop that reduces the number of memory accesses (therefore being faster).
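The zero-step trick is easiest to see in a toy model of the 1-D inner loop (Python lists and element steps instead of pointers and byte steps; the function name is made up):

```python
def add_inner_loop(x, xs, y, ys, out, os, n):
    """Model of a ufunc 1-D inner loop: each operand is a buffer plus a
    per-element step. Written so that zero steps just keep re-reading or
    re-writing the same slot."""
    xi = yi = oi = 0
    for _ in range(n):
        out[oi] = x[xi] + y[yi]
        xi += xs; yi += ys; oi += os
    return out

# Elementwise: all steps are 1. Reduce: the output and one input alias
# the same one-element accumulator, both with step 0.
acc = [0]
add_inner_loop([1, 2, 3, 4], 1, acc, 0, acc, 0, 4)
print(acc[0])   # 10
```

Because the accumulator is both an input and the output with step 0, each iteration folds the next element into the running total -- no reduce-specific code needed, exactly as described above.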

I think this approach is the way to go. Multi-core support will need a C wrapper function to fork off several threads to do the computation. Something I might do is look at having persistent threads, to move the thread-creation overhead out of each ufunc call. The next step from where I am now, though, is to build/extend a couple of existing ufuncs with CorePy routines for several different datatypes.

Code is committed in the SVN repo (r677) mentioned in a previous post, particularly in the 'corefunc' directory. It's a little messy right now, but the code shows that the idea discussed here will work.

Sunday, May 10, 2009

I implemented a quick multiplication ufunc by copying my existing add ufunc and changing it to do multiply. I'm not using SSE here for dtype=int64; I'm not seeing an instruction to multiply 64-bit integers. No big deal, the general-purpose multiply instruction gets the job done. With two ufuncs going I spent some time developing and playing with a testing suite that covers both correctness and performance. The code is in my SVN branch. Performance output looks like this:

inp sizes 1024 10240 102400 1024000 10240000
add numpy 0.009 0.043 1.217 12.059 158.210
add corepy 1 0.050 0.100 1.312 11.653 171.419
add corepy 2 0.080 0.112 1.070 9.352 109.685
add corepy 4 0.179 0.206 1.053 9.355 98.217

add numpy 0.009 0.042 0.377 4.154 41.192
add corepy 1 0.049 0.057 0.212 3.901 37.618
add corepy 2 0.068 0.071 0.178 3.084 23.808
add corepy 4 0.155 0.156 0.255 2.950 22.188

inp sizes 1024 10240 102400 1024000 10240000
mul numpy 0.008 0.047 1.078 12.415 162.024
mul corepy 1 0.057 0.122 1.040 8.687 173.505
mul corepy 2 0.066 0.098 0.424 7.309 126.512
mul corepy 4 0.124 0.173 0.625 8.103 115.533

mul numpy 0.013 0.058 0.528 5.486 54.841
mul corepy 1 0.041 0.056 0.152 1.078 10.369
mul corepy 2 0.068 0.085 0.129 0.751 6.586
mul corepy 4 0.158 0.156 0.221 0.849 5.797

Times are in milliseconds, and are the average of 10 runs. For the CorePy cases, I'm running with 1/2/4 cores (basic parallelization over cores was pretty easy!). CorePy's overhead is visible, as mentioned in my previous post. There are some unexpected things in these numbers too -- for example, multiplication is faster than addition for CorePy but not NumPy. I don't know why; multiply instructions take longer to execute, and I am otherwise doing the same thing in both cases.

While testing ufuncs with various inputs (and input sizes) I came across some differences in behavior when values reach/exceed 2**63/64 (I'm using dtype=int64). This sent me off trying to understand what NumPy's behavior is. What I expected was for integers to wrap at 2**63 (signed) or 2**64 (unsigned), but what I see is this:

>>> x = numpy.array((2**62,) * 1)
>>> hex(numpy.add.reduce(x))
>>> x = numpy.array((2**62,) * 2)
>>> hex(numpy.add.reduce(x))

Huh? Last time I checked, 4 + 4 = 8, not -8... am I missing something? Will have to check with the NumPy people to see what's going on.
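For what it's worth, a negative result here is at least consistent with plain two's-complement wraparound at 64 bits -- a quick model (my own sketch, not NumPy's code):

```python
def wrap_int64(v):
    """Interpret an arbitrary Python int as a two's-complement int64."""
    v &= (1 << 64) - 1
    return v - (1 << 64) if v >= (1 << 63) else v

# 2**62 + 2**62 = 2**63, which overflows the signed int64 range
print(wrap_int64(2**62 + 2**62))   # -9223372036854775808, i.e. -2**63
```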

Tuesday, May 5, 2009

I set up an SVN repository for my project here:

(edit: fixed link)

Not much there yet, though I did add a quick implementation of an 'add' ufunc I wrote back around Christmas and started playing with again to re-familiarize myself. When implementing anything in CorePy, one of the first things you need to do is figure out how to get a pointer to the raw data -- in this case, a pointer to a NumPy array object's underlying data. In Python, I found that I could do this:

params.p1 = a.__array_interface__['data'][0]

With this in hand, I wrote a quick 'add' ufunc just to get something working, along with a correctness test and performance benchmark. I'm finding that on the system I'm using (dual dual-core 2GHz Opteron), adding two 10,000-element arrays together averages about 0.07ms, while NumPy's ufunc averages about 0.04ms. Some playing around with different input sizes shows that the CorePy version catches up as the input size increases -- so we have some initial invocation overhead. Hmm, that pointer lookup seems kind of expensive: a member lookup, a string-based dictionary reference, and another array reference.

The first thing I tried was using a C module to grab the pointer from the NumPy array; the code looks something like this:

static PyObject* get_data_ptr(PyObject* self, PyObject* args) {
    PyArrayObject* arr;
    if (!PyArg_ParseTuple(args, "O!", &PyArray_Type, &arr))
        return NULL;
    return PyInt_FromLong((long)arr->data);
}

This was faster than extracting the pointer directly from Python, but I shouldn't have to resort to writing a C module just to get a pointer out of a data structure! This is where I left off back around Christmas. Looking at this again, I realized I should be able to get at the pointer inside the array object directly in assembly, also. Examining a disassembly of the above C code showed the 'data' pointer is 16 bytes from the start of the PyArrayObject struct. So I figured, if I could pass the address of the Python array object itself to the assembly code, I could extract the pointer there. Conveniently, id(foo) gives the address of object foo. Easy enough -- I just pass that through and do this in the assembly: mov(rax, MemRef(rdi, 16))

Now this is pretty crazy, and probably not the best idea in the world -- if the data pointer gets moved inside the array object, this code will be wrong. It works though, and is a little faster: 0.06ms at 10,000 elements. So, a question for anyone reading -- is there a better/faster/more reliable way to get at the data pointer?

Another issue this has uncovered is the relative performance of the CorePy ufunc to the NumPy ufunc -- NumPy is faster for smaller array sizes, while CorePy eventually catches up and outperforms in the >10,000,000 element range. I'm out of time for now, but next I'll be looking at other sources of overhead, and consider other ways to improve performance.

Monday, April 27, 2009

My project proposal, Implementing NumPy's ufuncs using CorePy was accepted for GSoC 2009. I'm pretty excited on a number of levels -- I get to experience GSoC, become more involved in the general Python community, and do the actual project itself.

Some background:

NumPy is a very general Python library for doing scientific programming. In particular, NumPy contains an N-dimensional array class and a set of operations over these arrays called ufuncs. These operations include basic arithmetic (add, multiply), math functions (sin, cos), logic/comparison functions (min, max, greater than), and more advanced operations like reduce and accumulate.

CorePy provides a framework for generating and executing high performance computation kernels (assembly code) at runtime directly in Python. Runtime-generated code can be specialized/optimized in domain-specific ways, using parameters that might only be available at runtime (think generative programming).

Put these two pieces together and you have my project. NumPy's ufuncs serve as a base for developing more complex array-based operations and algorithms, making them an ideal area for significant performance optimization. CorePy is an excellent tool for this task, especially due to the way it exposes the underlying processor architecture. x86 SSE instructions can (will) be used to develop vectorized ufunc operations, and multiple cores leveraged to obtain even more parallelism.

I'll keep updating here as I progress through the project.