Sunday, May 24, 2009

I've been working in two areas. First, filling out the 'corefunc' approach using the ufunc C API to build my own ufuncs. I've added support for multiple cores (1, 2, 4 threads), and have an add ufunc that supports 64-bit integers and 32-bit floats. Performance with the code I'm currently using looks like this:

cfadd numpy.int64
inp sizes 1024 10240 102400 1024000 3072000 4096000 5120000 6144000
numpy 0.0115 0.0443 1.3969 14.4416 42.4222 55.5169 101.4473 118.8185
corepy 1 0.0103 0.0608 0.9395 9.7391 29.0794 37.4023 109.7189 132.7189
corepy 2 0.0342 0.0810 0.6061 8.7033 25.1321 33.2698 85.4791 103.1184
corepy 4 0.0584 0.0903 0.6499 8.8539 26.3412 34.7378 72.2230 86.9815

numpy 0.0094 0.0421 0.4188 5.5379 16.5510 22.0293 27.4446 32.9556
corepy 1 0.0080 0.0180 0.1657 4.0605 12.1317 16.0499 20.0882 24.0584
corepy 2 0.0304 0.0363 0.1621 2.7398 7.0745 9.3564 11.6919 13.9961
corepy 4 0.0566 0.0601 0.1775 3.0038 7.5964 9.3019 11.8332 13.2176

cfadd numpy.float32
inp sizes 1024 10240 102400 1024000 3072000 4096000 5120000 6144000
numpy 0.0089 0.0404 0.5459 8.8722 26.5132 24.5601 31.0621 36.2360
corepy 1 0.0085 0.0330 0.3583 4.7290 14.0452 17.5809 21.9987 24.8356
corepy 2 0.0327 0.0436 0.2639 4.4774 12.9543 16.7708 21.0962 24.7302
corepy 4 0.0550 0.0664 0.3122 4.4057 13.4197 17.8114 22.0150 25.7672

numpy 0.0120 0.0695 0.6778 6.9267 20.7328 27.6145 34.3973 41.3703
corepy 1 0.0080 0.0263 0.2130 2.2117 6.5741 8.7050 10.8356 12.9950
corepy 2 0.0334 0.0403 0.1589 1.9634 5.3451 6.5962 7.1621 8.7798
corepy 4 0.0550 0.0659 0.1835 1.8873 4.3171 5.9614 7.2393 7.9173

(I rediscovered the 'pre' HTML tag, hooray!)

Nice speedups in some places; other places aren't so great (very small and very large arrays). On the small side my guess is that my invocation overhead is too high; on the large side I think I'm hitting some sort of (non-)cache effect. I also do better on floats, for some reason -- maybe because the elements are 32-bit instead of 64-bit? Need to do more testing.

Per a little discussion on the numpy-discussion list, I tried a cos ufunc. The idea is that cos is more computationally intensive than something like addition, so memory bandwidth isn't necessarily the performance limit. I used the SSE code from here:

And added a little code to reduce arguments outside the range 0 to 2*pi back into it. After some debugging I believe my code is at least somewhat correct, and I can benchmark it:

inp sizes 1024 10240 102400 1024000 3072000
numpy 0.7228 9.5713 117.2879 1019.7153 3099.1759
corepy 1 0.0144 0.0898 0.9272 9.6835 29.0826
corepy 2 0.0363 0.0980 0.6339 5.6012 15.2627
corepy 4 0.0615 0.1243 0.6605 4.7141 14.1573

100x (200x at 2 threads) speedup, nice :) This is almost too good to be true though, something can't be right -- cos performance is approaching that of my addition ufunc with multiple cores on larger arrays. This could sort of make sense with a fast enough CPU, though? I'm going to spend some time understanding my code better and making sure it's correct, and figure out what exactly NumPy is doing that takes so long. I wonder if the cos is computed in Python instead of using a C routine or the x87 cos instruction?
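The range-reduction step mentioned above might look something like this as a scalar Python sketch (the function name is mine; the real code does this in SSE, four floats at a time):

```python
import math

# Sketch of the range-reduction step (my own scalar version; the real
# code does this in SSE, four floats at a time): fold the argument into
# [0, 2*pi) before evaluating the polynomial cosine.
def range_reduce(x):
    two_pi = 2.0 * math.pi
    return x - two_pi * math.floor(x / two_pi)

# cos is 2*pi-periodic, so the reduced argument gives the same result
# (up to floating-point error):
print(abs(math.cos(range_reduce(10.0)) - math.cos(10.0)) < 1e-9)  # -> True
```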

Sunday, May 17, 2009

I spent some time reading the NumPy book and reference manual. Initially I was trying to find information on how the existing ufuncs are implemented, but I ended up just reading up on NumPy arrays and ufuncs in general.

Looking at the C API, I found a couple of interesting functions. PyUFunc_FromFuncAndData is used to create new ufunc objects, while PyUFunc_RegisterLoopForType modifies an existing ufunc to use a provided function for operating on a particular data type. The latter is certainly useful; I can override portions of existing ufuncs with my own implementations as I progress. The functions that must be provided for each data type take one or two (or more) 1-D arrays as input, and output one 1-D array; the ufunc framework takes care of extending this to more complicated array shapes.

Finding this made me realize another approach to implementing ufuncs -- rather than building new ufunc objects from the ground up (like I have before), I can use this API. The trick is connecting CorePy code generated in Python to some C code that calls these functions to set up some ufuncs. Internally, CorePy creates an array containing the machine code representing the generated code. One of the CorePy runtime modules (written in C) executes the machine code by casting the array to a function pointer and calling it. How convenient: the ufunc C API requires a function pointer implementing the ufunc's operation over a 1-D array. So I made a proof-of-concept C module containing one method. This method takes an address to an array of machine code as input, and uses that as a function pointer to create a new ufunc supporting addition over integers. Sure enough, this works pretty well; my quick implementation performs roughly equivalently to the existing add ufunc.
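The address-to-function-pointer trick can be sketched from Python with ctypes; here libc's abs() stands in for a buffer of generated machine code, and the use of CDLL(None) is a Linux/glibc-flavored assumption of mine:

```python
import ctypes

# Sketch of the address-to-function-pointer trick. CorePy casts the
# address of its machine-code buffer to a function pointer and calls it;
# here libc's abs() stands in for the generated code. CDLL(None) loads
# the current process's symbols (a Linux/glibc assumption).
libc = ctypes.CDLL(None)
addr = ctypes.cast(libc.abs, ctypes.c_void_p).value   # raw address, an int
FUNC = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int)   # int (*)(int)
fn = FUNC(addr)                                       # address -> callable
print(fn(-5))  # -> 5
```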

Figuring out how reduce is supported was a little tricky -- the secret is that a zero-length step is used for the output and for one of the inputs. If the 1-D array function is written properly, no extra work is needed to support reduce. I took advantage of this anyway, though, and wrote a separate reduce loop that performs fewer memory accesses (and is therefore faster).
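The 1-D loop contract, and the zero-step reduce trick, can be sketched in plain Python (names are mine; the real loops are generated assembly working on raw pointers and byte strides):

```python
# Sketch of the ufunc 1-D inner-loop contract in plain Python: each
# array "pointer" advances by its own step on every iteration. Reduce
# falls out of the same loop when the output and one input use step 0.
def add_loop(in1, s1, in2, s2, out, so, n):
    i1 = i2 = io = 0
    for _ in range(n):
        out[io] = in1[i1] + in2[i2]
        i1 += s1
        i2 += s2
        io += so

# Elementwise add: all steps are 1.
a, b, out = [1, 2, 3], [10, 20, 30], [0, 0, 0]
add_loop(a, 1, b, 1, out, 1, 3)        # out -> [11, 22, 33]

# Reduce: output and first input alias the accumulator with step 0,
# so each iteration does acc[0] = acc[0] + data[i].
acc, data = [0], [1, 2, 3, 4]
add_loop(acc, 0, data, 1, acc, 0, 4)   # acc -> [10]
```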

I think this approach is the way to go. Multi-core support will need a C wrapper function to fork off several threads to do the computation. Something I might do is look at having persistent threads, to move the thread-creation overhead out of each ufunc call. The next step from here, though, is to build/extend a couple of existing ufuncs with CorePy routines for several different datatypes.
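The chunk-per-thread idea can be sketched in Python (all names here are mine, and Python's GIL makes this a poor stand-in for real speedup -- the actual wrapper would be C code handing each chunk's pointer and length to the generated machine code):

```python
import threading

# Hedged sketch of the multi-core wrapper idea: split the 1-D iteration
# space into one contiguous chunk per thread. Hypothetical names; the
# real wrapper would pass each chunk's pointer and length to the
# generated machine code rather than running a Python loop.
def threaded_add(a, b, out, nthreads=4):
    def worker(lo, hi):
        for i in range(lo, hi):
            out[i] = a[i] + b[i]

    n = len(a)
    chunk = (n + nthreads - 1) // nthreads   # ceil(n / nthreads)
    threads = [threading.Thread(target=worker,
                                args=(t * chunk, min((t + 1) * chunk, n)))
               for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

a = list(range(8))
b = [10] * 8
out = [0] * 8
threaded_add(a, b, out)
print(out)  # -> [10, 11, 12, 13, 14, 15, 16, 17]
```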

Code is committed in the SVN repo (r677) mentioned in a previous post, particularly in the 'corefunc' directory. It's a little messy right now, but the code there shows that the idea discussed here will work.

Sunday, May 10, 2009

I implemented a quick multiplication ufunc by copying my existing add ufunc and changing it to do multiply. I'm not using SSE here for dtype=int64 -- I couldn't find an SSE instruction that multiplies packed 64-bit integers. No big deal; the general-purpose multiply instruction gets the job done. With two ufuncs going, I spent some time developing and playing with a testing suite that covers both correctness and performance. The code is in my SVN branch. Performance output looks like this:

inp sizes 1024 10240 102400 1024000 10240000
add numpy 0.009 0.043 1.217 12.059 158.210
add corepy 1 0.050 0.100 1.312 11.653 171.419
add corepy 2 0.080 0.112 1.070 9.352 109.685
add corepy 4 0.179 0.206 1.053 9.355 98.217

add numpy 0.009 0.042 0.377 4.154 41.192
add corepy 1 0.049 0.057 0.212 3.901 37.618
add corepy 2 0.068 0.071 0.178 3.084 23.808
add corepy 4 0.155 0.156 0.255 2.950 22.188

inp sizes 1024 10240 102400 1024000 10240000
mul numpy 0.008 0.047 1.078 12.415 162.024
mul corepy 1 0.057 0.122 1.040 8.687 173.505
mul corepy 2 0.066 0.098 0.424 7.309 126.512
mul corepy 4 0.124 0.173 0.625 8.103 115.533

mul numpy 0.013 0.058 0.528 5.486 54.841
mul corepy 1 0.041 0.056 0.152 1.078 10.369
mul corepy 2 0.068 0.085 0.129 0.751 6.586
mul corepy 4 0.158 0.156 0.221 0.849 5.797

Times are in milliseconds, and are the average of 10 runs. For the CorePy cases, I'm running with 1/2/4 cores (basic parallelization over cores was pretty easy!). CorePy's overhead is visible, as mentioned in my previous post. There are some unexpected things in these numbers, too -- for example, multiplication is faster than addition for CorePy but not for NumPy. I don't know why; multiply instructions take longer to execute, and I am otherwise doing the same thing in both cases.

While testing ufuncs with various inputs (and input sizes) I came across some differences in behavior when values reach or exceed 2**63 or 2**64 (I'm using dtype=int64). This sent me off trying to understand what NumPy's behavior is. What I expected was for integers to wrap at 2**63 (signed) or 2**64 (unsigned), but what I see is this:

>>> x = numpy.array((2**62,) * 1)
>>> hex(numpy.add.reduce(x))
'0x4000000000000000'
>>> x = numpy.array((2**62,) * 2)
>>> hex(numpy.add.reduce(x))
'-0x8000000000000000'

Huh? Last time I checked, 4 + 4 = 8, not -8... am I missing something? Will have to check with the NumPy people to see what's going on.
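For reference, this is consistent with two's-complement wraparound done by hand: 2**62 + 2**62 is exactly 2**63, one past the largest signed 64-bit value, so the sum wraps to the most negative value -- the "-8" effect above. A sketch (helper name is mine):

```python
# Hand-rolled int64 wraparound (helper name is my own): keep the low 64
# bits, then reinterpret values at or above 2**63 as negative.
def wrap_int64(x):
    x &= (1 << 64) - 1                            # keep the low 64 bits
    return x - (1 << 64) if x >= (1 << 63) else x

# 2**62 + 2**62 == 2**63, which overflows signed 64-bit and goes negative:
print(hex(wrap_int64(2**62 + 2**62)))   # -> -0x8000000000000000
```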

Tuesday, May 5, 2009

I set up an SVN repository for my project here:

(edit: fixed link)

Not much there, though I did add a quick implementation of an 'add' ufunc I wrote back around Christmas and started playing with again to re-familiarize myself. When implementing anything in CorePy, one of the first things you need to do is figure out how to get a pointer to the raw data -- in this case, a pointer to a NumPy array object's underlying data. In Python, I found that I could do this:

params.p1 = a.__array_interface__['data'][0]

With this in hand, I wrote a quick 'add' ufunc just to get something working, along with a correctness test and performance benchmark. I'm finding that on the system I'm using (dual dual-core 2GHz Opterons), adding two 10,000-element arrays together averages about 0.07ms, while NumPy's ufunc averages about 0.04ms. Some playing around with different input sizes shows that the CorePy version catches up as the input size increases -- so we have some initial invocation overhead. Hmm, that pointer lookup seems kind of expensive: an attribute lookup, a string-keyed dictionary lookup, and another indexing operation.
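For comparison, NumPy also exposes the same raw pointer through the array's ctypes attribute (ndarray.ctypes.data is a documented NumPy attribute); a quick check that the two agree:

```python
import numpy

# The same raw data pointer is available two ways; verify they match.
a = numpy.arange(10, dtype=numpy.int64)
p1 = a.__array_interface__['data'][0]   # the lookup used above
p2 = a.ctypes.data                      # documented ctypes alternative
print(p1 == p2)  # -> True
```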

The first thing I tried was using a C module to grab the pointer from the NumPy array; the code looks something like this:

PyArrayObject *arr;
if (!PyArg_ParseTuple(args, "O!", &PyArray_Type, &arr))
    return NULL;
return PyInt_FromLong((long)arr->data);

This was faster than extracting the pointer directly from Python, but I shouldn't have to resort to writing a C module just to get a pointer out of a data structure! This is where I left off back around Christmas. Looking at this again, I realized I should be able to get at the pointer directly in assembly, too. Examining a disassembly of the above C code showed the 'data' pointer is 16 bytes from the start of the PyArrayObject struct. So I figured, if I could pass the address of the Python array object itself to the assembly code, I could extract the pointer there. Conveniently, id(foo) gives the address of object foo. Easy enough; I just pass that through and load from offset 16 in the assembly:

mov(rdi, MemRef(rdi, 16))

Now this is pretty crazy, and probably not the best idea in the world -- if the data pointer gets moved inside the array object, this code will be wrong. It works, though, and is a little faster: 0.06ms at 10,000 elements. So, a question for anyone reading -- is there a better/faster/more reliable way to get at the data pointer?

Another issue this has uncovered is the relative performance of the CorePy ufunc to the NumPy ufunc -- NumPy is faster for smaller array sizes, while CorePy eventually catches up and outperforms in the >10,000,000 element range. I'm out of time for now, but next I'll be looking at other sources of overhead, and consider other ways to improve performance.