Monday, August 3, 2009

While messing around with my cosine ufunc, I realized that NumPy's cosine performance on float32 arrays is horrible, while its float64 performance is more normal:

cfcos type 'numpy.float32'
inp sizes      1024     10240     102400    1024000    3072000
numpy        0.7241    9.3472   115.6120   995.1050  3027.2000
corepy 1     0.0191    0.1521     1.6742    16.3748    50.1912
corepy 2     0.0160    0.0811     0.8180     8.2700    25.5480
corepy 4     0.0219    0.0889     0.8180     4.1308    12.9910

cfcos type 'numpy.float64'
inp sizes      1024     10240     102400    1024000    3072000
numpy        0.0970    0.9248     9.3892    93.8919   277.4410
corepy 1     0.0370    0.3450     3.4950    35.4860   106.4019
corepy 2     0.0319    0.1729     1.7591    17.8859    53.5469
corepy 4     0.0288    0.0958     0.8941     8.9951    26.8750

I started a thread on numpy-discussion, and it's gotten a bit of attention. :) I'm not sure yet what causes this; I still have some things to try out. See the thread for details:

The main thing is that my float32 cosine wasn't so fast after all -- NumPy is just really slow, at least on some installations (including mine). Some other users weren't able to reproduce the slowness.
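For anyone who wants to check whether their own installation shows the float32 slowdown, here is a minimal timing sketch. It is not the benchmark script used for the tables above; the helper name `time_cos`, the input range, and the choice to report milliseconds are all assumptions made for illustration.

```python
import timeit

import numpy as np


def time_cos(dtype, size, repeats=5):
    """Best-of-`repeats` wall-clock time, in milliseconds, for np.cos on `size` elements.

    Hypothetical helper for illustration; not part of NumPy or CorePy.
    """
    x = np.linspace(0.0, 2.0 * np.pi, size).astype(dtype)
    out = np.empty_like(x)  # preallocate so allocation is not part of the timing
    best = min(timeit.repeat(lambda: np.cos(x, out=out), number=1, repeat=repeats))
    return best * 1e3


if __name__ == "__main__":
    sizes = (1024, 10240, 102400, 1024000)
    for dtype in (np.float32, np.float64):
        times = [time_cos(dtype, n) for n in sizes]
        print(dtype.__name__, " ".join(f"{t:10.4f}" for t in times))
```

If the float32 row comes out several times slower than the float64 row, the installation probably has the same problem; if the two rows are comparable, it does not reproduce.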

I've implemented a few different ufuncs now, so I figure it's worth summarizing them and the types they work over:

add: int{8,16,32,64}, float{32,64}
multiply: int{8,16,32,64}, float{32,64}
cosine (faster, less accurate version): float{32,64}
cosine (using x87, slower but accurate): float{32,64}
maximum: int{32,64}, float{32,64}

All of these are written using the framework I've developed, of course. While running my test code for this post, I found that the x87-based cosine segfaults at 4 threads/cores. It wasn't doing this earlier; I'll have to investigate.

Maximum was fun to do, particularly for int32. I spent time iterating on the code and came up with something pretty creative: I use SSE compare instructions to build a selection mask, then use bitwise operations with that mask to choose the maximum values without a branch. All SIMD, too. Unfortunately I don't have the instructions to do this for int64 (no quadword compares), but I can/will do the same thing for int8/int16. For the float types, SSE has a specific instruction for selecting the maximum values from two vectors, so I just used that. Very easy to take advantage of the hardware. :)
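The mask-and-select idea can be sketched at the NumPy level. This is only an emulation of the technique, not the assembly from the framework; the function name `branchless_max_int32` is hypothetical, and the SSE2 instructions named in the comments (PCMPGTD, PAND, PANDN, POR) are my guess at what would play each role, since the post does not name the instructions used.

```python
import numpy as np


def branchless_max_int32(a, b):
    """Element-wise maximum of two int32 arrays using a compare mask, no branches.

    Hypothetical helper that mimics the SIMD approach in plain NumPy.
    """
    a = np.asarray(a, dtype=np.int32)
    b = np.asarray(b, dtype=np.int32)
    # Compare step (PCMPGTD-like): each lane becomes all ones (-1) where
    # a > b and all zeros otherwise.
    mask = -(a > b).astype(np.int32)
    # Select step (PAND / PANDN / POR-like): keep a where the mask is set,
    # keep b where it is clear, then merge the two halves.
    return (a & mask) | (b & ~mask)


if __name__ == "__main__":
    x = np.array([3, -7, 0, 2147483647], dtype=np.int32)
    y = np.array([5, -9, 0, -2147483648], dtype=np.int32)
    print(branchless_max_int32(x, y))
```

Because the mask is either all ones or all zeros per element, the AND/OR combination picks exactly one of the two inputs, which is why no branch is needed.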

Future work: I'm planning on doing a couple more ufuncs, particularly one logical comparison and one bitwise (AND, OR) operation. The idea is to build up a collection of different types of ufuncs to show that the framework works and can be used to develop more.

After that I'm going to look at using the framework to build more complex, custom ufuncs. Initially just something like fused multiply-add, and then maybe some entire algorithms (particle simulation?) in a single ufunc. The idea is that we can use the framework to do any sort of computation, reducing memory traffic, leveraging CPU functionality, and going faster without much trouble.
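To illustrate the memory-traffic argument, here is a rough NumPy-only sketch of a multiply-add. It does not use the framework, and both function names and the `block` size are assumptions for illustration. The first version makes two full passes over memory with a full-size temporary; the second works in small blocks so the intermediate product is likely still in cache when the add reads it, which approximates what a single fused ufunc would do in one pass.

```python
import numpy as np


def multiply_add_unfused(a, b, c):
    """Compute a * b + c as two separate ufunc calls (two passes over memory)."""
    tmp = np.multiply(a, b)  # pass 1: writes a full-size temporary
    return np.add(tmp, c)    # pass 2: reads the temporary back in


def multiply_add_blocked(a, b, c, block=4096):
    """Compute a * b + c block by block for 1-D arrays of the same dtype.

    Hypothetical stand-in for a fused ufunc; `block` is an arbitrary size
    chosen to be small enough to stay cache-resident.
    """
    out = np.empty_like(a)
    for start in range(0, a.size, block):
        s = slice(start, start + block)
        np.multiply(a[s], b[s], out=out[s])  # product written into the output block
        np.add(out[s], c[s], out=out[s])     # add applied in place on the same block
    return out


if __name__ == "__main__":
    n = 1_000_000
    a, b, c = (np.random.rand(n) for _ in range(3))
    assert np.array_equal(multiply_add_unfused(a, b, c), multiply_add_blocked(a, b, c))
```

A real fused ufunc would go further by keeping the product in registers and never storing it at all; the blocked loop here only approximates that benefit.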


  1. Nice find! I'm interested in looking at your SSE code, is it in a public repo somewhere?

    Keep up the good work :)

  2. Sorry, I just now noticed your comment!

    My SVN repo for the project is here:

    The ASM code is in corefunc/; the framework code is at the top and the code specific to each ufunc follows that.