Sunday, May 24, 2009

I've been working in two areas. First, I've been filling out the 'corefunc' approach, which uses the ufunc C API to build my own ufuncs. I've added support for multiple cores (1, 2, and 4 threads) and have an add ufunc that supports 64-bit integers and 32-bit floats. Performance with the code I'm currently using looks like this:


cfadd numpy.int64
input size     1024    10240   102400  1024000  3072000  4096000  5120000  6144000
numpy        0.0115   0.0443   1.3969  14.4416  42.4222  55.5169 101.4473 118.8185
corepy 1     0.0103   0.0608   0.9395   9.7391  29.0794  37.4023 109.7189 132.7189
corepy 2     0.0342   0.0810   0.6061   8.7033  25.1321  33.2698  85.4791 103.1184
corepy 4     0.0584   0.0903   0.6499   8.8539  26.3412  34.7378  72.2230  86.9815

REDUCE
numpy        0.0094   0.0421   0.4188   5.5379  16.5510  22.0293  27.4446  32.9556
corepy 1     0.0080   0.0180   0.1657   4.0605  12.1317  16.0499  20.0882  24.0584
corepy 2     0.0304   0.0363   0.1621   2.7398   7.0745   9.3564  11.6919  13.9961
corepy 4     0.0566   0.0601   0.1775   3.0038   7.5964   9.3019  11.8332  13.2176

cfadd numpy.float32
input size     1024    10240   102400  1024000  3072000  4096000  5120000  6144000
numpy        0.0089   0.0404   0.5459   8.8722  26.5132  24.5601  31.0621  36.2360
corepy 1     0.0085   0.0330   0.3583   4.7290  14.0452  17.5809  21.9987  24.8356
corepy 2     0.0327   0.0436   0.2639   4.4774  12.9543  16.7708  21.0962  24.7302
corepy 4     0.0550   0.0664   0.3122   4.4057  13.4197  17.8114  22.0150  25.7672

REDUCE
numpy        0.0120   0.0695   0.6778   6.9267  20.7328  27.6145  34.3973  41.3703
corepy 1     0.0080   0.0263   0.2130   2.2117   6.5741   8.7050  10.8356  12.9950
corepy 2     0.0334   0.0403   0.1589   1.9634   5.3451   6.5962   7.1621   8.7798
corepy 4     0.0550   0.0659   0.1835   1.8873   4.3171   5.9614   7.2393   7.9173


(I rediscovered the 'pre' HTML tag, hooray!)
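
For what it's worth, here's a minimal sketch of the kind of timing loop behind numbers like these. The corefunc module name and its add ufunc are placeholders of mine, not necessarily what the real CorePy-generated code exposes:

import timeit
import numpy as np

def bench(fn, a, b, reps=10):
    # Average wall-clock time of one fn(a, b) call over reps calls.
    return timeit.timeit(lambda: fn(a, b), number=reps) / reps

for n in (1024, 10240, 102400, 1024000):
    a = np.arange(n, dtype=np.int64)
    b = np.arange(n, dtype=np.int64)
    print('%8d  numpy add: %f s' % (n, bench(np.add, a, b)))
    # With the CorePy-generated ufunc imported, the corresponding call
    # would be something like bench(corefunc.add, a, b).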

Nice speedups in some places, but others aren't so great (the smallest and largest arrays). On the small side my guess is that my per-call overhead is too high; on the large side I think the arrays fall out of cache and I'm hitting the memory-bandwidth wall. I also do better on floats for some reason -- maybe because 32-bit elements need half the memory traffic of 64-bit elements? Need to do more testing.
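
A quick back-of-the-envelope on working-set size supports the cache guess. A binary ufunc touches two input arrays plus one output, so:

import numpy as np

for n in (1024, 102400, 6144000):
    # two inputs + one output, in MiB
    mib64 = n * np.dtype(np.int64).itemsize * 3 / float(2 ** 20)
    mib32 = n * np.dtype(np.float32).itemsize * 3 / float(2 ** 20)
    print('%8d elements: int64 %7.2f MiB, float32 %7.2f MiB' % (n, mib64, mib32))

The smallest arrays fit entirely in cache, the 102400-element case is around the size of a typical L2, and the largest arrays are pure memory traffic -- and float32 halves that traffic relative to int64, which matches the float add doing better.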

Following a little discussion on the numpy-discussion list, I tried a cos ufunc. The idea is that cos is much more computationally intensive than something like addition, so memory bandwidth isn't necessarily the performance limit. I used the SSE code from here:

http://www.devmaster.net/forums/showthread.php?t=5784

And added a little code to reduce arguments outside the range 0-2π back into it (a sketch of that range reduction follows the table below). After some debugging I believe my code is at least somewhat correct, and I can benchmark it:


cfcos
input size     1024    10240   102400   1024000   3072000
numpy        0.7228   9.5713 117.2879 1019.7153 3099.1759
corepy 1     0.0144   0.0898   0.9272    9.6835   29.0826
corepy 2     0.0363   0.0980   0.6339    5.6012   15.2627
corepy 4     0.0615   0.1243   0.6605    4.7141   14.1573
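
Here's a scalar NumPy sketch of the range-reduction step, just to show the idea; the real version is SSE generated through CorePy, and it may reduce into a different interval, so details differ:

import numpy as np

TWO_PI = 2.0 * np.pi

def reduce_range(x):
    # Wrap arbitrary angles into [0, 2*pi) so the fast cos kernel
    # only ever sees arguments in the range it was built for.
    return x - np.floor(x / TWO_PI) * TWO_PI

# Sanity check: cos should be unchanged by the reduction.
x = np.random.uniform(-100.0, 100.0, 1024).astype(np.float32)
print(np.allclose(np.cos(reduce_range(x)), np.cos(x), atol=1e-4))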


A 100x speedup (200x at 2 threads), nice :) This is almost too good to be true though; something can't be right -- with multiple cores on larger arrays, cos performance is approaching that of my addition ufunc. That could sort of make sense if the CPU is fast enough that even cos becomes memory-bound, though? I'm going to spend some time understanding my code better and making sure it's correct, and figure out what exactly numpy is doing that takes so long. I wonder if the cos is computed in Python instead of using a C routine or the x87 cos instruction?
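
One quick experiment for the numpy side of that question (just a sketch, nothing corefunc-specific): compare numpy.cos against float64 input and against an explicit Python loop over the C library's cos, to see where the time goes.

import math
import timeit
import numpy as np

x32 = np.random.uniform(0.0, 2.0 * math.pi, 1024000).astype(np.float32)
x64 = x32.astype(np.float64)
xs = x32.tolist()

t = lambda f: timeit.timeit(f, number=5) / 5
print('numpy.cos float32: %f s' % t(lambda: np.cos(x32)))
print('numpy.cos float64: %f s' % t(lambda: np.cos(x64)))
# One libm cos per element, with all of Python's loop overhead on top;
# if even this beats numpy's float32 loop, numpy is doing something odd.
print('math.cos loop:     %f s' % t(lambda: [math.cos(v) for v in xs]))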
