Implementing NumPy's ufuncs using CorePy

I implemented a quick multiplication ufunc by copying my existing add ufunc and changing it to do multiply. I'm not using SSE here for dtype=int64; I'm not seeing an instruction to multiply 64-bit integers. No big deal, the general-purpose multiply instruction gets the job done. With two ufuncs going I spent some time developing and playing with a testing suite that covers both correctness and performance. The code is in my SVN branch. Performance output looks like this:

inp sizes 1024 10240 102400 1024000 10240000
add numpy 0.009 0.043 1.217 12.059 158.210
add corepy 1 0.050 0.100 1.312 11.653 171.419
add corepy 2 0.080 0.112 1.070 9.352 109.685
add corepy 4 0.179 0.206 1.053 9.355 98.217

REDUCE
add numpy 0.009 0.042 0.377 4.154 41.192
add corepy 1 0.049 0.057 0.212 3.901 37.618
add corepy 2 0.068 0.071 0.178 3.084 23.808
add corepy 4 0.155 0.156 0.255 2.950 22.188

inp sizes 1024 10240 102400 1024000 10240000
mul numpy 0.008 0.047 1.078 12.415 162.024
mul corepy 1 0.057 0.122 1.040 8.687 173.505
mul corepy 2 0.066 0.098 0.424 7.309 126.512
mul corepy 4 0.124 0.173 0.625 8.103 115.533

REDUCE
mul numpy 0.013 0.058 0.528 5.486 54.841
mul corepy 1 0.041 0.056 0.152 1.078 10.369
mul corepy 2 0.068 0.085 0.129 0.751 6.586
mul corepy 4 0.158 0.156 0.221 0.849 5.797

Times are in milliseconds, and are average of 10 runs. For the CorePy cases, I'm running with 1/2/4 cores (basic parallelization over cores was pretty easy!). CorePy's overhead is visible, as mentioned in my previous post. There's some unexpected things in these numbers too -- for example, multiplication is faster than addition for CorePy but not NumPy. I dont know why; multiply instructions take longer to execute, and I am otherwise doing the same thing in both cases.

While testing ufuncs with various inputs (and input sizes) I came across some differences in behavior when values reach/exceed 2**63/64 (I'm using dtype=int64). This sent me off trying to understand what NumPy's behavior is. What I expected was for integers to wrap at 2**63 (signed) or 2**64 (unsigned), but what I see is this:

>>> x = numpy.array((2**62,) * 1)
>>> hex(numpy.add.reduce(x))
'0x4000000000000000L'
>>> x = numpy.array((2**62,) * 2)
>>> hex(numpy.add.reduce(x))
'-0x8000000000000000L'

Huh? Last time I checked, 4 + 4 = 8, not -8... am I missing something? Will have to check with the NumPy people to see what's going on.

Sunday, May 10, 2009

No comments: