(edit: fixed link)
Not much there, though I did add a quick implementation of an 'add' func I wrote back around Christmas and started playing with again to re-familiarize myself. When implementing anything in CorePy, one of the first things you need to do is figure out how to get a pointer to the raw data -- in this case, a pointer to a NumPy array object's underlying data. In Python, I found that I could do this:
params.p1 = a.__array_interface__['data'][0]
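For anyone trying this at home, here is a minimal sketch of that lookup on its own. The helper name `data_pointer` is made up for illustration and is not part of CorePy or NumPy. The 'data' entry of `__array_interface__` is a two-element tuple of (address, read-only flag), which is why the extra `[0]` index is needed. The sketch also cross-checks the result against NumPy's `ctypes.data` attribute, which reports the same address.

```python
import numpy as np

def data_pointer(a):
    """Return the address of a's first element as a plain Python int.

    Hypothetical helper for illustration only.
    """
    # 'data' holds an (address, read_only) tuple, not a bare integer.
    address, read_only = a.__array_interface__['data']
    return address

a = np.arange(10000, dtype=np.float64)
ptr = data_pointer(a)

# NumPy exposes the same address through the ctypes attribute.
print(ptr == a.ctypes.data)

# A slice that starts one element in points one itemsize further along.
print(data_pointer(a[1:]) - ptr == a.itemsize)
```

Note that the address is only meaningful while the array is alive, so the caller has to keep a reference to `a` for as long as the pointer is in use.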
With this in hand, I wrote a quick 'add' ufunc just to get something working, along with a correctness test and performance benchmark. I'm finding that on the system I'm using (dual dual-core 2 GHz Opteron), adding two 10,000 element arrays together averages about 0.07 ms, while NumPy's ufunc averages about 0.04 ms. Some playing around with different input sizes shows that the CorePy version catches up as the input size increases -- so we have some initial invocation overhead. Hmm, that pointer lookup seems kind of expensive... a member lookup, a string-based dictionary reference, and another index into the resulting tuple.
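One rough way to see how much of that overhead is the pointer lookup itself is to time the pieces separately. The sketch below is an assumption about how one might measure it, not the benchmark used for the numbers above. The names `per_call_us` and `lookup_costs` are hypothetical, and the results will vary a lot by machine and NumPy version.

```python
import timeit
import numpy as np

a = np.arange(10000, dtype=np.float64)

def per_call_us(stmt, number=20000):
    """Average cost of one evaluation of stmt, in microseconds."""
    total = timeit.timeit(stmt, globals={"a": a}, number=number)
    return total / number * 1e6

# Time the pointer lookup, a bare id() call, and NumPy's own add
# on the same 10,000 element array for comparison.
lookup_costs = {
    "array_interface": per_call_us("a.__array_interface__['data'][0]"),
    "id": per_call_us("id(a)"),
    "numpy_add": per_call_us("a + a"),
}

for name, cost in lookup_costs.items():
    print(f"{name:>16}: {cost:.3f} us per call")
```

Comparing the lookup cost against the total time for the add gives a feel for how much of the gap at small sizes is fixed per-call overhead rather than the inner loop.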
The first thing I tried was using a C module to grab the pointer from the NumPy array; the code looks something like this:
PyArg_ParseTuple(args, "O!", &PyArray_Type, &arr);
This was faster than extracting the pointer directly from Python, but I shouldn't have to resort to writing a C module just to get a pointer out of a data structure! This is where I left off back around Christmas. Looking at this again, I realized I should also be able to get at the pointer inside the array object directly in assembly. Examining a disassembly of the above C code showed the 'data' pointer is 16 bytes from the start of the PyArrayObject struct. So I figured, if I could pass the address of the Python array object itself to the assembly code, I could extract the pointer there. Conveniently, id(foo) gives the address of object foo (in CPython, at least). Easy enough, I just pass that through and do this in the assembly:
x86.mov(rdi, MemRef(rdi, 16))
Now this is pretty crazy, and probably not the best idea in the world -- if the data pointer gets moved inside the array object, this code will be wrong. This works though, and is a little faster: 0.06 ms at 10,000 elements. So a question for anyone reading -- is there a better/faster/more reliable way to get at the data pointer?
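Since the hard-coded offset is the fragile part, one option is to sanity-check it from Python before trusting the assembly. The sketch below reads the same field with ctypes and compares it against the address NumPy itself reports. It assumes CPython (where id() is the object's address) and that the `data` field directly follows the object header, which is what the disassembly above suggests. The names `DATA_OFFSET`, `data_pointer_via_id`, and `offset_is_valid` are hypothetical.

```python
import ctypes
import numpy as np

# Assumption: the array's `data` field sits immediately after the
# PyObject header. object.__basicsize__ is the size of that header,
# which is 16 on typical 64-bit CPython builds.
DATA_OFFSET = object.__basicsize__

def data_pointer_via_id(a):
    """Read the data pointer straight out of the array object's memory."""
    return ctypes.c_void_p.from_address(id(a) + DATA_OFFSET).value

def offset_is_valid():
    """Check the assumed offset against the address NumPy reports."""
    probe = np.zeros(4)
    return data_pointer_via_id(probe) == probe.ctypes.data

a = np.arange(10000, dtype=np.float64)
if offset_is_valid():
    print(hex(data_pointer_via_id(a)))
else:
    print("layout differs; fall back to __array_interface__")
```

Running a check like this once at start-up would at least turn a silent wrong-pointer bug into a visible fallback if the struct layout ever changes.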
Another issue this has uncovered is the relative performance of the CorePy ufunc to the NumPy ufunc -- NumPy is faster for smaller array sizes, while CorePy eventually catches up and outperforms it in the >10,000,000 element range. I'm out of time for now, but next I'll be looking at other sources of overhead and considering other ways to improve performance.