I spent some time reading the NumPy book and reference manual. Initially I was trying to find information on how the existing ufuncs are implemented, but I ended up just reading up on NumPy arrays and ufuncs in general.
Looking at the C API, I found a couple interesting functions. PyUFunc_FromFuncAndData is used to create new ufunc objects, while PyUFunc_RegisterLoopForType modifies an existing ufunc to use a provided function for operating on a particular data type. The latter is certainly useful; I can override portions of existing ufuncs with my own implementation as I progress. The functions that must be provided for each data type take one or two (or more) 1-D arrays as input, and output one 1-D array; the ufunc framework takes care of extending this to more complicated array shapes.
Finding this made me realize another approach to implementing ufuncs -- rather than building new ufunc objects from the ground up (like I have before), I can use this API. The trick is connecting CorePy code generated in Python to some C code that calls these functions to set up some ufuncs. Internally, CorePy creates an array containing the machine code representing the generated code. One of the CorePy runtime modules (written in C) executes the machine code by casting the array to a function pointer and calling it. How convenient.. the Ufunc C API requires a function pointer implementing the ufunc's operation over a 1-D array. So I made a proof-of-concept C module containing one method. This method takes an address to an array of machine code as input, and uses that as a function pointer to create a new ufunc supporting addition over integers. Sure enough, this works pretty well; my quick implementation performs roughly equivalent to the exising add ufunc.
Figuring out how reduce is supported was a little tricky -- the secret is that a 0-length step is used for the output, and one input. If the 1-D array function is written properly, no extra work is needed to support reduce. I took advantage of this though, and wrote a separate loop that reduces the number of memory accesses (therefore being faster).
I think this approach is the way to go. Multi-core support will need a C wrapper function to fork off several threads to do the compution. Something I might do is look at having persistent threads to move the thread-creation overhead out of each ufunc call. Next step from where I am now though, is to build/extend a couple existing ufuncs with CorePy routines for several different datatypes.
Code is comitted in the SVN repo (r677) mentioned in a previous post; particularly in the 'corefunc' directory. Little messy right now, but the code in the shows the idea discussed here will work.