Another two weeks worth of stuff included in this post.
I implemented support for the int8 and int16 types for add/multiply. The framework had to be extended somewhat. Sounds like no big deal, but the 8bit stuff uncovered some bugs that took a while to fix -- mainly due to the fact that I'm unrolling 16/32 elements worth of operations in the main loop, which is more than the 4/8 I had been doing until now.
Finally went and reworked the multicore support to do things properly. I split the work into double-quadword (16 byte) sized blocks. This is the size of a single SIMD vector on x86, and it is important to work on 16-byte aligned data for better performance. So what I wanted to do was make sure each thread's sub-range to work on was 16-byte aligned regardless of datatype and total array length. Getting all the math right was pretty tricky! I think (hope :)) I got it all correct now. The last thread tends to get extra work when the array length rounded to a 16-byte block times the number of threads -- I think this is OK though, the extra work is never more than num_threads - 1 16-byte blocks. How many elements this is depends on the size of the data type -- a 16-byte block contains only two int64/float64s, or 16 int8s.
Something I've also wanted to do for a while was set up persistent threads and farm out work to them, instead of creating/joining threads in every ufunc call. This reduces the overhead incurred by using multiple threads in a ufunc, and is visible for smaller array sizes (< ~10-100,000 elements). However I'm not sure this is worthwhile, as the multi-thread performance on these smaller array lengths is still worse than just using a single core. I think I need to investigate the overheads more.
I have also generalized the framework to support single-input ufuncs. This was actually easier than I expected, and ended up doing most of the work needed to support > 2 inputs as well. All that remains for >2 input support is register assignment (easy) and sorting out whether reduction makes sense, and how it should be done (probably not too hard).
I revived the cosine ufunc, which had bit-rotted some, and updated it to use the new single-input support in the framework. My testing harness fails the cosine ufunc, but hand inspection shows the results are correct if you tolerate some rounding error. More investigation needed here.
Side note -- I discovered AMD's Software Optimization Guide for AMD64 Processors -- looks like it has some good information in it, including actual instruction latencies! Find it here: