Monday, June 22, 2009

This past week I wrote code for the int32 and float64 types, giving me an addition ufunc that supports 32- and 64-bit ints and floats. The code for each case is pretty long: every datatype duplicates the code that loads the function arguments, checks for special cases like reduction and lengths that don't divide evenly into the SIMD parallelization/loop unrolling, and so on. So I started factoring this out and developing the beginnings of a framework for writing CorePy-based ufuncs with minimal effort and code duplication.

I succeeded in isolating most of the duplicated code, though the 'generic' code generation is a bit more complicated than before -- especially the specialized reduction code. This still needs some polish; I ended up with two different approaches for factoring out the operation-specific code, one for general element-wise operations and one for reductions. I will probably standardize on the approach I used for reductions -- the generic code takes a function (in practice I can even pass a single instruction) that accepts two registers as arguments (accumulator and source element) and generates the reduction code for a single element.
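
To sketch that callback idea in Python (the names here are illustrative, not the actual framework code, though appending instruction objects to an InstructionStream with code.add() is the usual CorePy idiom):

    import corepy.arch.x86_64.isa as x86

    # Operation-specific piece: reduce one source element into the
    # accumulator register. For addition this really is one instruction.
    def gen_add(code, acc, elem):
        code.add(x86.addps(acc, elem))

    # Generic piece (heavily simplified): the framework emits the loads,
    # loop control, and remainder handling, and invokes the callback
    # wherever the operation itself belongs.
    def gen_reduction(code, gen_op, acc, elem):
        # ... generic setup and loop entry ...
        gen_op(code, acc, elem)
        # ... generic loop control and store ...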

I ran into a few problems while working on this stuff; I have no idea how I managed to miss these bugs until now. First, applying my ufunc to a single-element array gave bad results -- the test I was using to detect reduction was also catching the single-element case, and handling it incorrectly.
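
The shape of the fix looks something like this (hypothetical Python with made-up names -- the real test happens in the generated code):

    def is_reduction(out_stride, length):
        # A zero output stride normally signals a reduction, but a
        # single-element call can look the same way, so require more
        # than one element as well.
        return out_stride == 0 and length > 1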

The other issue I've run into is due to the way I'm splitting work among multiple threads. I just did the easiest thing possible (halfway expecting to do something more elaborate later): divide the length evenly among the threads, giving the last thread the extra work when the number of threads does not divide the length evenly. The problem is that individual threads can end up with work segments starting at unaligned addresses whenever a segment's length is not a multiple of 16 bytes. So this is the next thing I need to tackle -- I've been thinking about ways to split the work evenly while also rounding each work segment to a multiple of 16 bytes.
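
One way to do that rounding, as a Python sketch (assuming the array's base pointer is itself 16-byte aligned and the element size divides 16 evenly):

    def split_work(n_elems, elem_size, n_threads):
        block = 16 // elem_size  # elements per 16-byte block
        # Round each thread's share down to a whole number of blocks,
        # so every segment boundary stays 16-byte aligned.
        per_thread = (n_elems // n_threads) // block * block
        counts = [per_thread] * n_threads
        # The last thread absorbs the leftover elements.
        counts[-1] += n_elems - per_thread * n_threads
        return counts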

Sunday, June 14, 2009

Have been busy lately... big paper deadline (a CorePy paper!) and moving. When I can, I've been working on general code correctness/cleanliness/robustness. The management and movement of various data inside the 'corefunc' C module has been cleaned up. I now have a fairly generic routine for creating a ufunc from some type information and CorePy InstructionStreams. Multi-core is supported at the C level in a generic way; nothing there is ufunc-specific.

I didn't mention this before, but my code previously only worked correctly for input sizes that were a multiple of both the number of threads used (1, 2, 4) and the number of loop iterations that were unrolled (1, 4, 8). Now my code works for any input size. At the thread level, I just had the last thread (if there is more than one) check whether the number of threads evenly divides the input length; if not, that thread increases the number of elements it processes to compensate. At the CorePy loop level, a special code path was needed to handle both the case where the input length is less than the number of unrolled loop iterations and the case where the length is not a multiple of the unrolled iterations. Here I do some fancy (but somewhat messy) checks, and added a code path that handles one element at a time (as opposed to 4/8 at a time using SIMD).
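
In Python terms, the bookkeeping amounts to something like this (a sketch of the logic, not the generated assembly):

    def loop_counts(n_elems, n_threads, unroll):
        # Per-thread share; the last thread also takes the remainder.
        per_thread = n_elems // n_threads
        last_thread = per_thread + n_elems % n_threads
        # Within the last thread: full unrolled SIMD iterations first,
        # then a scalar tail that handles leftovers one at a time.
        simd_iters = last_thread // unroll
        scalar_tail = last_thread % unroll
        return per_thread, simd_iters, scalar_tail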

Turns out reduction wasn't working right when multiple threads were used, but this is fixed now (took a while! evil bugs). Each thread reduces (e.g., adds) all the elements in its assigned input sub-range into a local register. When it finishes, the local value is atomically reduced into the single shared output value residing in memory. CorePy's support for the LOCK prefix (which makes some instructions atomic) wasn't complete, so I had to go fix that first. I think I've now covered most, if not all, of the instructions that accept the LOCK prefix on x86_64.
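
The pattern, translated into plain Python with a lock standing in for the LOCK-prefixed instruction:

    import threading

    def parallel_reduce(data, n_threads):
        total = [0.0]
        lock = threading.Lock()

        def worker(chunk):
            local = 0.0
            for x in chunk:      # reduce the sub-range into a local value
                local += x
            with lock:           # atomic combine into the shared result
                total[0] += local

        step = (len(data) + n_threads - 1) // n_threads
        threads = [threading.Thread(target=worker,
                                    args=(data[i * step:(i + 1) * step],))
                   for i in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return total[0]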

This brings up sort of a side issue -- because floating-point addition is not associative, having multiple threads accumulate into their own local values independently changes the order of operations for reductions. This can produce a result that differs from NumPy's. I'm not sure what to do about this, if anything -- I don't see a way to keep thread-level parallelism while performing the operations in exactly the same order NumPy does. I think all I can do is document a warning about this issue when multiple threads are used for reductions.
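
A quick demonstration with plain Python floats:

    >>> (0.1 + 0.2) + 0.3
    0.6000000000000001
    >>> 0.1 + (0.2 + 0.3)
    0.6

Same values, different grouping, different result -- which is exactly what happens when threads accumulate locally and combine at the end.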

Sort of related to this GSOC work, I implemented ExtBuffer, an object implementing the Python 2.6 buffer interface (mainly for use with NumPy arrays). ExtBuffer provides guaranteed page-aligned memory, supports huge pages, and can optionally take a pointer to pre-allocated memory and use that (useful for overlaying special memory-mapped hardware regions, e.g., Cell SPU local store). Documentation here:

http://corepy.org/wiki/index.php?title=Extended_Array

I was actually anticipating that I would have to write ExtBuffer as part of this GSOC project. The issue is that SIMD code typically requires 16-byte memory alignment for best performance. To my knowledge, malloc() (which I understand NumPy arrays use to allocate their memory) only guarantees 8-byte alignment. However, I've just been using NumPy arrays as-is and have not (yet) run into any alignment issues. So I need to look into this and see whether I can really rely on NumPy array data always being at least 16-byte aligned.
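
Checking the alignment from Python is easy enough:

    import numpy

    arr = numpy.zeros(1024, dtype=numpy.float32)
    print(arr.ctypes.data % 16 == 0)  # True if the data is 16-byte aligned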

Where to next?

I really just have an addition ufunc with support for int64 and float32 types right now. This has sufficed for building the initial codebase, but now I need to expand it to support more data types and more operations. With that done, it'll be time to start refactoring duplicate code and designing/building more of a framework for building new ufuncs quickly and easily (my halfway milestone!).