Sunday, June 14, 2009

Have been busy lately.. big paper deadline (CorePy paper!) and moving. When I can, I've been working on general code correctness/cleanliness/robustness. Management and movement of various data inside the 'corefunc' C module has been cleaned up. I now have a fairly generic routine for creating a ufunc from some type information and CorePy InstructionStreams. Multi-core is supported at the C level in a generic way; nothing there is ufunc-specific.

I didn't mention this before, but my code previously only worked correctly for input sizes that were a multiple of the number of threads used (1, 2, 4) and the number of iterations that were unrolled (1, 4, 8). Now my code works for any size input. At the thread level, I just had the last thread (if more than one) check if the number of threads evenly divided the input length. If not, that last thread increased the number of elements it processed to compensate. At the CorePy loop level, a special code path was needed to handle both the case where the input length is less than the number of unrolled loop iterations, and when the length was not a multiple of the unrolled iterations. Here, I do some fancy (but somewhat messy) checks, and added unrolled code to handle one element at a time (as opposed to 4/8 at a time using SIMD).

Turns out reduction wasn't working right when multiple threads are used, but this is fixed now (took a while! evil bugs). Each thread reduces (eg. adds, subtracts) all the elements in its assigned input sub-range into a local register. When complete, the local value is atomically reduced into the single shared output value residing in memory. CorePy's support for the LOCK prefix (makes some instructions atomic) wasn't complete, so I had to go do that. I think I've covered most/all of the instructions supporting the LOCK prefix on x86_64.

This brings up sort of a side issue -- because floating point is not associative, the use of multiple threads accumulating to their own local values independently means the order of operations for reductions is changed. This causes a result that is different from NumPy's. I'm not sure what to do about this, if anything -- I don't think there is a way to maintain thread-level parallelism and perform the operations in the same order as NumPy. I think all I can do is provide a warning about this issue when using multiple threads to do reductions.

Sort of related to this GSOC work, I implemented ExtBuffer, an object implementing the Python 2.6 buffer interface (mainly for use with NumPy arrays). ExtBuffer provides guaranteed page-aligned memory, supports huge pages, and can optionally just take a pointer to some pre-allocated memory and use that (useful for overlaying special memory-mapped hardware regions, e.g. Cell SPU local store). Documentation here:

I was actually anticipating that I would have to write ExtBuffer as part of this GSOC project. The issue was that SIMD code typically requires 16-byte memory alignment for best performance. To my knowledge, malloc() (which I understand NumPy arrays use to alloc memory) only guarantees 8-byte alignment. However I've just been using NumPy arrays as-is and have not (yet) run into any alignment issues. So I need to look into this and see if I can now really rely on NumPy array data always being at least 16-byte aligned.

Where to next?

I really just have an addition ufunc with support for in64 and float32 types right now. This has sufficed for building the initial codebase, but now I need to expand this to support more data types and more operations. With that done, it'll be time to start refactoring duplicate code and designing/building more of a framework to build new ufuncs quickly/easily (my halfway milestone!)

No comments:

Post a Comment