Monday, June 22, 2009

This past week I wrote code for int32 and float64 types, giving me an addition ufunc that supports 32/64bit ints and floats. The code for each case is pretty long; each datatype duplicates code that loads the function arguments, checks for cases like reduction and lengths that don't evenly divide the SIMD parallelization/loop unrolling, etc. So I started factoring this out and developing the beginnings of a framework for writing CorePy-based ufuncs with minimal effort and code duplication.

I succeeded in isolating most of the duplicate code, though the 'generic' code generation is a bit more complicated than before -- especially the specialized reduction code. This needs some polish still; I used two different approaches for factoring out the operation-specific code for general element-wise operations and reductions. I will probably use the approach I did for reduction -- the generic code requires a function (I can actually pass a single instruction) that takes two registers as arguments (accumulator and source element) and generates reduction code for a single element.

I ran into a few problems while working on this stuff; I have no idea how I managed to miss these bugs until now. First, applying my ufunc to a single-element array gave bad results -- the test I was using to detect reduction was also catching this single-element case, and not working properly.

The other issue I've run into is due to the way I'm splitting work among multiple threads. I just did the easiest thing possible (halfway expecting to do something more elaborate later) -- divide the length evenly among each thread, giving the last thread additional work if the number of threads does not divide the length evenly. The problem with this is that individual threads will get work segments with unaligned memory if the work segment length is not a multiple of 16 bytes. So this problem is the next thing I need to tackle -- I've been thinking about ways to split the work evenly, while also rounding work segments to a multiple of 16 bytes.

No comments:

Post a Comment