元 GEN
Rogue Imagineer

2002-05-22

My AltiVec code is coming along great. My standard procedure for writing SIMD code is to create a duplicate code path that uses a simple but mathematically accurate scalar algorithm. Then I run the two paths side by side on the same input and compare the results, making sure that the SIMD code is doing what it is supposed to be doing. This approach works great, and is especially good at rooting out the off-by-one errors that show up when switching from 16.16 fixed point to float, where the rounding behaves differently.
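Here is a minimal sketch of that side-by-side check, written in plain C rather than AltiVec so it runs anywhere. Everything in it is an assumption made for illustration: the gain-scaling operation, the names `scale_ref`, `scale_fast` and `compare_paths`, and the `bias` parameter that switches the float path between rounding and truncation. It is not the actual product code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference path (hypothetical): scale an 8-bit sample by a 16.16
   fixed-point gain, round to nearest, clamp to 255. */
static uint8_t scale_ref(uint8_t p, uint32_t gain_16_16)
{
    assert(gain_16_16 < (1u << 24));   /* keeps the product inside 32 bits */
    uint32_t v = ((uint32_t)p * gain_16_16 + 0x8000u) >> 16;
    return (uint8_t)(v > 255u ? 255u : v);
}

/* Stand-in for the optimized path: float arithmetic, four samples per
   iteration, the way a 4-wide float vector unit would see the data.
   bias = 0.5f rounds to nearest; bias = 0.0f truncates. */
static void scale_fast(const uint8_t *src, uint8_t *dst, size_t n,
                       float gain, float bias)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        for (size_t k = 0; k < 4; ++k) {
            float v = (float)src[i + k] * gain + bias;
            dst[i + k] = (uint8_t)(v > 255.0f ? 255.0f : v);
        }
    }
}

/* Compare the fast output against the reference, sample by sample.
   Returns the number of mismatches; the index of the first one is
   stored in *first_bad when first_bad is not NULL. */
static size_t compare_paths(const uint8_t *src, const uint8_t *fast,
                            size_t n, uint32_t gain_16_16,
                            size_t *first_bad)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        if (scale_ref(src[i], gain_16_16) != fast[i]) {
            if (bad == 0 && first_bad != NULL)
                *first_bad = i;
            ++bad;
        }
    }
    return bad;
}
```

Feeding all 256 sample values through both paths with a gain of 1.5 (`0x18000` in 16.16) gives zero mismatches when `bias` is `0.5f`. With `bias` set to `0.0f` the float path truncates instead of rounding, and the check reports sample 1 as the first off-by-one.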

Even though my AltiVec code is working, I'm not sure the overall performance has increased that much. Next up, I'm going to write a simple function-timing routine. I suspect that one of the bottlenecks is the 'get frame' routine provided by QuickTime.
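A simple timing routine along those lines might look like the sketch below. It uses only the standard C `clock()` call. The names `seconds_per_call`, `timed_fn`, `fake_stage` and `fake_get_frame` are hypothetical, and `fake_get_frame` is a dummy workload standing in for the real frame fetch, not a QuickTime call.

```c
#include <time.h>

/* Any stage to be timed is wrapped as a function taking one context pointer. */
typedef void (*timed_fn)(void *ctx);

/* Call fn(ctx) `iterations` times and return the average CPU seconds per
   call. clock() is coarse, so use enough iterations to get a measurable
   total. Returns 0.0 when iterations is not positive. */
static double seconds_per_call(timed_fn fn, void *ctx, int iterations)
{
    if (iterations <= 0)
        return 0.0;
    clock_t start = clock();
    for (int i = 0; i < iterations; ++i)
        fn(ctx);
    clock_t end = clock();
    return (double)(end - start) / (double)CLOCKS_PER_SEC / iterations;
}

/* Hypothetical stand-in for a pipeline stage: counts its calls and burns
   a little CPU. The volatile sink keeps the loop from being optimized away. */
typedef struct {
    unsigned long calls;
    volatile unsigned long sink;
} fake_stage;

static void fake_get_frame(void *ctx)
{
    fake_stage *s = (fake_stage *)ctx;
    for (unsigned long i = 0; i < 1000; ++i)
        s->sink += i;
    s->calls++;
}
```

Timing each stage separately (frame fetch, the pixel routine, the display step) with the same wrapper shows which one dominates before any more effort goes into the SIMD code.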

By: 元

2002-05-21

Is well-supported open source software always the best choice?

Not in my case. I'm working on AltiVec optimizations right now in my Mac products, and I started out with a foray into Apple's sample code using the freely available Project Builder application, which is based on gcc. Maybe it's because I'm using a crappy, outmoded 400 MHz G4 Titanium, but that development environment is dawg-slow. After spending about 20 excruciating minutes with it, I switched back to CodeWarrior and was surprised at how much faster it is. Probably two orders of magnitude at least.

Speaking of SIMD optimizations, the time is ripe for some sort of proper language tool for expressing SIMD code. C language extensions are fine for now; I am looking forward to seeing how the compiler generates SIMD code. I hand-coded all the MMX routines in the Windows products in assembler. I can't see that it is that much faster to write code with the C extensions, since most of the time is spent trying to figure out the dataflow and which instructions to use. Also, doing simple things like shifting an entire vector register is very arcane in AltiVec; it seems like Intel did a better job in that arena, even as early as SSE. However, one great thing about AltiVec is that there are 32 registers. After squeezing code into the 8 MMX registers, it seems like a 'huge tract of land' to be working with. In my current routine, it looks like the compiler is squeezing it into 7 registers so far. Generally speaking, in order to avoid contention for registers and dependencies (like having to wait for the result of an operation, which stalls the pipeline), I try to code one or two algorithms in parallel. Usually this means processing two pixels or four pixels at once.
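To make the "two pixels at once" idea concrete, here is a rough scalar sketch in plain C. The blend operation and the names `blend_1` and `blend_2` are assumptions for illustration only. The point is the shape of the second loop: the two pixels are computed into separate temporaries, so neither calculation has to wait on the other's result.

```c
#include <stddef.h>
#include <stdint.h>

/* One pixel per iteration. alpha is 0..256; the result is rounded. */
static void blend_1(const uint8_t *a, const uint8_t *b, uint8_t *out,
                    size_t n, uint32_t alpha)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = (uint8_t)((a[i] * alpha + b[i] * (256u - alpha) + 128u) >> 8);
}

/* Two pixels per iteration. p0 and p1 do not depend on each other, so
   their multiplies and adds can overlap in the pipeline. */
static void blend_2(const uint8_t *a, const uint8_t *b, uint8_t *out,
                    size_t n, uint32_t alpha)
{
    const uint32_t inv = 256u - alpha;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint32_t p0 = a[i]     * alpha + b[i]     * inv + 128u;
        uint32_t p1 = a[i + 1] * alpha + b[i + 1] * inv + 128u;
        out[i]     = (uint8_t)(p0 >> 8);
        out[i + 1] = (uint8_t)(p1 >> 8);
    }
    if (i < n)  /* odd pixel left over at the end */
        out[i] = (uint8_t)((a[i] * alpha + b[i] * inv + 128u) >> 8);
}
```

Both functions produce identical output, which also makes `blend_1` usable as the reference path when checking `blend_2`. The cost of the interleaved form is register pressure, since it keeps twice as many temporaries alive, which is where a 32-register file helps.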

By: 元