Wow, thanks for MAC! Now I'm at 960 GFlops with 230 kHz sync. I do convolution most of the time, so that's the proper instruction.
(Gotta memorize that mad = mad+mac+madak+madmk. Even in my Mandelbrot example there's a spot for MAC.)
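Why MAC suits convolution: every output sample is a running sum of coefficient*sample products, so the inner step is exactly one multiply-accumulate. A rough Python sketch of that idea follows. The function and variable names are made up for illustration, and this is not the actual GPU kernel.

```python
def fir_mac(samples, coeffs):
    """Direct-form FIR convolution; each inner step is one multiply-accumulate."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                # The MAC step: acc = acc + c * x, accumulator is reused as destination.
                acc = acc + c * samples[n - k]
        out.append(acc)
    return out


print(fir_mac([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))
```

On the GPU the accumulator stays in one register, so no separate destination operand is needed, which is presumably what makes the MAC form cheaper to encode than a general MAD.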
I also tried it with workgroup_size=256, and it worked without any slowdown. So the long piano string problem is no longer a problem, as I can ensure that every 4 adjacent wavefronts use the same LDS memory.
Here's how it depends on speed: I have to fit the total number of string points into the whole card:
The whole instrument has 55K string points.
Total workitems: 10(cu)*8(wf)*64=5120
String points per workitem = 10.74 -> 11
Longest string = 11*64*4(wfs) = 2816 -> that's 2-3x more than actually needed, thanks to 256 workitems/workgroup
Iterations = 4096 (comes from 512 samples at 8x oversampling)
Maximum time = 10.67ms (512 samples @ 48kHz)
Estimated instruction count = 183 (based on actual HD6970 simulation, maybe I can optimize better)
Measured time with MACs = 10.52ms, sooooooo close! (That's 7.1ms on the HD6970, which is 2.72 TFlops instead of 1.12.)
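The budget arithmetic above can be re-run as a small Python sketch, which makes it easy to try other card sizes or buffer lengths. The constant names are made up for illustration, and 55K is taken as exactly 55000 here, which is an assumption.

```python
import math

CUS = 10                   # compute units on the card
WAVEFRONTS_PER_CU = 8      # wavefronts per CU
WAVEFRONT_SIZE = 64        # workitems per wavefront
WAVEFRONTS_PER_GROUP = 4   # workgroup_size=256 -> 4 wavefronts share one LDS
STRING_POINTS = 55_000     # whole instrument ("55K", assumed exact)
SAMPLES = 512              # samples per buffer
OVERSAMPLING = 8
SAMPLE_RATE_HZ = 48_000

total_workitems = CUS * WAVEFRONTS_PER_CU * WAVEFRONT_SIZE
# Round up: every string point has to live in some workitem.
points_per_workitem = math.ceil(STRING_POINTS / total_workitems)
# Longest string that still fits inside one workgroup.
longest_string = points_per_workitem * WAVEFRONT_SIZE * WAVEFRONTS_PER_GROUP
iterations = SAMPLES * OVERSAMPLING
# Real-time deadline: the buffer must be computed before it has to be played.
max_time_ms = SAMPLES / SAMPLE_RATE_HZ * 1000.0

print("total workitems:     ", total_workitems)
print("points per workitem: ", points_per_workitem)
print("longest string:      ", longest_string)
print("iterations:          ", iterations)
print("max time (ms):       ", round(max_time_ms, 2))
```

Anything measured below the last number means the card keeps up in real time for that buffer size.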