Oops, I made a mistake: I forgot to use GLC while checking the synchronization through the UAV.
So 8 wavefronts/CU is possible with GWS; beyond that it crashes.
Exec time in ms, by wavefronts per CU (w/CU):
w/CU   4   5   6   7   8
MAD   29  37  38  39  39
ADD   21  34  34  34  34
When I raised it from 6 to 8 wavefronts/CU, the MAD exec time only went up by 1 ms (from 38 to 39), so some sleeping units were woken up.
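A quick way to read the table is to turn it into wavefronts finished per millisecond. This is only a sketch, and it assumes every wavefront does the same fixed amount of work, which the numbers suggest but which I have not verified. The names `relative_throughput`, `mad_tp` and `add_tp` are just illustrative.

```python
# Exec times (ms) copied from the table above, one entry per w/CU setting.
waves = [4, 5, 6, 7, 8]
mad_ms = [29, 37, 38, 39, 39]
add_ms = [21, 34, 34, 34, 34]

def relative_throughput(wave_counts, times_ms):
    """Wavefronts completed per ms per CU.

    Assumption: each wavefront does the same amount of work, so
    more wavefronts in the same time means more useful work done.
    """
    return [w / t for w, t in zip(wave_counts, times_ms)]

mad_tp = relative_throughput(waves, mad_ms)
add_tp = relative_throughput(waves, add_ms)

for w, m, a in zip(waves, mad_tp, add_tp):
    print(f"{w} w/CU: MAD {m:.3f} waves/ms, ADD {a:.3f} waves/ms")

# Gain for MAD when going from 6 to 8 wavefronts per CU.
gain_6_to_8 = mad_tp[waves.index(8)] / mad_tp[waves.index(6)]
print(f"MAD gain 6 -> 8 w/CU: {gain_6_to_8:.2f}x")
```

Under that assumption, going from 6 to 8 wavefronts/CU gives MAD roughly 1.3x more work done for almost the same exec time.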
Now the GFlops/s I can get out of it is 838 (raised from 700; peak is 1126).
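To put those three numbers in proportion, here is a small sketch that computes how much of the peak is reached before and after the change. It uses only the figures quoted above, in whatever unit they are quoted in; the function name `utilisation` is just illustrative.

```python
# Throughput figures quoted above, all in the same unit.
peak = 1126.0
before = 700.0
after = 838.0

def utilisation(achieved, peak_value):
    """Fraction of the peak throughput that was actually reached."""
    return achieved / peak_value

util_before = utilisation(before, peak)
util_after = utilisation(after, peak)
speedup = after / before

print(f"before: {util_before:.1%} of peak")
print(f"after:  {util_after:.1%} of peak")
print(f"speedup: {speedup:.2f}x")
```

So the change moves it from roughly 62% to roughly 74% of peak, which still leaves about a quarter of the peak unused.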
(And this leads to a problem in the piano: faster processing means a shorter string length is given to each wavefront, and it is starting to reach the lengths of the bass strings. It will be a miracle if the whole thing fits into the HD7770... But if it fits, it sits.)
There is still room for MAD to be faster, but I think that can only happen when the CU has all 10 waves inside.
I didn't test workgroup sizes bigger than 64. A test of that would be interesting, though.