Please note that these small wave counts are for the smallest GCN chip, which has only 10 CUs, not 32.
With 40 threads it is possible to utilize all 640 stream processors, but without any latency hiding and only with simple instructions. GDS sync only works with 4 waves per CU; with more it deadlocks.
With 60 threads (thanks to ds_gws_barrier) it was possible to put 6 waves on every CU, which better tolerates the 'fat' instruction stream I'm planning to give them.
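To make the occupancy numbers concrete, here is a rough sketch of the arithmetic. It reads "threads" as wavefronts, which is what the 4 and 6 waves per CU figures imply. It also assumes the usual GCN layout of 4 SIMD units per CU with 16 lanes each; that layout is my assumption, not something stated above.

```python
# Figures from the post: 10 CUs, 40 or 60 waves in flight.
CUS = 10
# Assumption: standard GCN compute unit layout (4 SIMDs, 16 lanes each).
SIMDS_PER_CU = 4
LANES_PER_SIMD = 16

def occupancy(total_waves):
    """Return (waves per CU, waves per SIMD) for an even distribution."""
    waves_per_cu = total_waves / CUS
    waves_per_simd = waves_per_cu / SIMDS_PER_CU
    return waves_per_cu, waves_per_simd

# 10 CUs * 4 SIMDs * 16 lanes = 640 stream processors.
stream_processors = CUS * SIMDS_PER_CU * LANES_PER_SIMD

for waves in (40, 60):
    per_cu, per_simd = occupancy(waves)
    print(f"{waves} waves: {per_cu:g} per CU, {per_simd:g} per SIMD")
```

With 40 waves every SIMD has exactly one wave, so nothing else can run while that wave stalls. With 60 waves half of the SIMDs get a second wave to switch to, which is where the extra tolerance comes from.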
I measured 700 GFLOPS with MADs while syncing all the workitems at 220 kHz. This means a sync point after roughly every 400 v_mad_f32. On a 1126 GFLOPS card that's not bad.
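The figure of about 400 can be checked with a quick calculation. This sketch assumes the measurement was taken with the 60-wave configuration (64 workitems per wave) and counts one MAD as 2 flops; both are my assumptions, chosen because they reproduce the number quoted above.

```python
measured_flops = 700e9   # measured throughput, from the post
peak_flops = 1126e9      # card's peak, from the post
sync_rate_hz = 220e3     # global sync rate, from the post

# Assumptions: 60 waves of 64 workitems, 2 flops per v_mad_f32.
waves = 60
wave_size = 64
flops_per_mad = 2

workitems = waves * wave_size                       # 3840 workitems in flight
mads_per_second = measured_flops / flops_per_mad    # 350e9 MADs/s overall
mads_per_workitem_per_sync = mads_per_second / workitems / sync_rate_hz

efficiency = measured_flops / peak_flops

print(f"v_mad_f32 per workitem between syncs: {mads_per_workitem_per_sync:.0f}")
print(f"fraction of peak: {efficiency:.0%}")
```

Under these assumptions each workitem issues about 414 v_mad_f32 between sync points, and the card runs at about 62% of its peak.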
There's also a noticeable kernel launch overhead: I have to launch 100 kernels every second because it has to be interactive.