With 60 threads (thanks for ds_gws_barrier) it was possible to put 6 waves into every CU, and this better tolerates the 'fat' instruction stream I'm planning to give them.
Thanks for the data.
Agreed, fat instructions do better once you go past a full house (4 waves/CU). Before GCN, all instructions were fat.
For wave barriers (gws), I often use 8 waves/CU and I have not seen a problem. That's GCN's sweet spot for computation (ignoring latency). However, as himanshu points out, it's down to luck when kernels are issued. When I use 8 waves/CU, I almost always use 256 work items/group, only two groups/CU. Now I wonder if that makes a difference.
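
For reference, the arithmetic behind that 8 waves/CU configuration can be sketched in a few lines of Python. This is only a back-of-the-envelope check, assuming the usual GCN figures of 64 work items per wavefront and 4 SIMDs per CU; the function names are made up for illustration and are not part of any API.

```python
WAVE_SIZE = 64      # work items per GCN wavefront (assumed)
SIMDS_PER_CU = 4    # SIMD units per GCN compute unit (assumed)


def waves_per_group(work_items_per_group: int) -> int:
    """Wavefronts needed for one work group (hypothetical helper).

    Rounds up, since a partially filled wavefront still takes a full slot.
    """
    return -(-work_items_per_group // WAVE_SIZE)


def waves_per_cu(work_items_per_group: int, groups_per_cu: int) -> int:
    """Total wavefronts resident on one CU for a given group layout."""
    return waves_per_group(work_items_per_group) * groups_per_cu


def waves_per_simd(total_waves_per_cu: int) -> float:
    """Average wavefronts per SIMD, assuming an even spread across SIMDs."""
    return total_waves_per_cu / SIMDS_PER_CU


if __name__ == "__main__":
    total = waves_per_cu(256, 2)          # 256 work items/group, 2 groups/CU
    print(total, waves_per_simd(total))   # prints: 8 2.0
```

So 256 work items/group with two groups/CU is 4 waves per group, 8 waves/CU, or 2 waves per SIMD on average. The same 8 waves/CU could also come from eight 64-item groups, which is the layout difference being wondered about above.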