On Windows, global sync was smooth until I did something like moving the window around while the kernels were running. I figured it was partitioning CUs between compute and rendering or something.
Btw, why only 40 waves? It could run up to 400 waves, depending on the VGPR usage.
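
To show what I mean by "depending on the VGPR usage", here is a rough back-of-the-envelope sketch. It assumes GCN-style limits (4 SIMDs per CU, a budget of 256 VGPRs per lane on each SIMD, at most 10 waves per SIMD, VGPRs allocated in blocks of 4); check those against your own hardware. The 10-CU count is purely hypothetical, picked only so the ceiling comes out at 400, and the function name is mine.

```python
def max_resident_waves(vgprs_per_thread, num_cus,
                       simds_per_cu=4,      # assumption: GCN-style CU layout
                       vgpr_budget=256,     # assumption: VGPRs per lane per SIMD
                       wave_cap=10,         # assumption: hardware cap per SIMD
                       granularity=4):      # assumption: VGPR allocation block size
    """Estimate how many waves can be resident at once, limited by VGPRs only."""
    vgprs = max(1, vgprs_per_thread)
    # Round the kernel's VGPR count up to the allocation granularity.
    allocated = -(-vgprs // granularity) * granularity
    # Waves per SIMD: whatever fits in the register budget, up to the cap.
    waves_per_simd = min(wave_cap, vgpr_budget // allocated)
    return waves_per_simd * simds_per_cu * num_cus


if __name__ == "__main__":
    for vgprs in (24, 64, 128, 256):
        print(vgprs, max_resident_waves(vgprs, num_cus=10))
    # 24 400
    # 64 160
    # 128 80
    # 256 40
```

This only looks at VGPRs. SGPR and LDS usage and workgroup size can cut the number further, so treat it as an upper bound.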