Channel: AMD Developer Forums: Message List - Global synchronization inside the kernel

Re: Global synchronization inside the kernel


Please note that these small numbers of waves are for the smallest GCN chip, which has only 10 CUs, not 32.

With 40 threads (waves) it is possible to use all 640 stream processors, but without any latency hiding and only with simple instructions. GDS sync only works with 4 waves per CU; otherwise it deadlocks.

With 60 threads (thanks for ds_gws_barrier) it was possible to put 6 waves on every CU, which better tolerates the 'fat' instruction stream I'm planning to give them.
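For anyone following the numbers, here is a rough sketch of the occupancy arithmetic behind the 40 and 60 figures. It assumes the usual GCN layout (4 SIMDs per CU, 16 lanes per SIMD, 64 workitems per wave); the `occupancy` helper and the constant names are only for illustration.

```python
# Occupancy arithmetic for a 10-CU GCN chip.
# Assumption: standard GCN layout of 4 SIMDs per CU, 16 lanes per SIMD,
# and 64 workitems per wave. The helper name is hypothetical.
CUS = 10
SIMDS_PER_CU = 4
LANES_PER_SIMD = 16
WAVE_SIZE = 64

def occupancy(total_waves):
    """Return lane count and per-CU / per-SIMD wave counts for resident waves."""
    return {
        "lanes": CUS * SIMDS_PER_CU * LANES_PER_SIMD,
        "workitems": total_waves * WAVE_SIZE,
        "waves_per_cu": total_waves / CUS,
        "waves_per_simd": total_waves / (CUS * SIMDS_PER_CU),
    }

for waves in (40, 60):
    print(waves, occupancy(waves))
```

With 40 waves every SIMD holds exactly one wave, so there is nothing to switch to while that wave waits, which is why there is no latency hiding. With 60 waves the average is 1.5 waves per SIMD.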

I measured 700 GFLOPS with MADs while syncing all the workitems at 220 kHz. This means a sync point after every 400 v_mad_f32. On a 1126 GFLOPS card that's not that bad.
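A quick sanity check of the "every 400 v_mad_f32" figure, as a sketch. It assumes the 700 GFLOPS measurement was taken in the 60-wave configuration, that a MAD counts as 2 flops, and that the 400 is counted per wave; the function name is hypothetical.

```python
# Rough check of the sync interval from the measured throughput.
# Assumptions: 60 resident waves, 64 workitems per wave, 2 flops per MAD,
# and the interval is counted in v_mad_f32 instructions per wave.
WAVE_SIZE = 64
FLOPS_PER_MAD = 2

def mads_per_wave_between_syncs(gflops, total_waves, sync_hz):
    """v_mad_f32 instructions each wave issues between two sync points."""
    flops_per_wave_mad = FLOPS_PER_MAD * WAVE_SIZE   # one v_mad_f32 covers 64 lanes
    wave_mads_per_sec = gflops * 1e9 / flops_per_wave_mad
    return wave_mads_per_sec / total_waves / sync_hz

interval = mads_per_wave_between_syncs(700, 60, 220e3)
efficiency = 700 / 1126                              # measured vs. peak
print(f"{interval:.0f} v_mad_f32 per wave between syncs")
print(f"{efficiency:.0%} of peak")
```

Under these assumptions the interval comes out near 414 instructions per wave, in line with the quoted 400, and the measured rate is about 62% of peak.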

There's also a noticeable kernel launch overhead: I have to launch 100 kernels every second because it has to be interactive.

