Channel: AMD Developer Forums: Message List - Global synchronization inside the kernel

Re: Global synchronization inside the kernel


Wow, thanks for MAC! Now I'm at 960 GFlops with a 230 kHz sync rate. I do convolution most of the time, so that's the proper instruction.

(Gotta memorize that mad = mad+mac+madak+madmk. Even in my Mandelbrot example there's a spot for MAC.)
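To show why convolution maps so well onto MAC, here is a minimal sketch in plain Python (not GPU code). The function name `convolve_mac` and the sample data are hypothetical, made up for illustration only. The point is just that every inner step has the form acc = a*b + acc, where the accumulator is both an input and the destination.

```python
def convolve_mac(signal, taps):
    """Direct-form convolution (valid part only).

    Illustrative only: each inner step is a single multiply-accumulate.
    """
    out = []
    last = len(taps) - 1
    for n in range(len(signal) - last):
        acc = 0.0
        for k, tap in enumerate(taps):
            # acc = tap * sample + acc: the destination is also the addend,
            # which is the shape a multiply-accumulate instruction encodes.
            acc = tap * signal[n + last - k] + acc
        out.append(acc)
    return out


# Two-tap averaging filter over a short ramp (made-up data).
print(convolve_mac([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))  # [1.5, 2.5, 3.5]
```

On the GPU the same accumulate-in-place pattern is what lets the compiler or hand-written assembly pick MAC instead of the general three-operand MAD.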

 

I also tried it with workgroup_size=256, and it worked without any slowdown. So the long piano string problem is no longer a problem, as I can ensure that every 4 adjacent wavefronts use the same LDS memory.

 

Here's how it depends on speed: I have to fit the total number of string points into the whole card:

The whole instrument has 55K string points.

Total workitems: 10(cu)*8(wf)*64=5120

String points per workitems=10.74 -> 11

Longest string = 11*64*4(wfs) = 2816 -> that's 2-3x more than actually needed, thanks to 256 workitems/workgroup

Iterations = 4096  (comes from 512 samples at 8x oversampling)

Maximum time = 10.67ms (512 samples @ 48 kHz)

Estimated instruction count = 183  (based on actual HD6970 simulation, maybe I can optimize better)

Measured time with MACs = 10.52ms  sooooooo close! (That's 7.1ms on the HD6970, which has 2.72 TFlops instead of 1.12)
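
The budget above can be re-derived with a few lines of Python. This is only a sketch of the arithmetic, not part of the actual kernel or host code. The constant names and the `budget` function are hypothetical, and 55K is treated as exactly 55,000 string points, which is an assumption.

```python
import math

# Figures taken from the post; the names are made up for this sketch.
STRING_POINTS = 55_000      # "55K" treated as exactly 55,000 (assumption)
COMPUTE_UNITS = 10
WAVEFRONTS_PER_CU = 8
WAVEFRONT_SIZE = 64
WORKGROUP_SIZE = 256
SAMPLES = 512
OVERSAMPLING = 8
SAMPLE_RATE_HZ = 48_000
MEASURED_MS = 10.52         # measured kernel time with MACs


def budget():
    workitems = COMPUTE_UNITS * WAVEFRONTS_PER_CU * WAVEFRONT_SIZE
    # Each workitem needs a whole number of string points, so round up.
    points_per_workitem = math.ceil(STRING_POINTS / workitems)
    wavefronts_per_workgroup = WORKGROUP_SIZE // WAVEFRONT_SIZE
    # A string must fit inside one workgroup, since that is what shares LDS.
    longest_string = points_per_workitem * WAVEFRONT_SIZE * wavefronts_per_workgroup
    iterations = SAMPLES * OVERSAMPLING
    # Real-time deadline: the buffer must be computed before it is played.
    deadline_ms = 1000.0 * SAMPLES / SAMPLE_RATE_HZ
    return {
        "workitems": workitems,
        "points_per_workitem": points_per_workitem,
        "longest_string": longest_string,
        "iterations": iterations,
        "deadline_ms": deadline_ms,
        "headroom_ms": deadline_ms - MEASURED_MS,
    }


for name, value in budget().items():
    print(f"{name}: {value}")
```

With these inputs the headroom comes out at roughly 0.15ms, which is why the measured 10.52ms only just fits.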

