To confirm (or disprove) your hypothesis, measure the amount of time each thread requires to complete its work by inserting timing code at the beginning and end of workerThreadStart(). How do your measurements explain the speedup graph you previously created?
❯ ./mandelbrot --view 1 --threads 4
[mandelbrot serial]: [71.300] ms
Wrote image file mandelbrot-v1-serial.ppm
[thread 3]: [18.512] ms
[thread 2]: [18.527] ms
[thread 1]: [18.593] ms
[thread 0]: [18.725] ms
[thread 2]: [18.523] ms
[thread 1]: [18.584] ms
[thread 3]: [18.573] ms
[thread 0]: [18.591] ms
[thread 3]: [18.553] ms
[thread 2]: [18.577] ms
[thread 1]: [18.655] ms
[thread 0]: [18.670] ms
[mandelbrot thread]: [18.669] ms
Wrote image file mandelbrot-v1-thread-4.ppm
++++ (3.82x speedup from 4 threads)

❯ ./mandelbrot --view 2 --threads 4
[mandelbrot serial]: [42.205] ms
Wrote image file mandelbrot-v2-serial.ppm
[thread 3]: [11.019] ms
[thread 2]: [11.066] ms
[thread 0]: [11.096] ms
[thread 1]: [11.224] ms
[thread 2]: [11.047] ms
[thread 3]: [11.029] ms
[thread 1]: [11.075] ms
[thread 0]: [11.124] ms
[thread 1]: [11.076] ms
[thread 2]: [11.236] ms
[thread 3]: [12.659] ms
[thread 0]: [12.844] ms
[mandelbrot thread]: [11.173] ms
Wrote image file mandelbrot-v2-thread-4.ppm
++++ (3.78x speedup from 4 threads)
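To be honest, I don't see why confirming the non-linear speedup requires me to measure the time each thread takes.
But I did run into a new puzzle: I launched 4 threads, yet no matter what I changed there were always 12 lines of thread output.
I then tried other thread counts: 1 thread gives 3 lines, 2 threads give 6, and 3 threads give 9. Magic! (just kidding)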
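For reference, here is a minimal sketch of how per-thread timing like the output above could be collected. It is not the assignment's code: `TimedArgs`, `doWork`, and `runTimed` are hypothetical names, and the workload is a placeholder for the real call to `mandelbrotSerial`. The `repeats` parameter reflects my assumption that the harness runs the threaded version several times; with `repeats = 3`, N threads would print 3N lines, which would match the pattern described above.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-thread arguments (the real struct has more fields).
struct TimedArgs {
    int threadId;
    double elapsedMs;  // filled in by the worker
};

// Placeholder workload (an assumption); the real worker would render its rows here.
static void doWork(int threadId) {
    volatile double sink = 0.0;
    for (int i = 0; i < 100000 * (threadId + 1); i++) {
        sink = sink + i * 0.5;
    }
}

// Time the worker body from entry to exit and print one line per thread.
void timedWorker(TimedArgs* args) {
    auto start = std::chrono::steady_clock::now();
    doWork(args->threadId);
    auto end = std::chrono::steady_clock::now();
    args->elapsedMs =
        std::chrono::duration<double, std::milli>(end - start).count();
    std::printf("[thread %d]: [%.3f] ms\n", args->threadId, args->elapsedMs);
}

// Launch numThreads workers, repeat the whole run `repeats` times,
// and return how many timing lines were printed in total.
int runTimed(int numThreads, int repeats) {
    int lines = 0;
    for (int r = 0; r < repeats; r++) {
        std::vector<TimedArgs> args(numThreads);
        std::vector<std::thread> workers;
        for (int i = 0; i < numThreads; i++) {
            args[i].threadId = i;
            args[i].elapsedMs = 0.0;
            workers.emplace_back(timedWorker, &args[i]);
        }
        for (auto& w : workers) {
            w.join();
        }
        lines += numThreads;
    }
    return lines;
}
```

Calling `runTimed(4, 3)` prints twelve `[thread i]` lines, so a repeated timing loop in the harness is one plausible explanation for the 3N outputs.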
Modify the mapping of work to threads to improve speedup to 8× on view 0 and almost 8× on views 1 and 2. In your writeup, describe your approach and report the final 16-thread speedup obtained. Also comment on the difference in scaling behavior from 4 to 8 threads versus 8 to 16 threads.
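    // Interleaved assignment: this thread renders rows
    // threadId, threadId + numThreads, threadId + 2 * numThreads, ...
    for (int i = 0; i < threadNumRows; i++) {
        int threadStartRow = i * args->numThreads + args->threadId;
        if (threadStartRow < args->height) {
            // Render exactly one row per call.
            mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                             args->width, args->height,
                             threadStartRow, 1,
                             args->maxIterations, args->output);
        }
    }
    return NULL;
}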
// Line 147 - 159
for (int i = 0; i < numThreads; i++) {
    args[i].threadId = i;
    // TODO: Set thread arguments here
    args[i].x0 = x0;
    args[i].y0 = y0;
    args[i].x1 = x1;
    args[i].y1 = y1;
    args[i].width = width;
    args[i].height = height;
    args[i].maxIterations = maxIterations;
    args[i].output = output;
    args[i].numThreads = numThreads;
}
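To see why the interleaved mapping above helps, here is a small sketch that compares the heaviest per-thread load under a blocked split and under an interleaved split. The cost profile is made up (rows near the middle cost more) and the function names are hypothetical; real per-row costs depend on the view being rendered.

```cpp
#include <algorithm>
#include <vector>

// Blocked: thread t gets one contiguous band of rows.
// Returns the largest total cost assigned to any single thread.
long maxLoadBlocked(const std::vector<long>& rowCost, int numThreads) {
    int height = static_cast<int>(rowCost.size());
    int rowsPerThread = (height + numThreads - 1) / numThreads;
    long worst = 0;
    for (int t = 0; t < numThreads; t++) {
        int begin = std::min(height, t * rowsPerThread);
        int end = std::min(height, begin + rowsPerThread);
        long load = 0;
        for (int r = begin; r < end; r++) {
            load += rowCost[r];
        }
        worst = std::max(worst, load);
    }
    return worst;
}

// Interleaved: thread t gets rows t, t + numThreads, t + 2 * numThreads, ...
long maxLoadInterleaved(const std::vector<long>& rowCost, int numThreads) {
    int height = static_cast<int>(rowCost.size());
    long worst = 0;
    for (int t = 0; t < numThreads; t++) {
        long load = 0;
        for (int r = t; r < height; r += numThreads) {
            load += rowCost[r];
        }
        worst = std::max(worst, load);
    }
    return worst;
}

// Made-up cost profile (an assumption): cheap rows at the edges,
// expensive rows in the middle.
std::vector<long> tentProfile(int height) {
    std::vector<long> cost(height);
    for (int r = 0; r < height; r++) {
        cost[r] = 1 + std::min(r, height - 1 - r);
    }
    return cost;
}
```

With `tentProfile(8)` and 4 threads, the blocked split leaves the slowest thread with a load of 7 while the interleaved split gives every thread a load of 5. Since the run finishes only when the slowest thread does, the more even split is the one that scales.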
Problem 2 - Vectorizing Code Using SIMD Intrinsics
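Modify clampedExpVector() in functions.cpp to implement vectorized fast exponentiation (exponentiation by squaring).
Modify arraySumVector() in functions.cpp to implement vectorized array summation.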
Q&A
How much does the vector utilization change as W changes? Explain the reason for these changes and the degree of sensitivity the utilization has on the vector width. Explain how the total number of vector instructions varies with W.
void clampedExpVector(float *values, int *exponents, float *output, int N) {
    // TODO: Implement your vectorized version of clampedExpSerial here
    __cmu418_vec_float x, result, xpower;
    __cmu418_vec_int y, tmp_y;  // tmp_y stores y & 0x1

    __cmu418_vec_float exp = _cmu418_vset_float(4.18f);
    __cmu418_vec_int zero = _cmu418_vset_int(0);
    __cmu418_vec_int one = _cmu418_vset_int(1);
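The snippet above only shows the setup, so here is a plain C++ sketch of the idea it points at: exponentiation by squaring with a per-lane mask and a final clamp. This is my own model, not the missing body. The 4.18f ceiling and the `y & 0x1` test come from the snippet; the lane width `kLanes`, the loop structure, and the function name `clampedExpLanes` are assumptions, and ordinary arrays stand in for the vector intrinsics.

```cpp
#include <algorithm>

constexpr int kLanes = 4;        // hypothetical vector width
constexpr float kClamp = 4.18f;  // same ceiling as in the snippet above

// Emulates a masked SIMD loop: every lane in a chunk steps together, and a
// lane is "masked off" once its exponent has been fully consumed.
void clampedExpLanes(const float* values, const int* exponents,
                     float* output, int n) {
    for (int i = 0; i < n; i += kLanes) {
        int lanes = std::min(kLanes, n - i);  // handle a partial last chunk
        float result[kLanes];
        float xpower[kLanes];
        int y[kLanes];
        for (int l = 0; l < lanes; l++) {
            result[l] = 1.f;
            xpower[l] = values[i + l];
            y[l] = exponents[i + l];
        }
        bool anyActive;
        do {
            anyActive = false;
            for (int l = 0; l < lanes; l++) {
                if (y[l] <= 0) continue;  // lane is masked off
                if (y[l] & 0x1) {
                    result[l] *= xpower[l];  // this exponent bit is set
                }
                xpower[l] *= xpower[l];      // square for the next bit
                y[l] >>= 1;
                anyActive = anyActive || (y[l] > 0);
            }
        } while (anyActive);
        for (int l = 0; l < lanes; l++) {
            output[i + l] = result[l] > kClamp ? kClamp : result[l];
        }
    }
}
```

The chunk keeps looping until its largest exponent is exhausted, so the number of steps grows with the bit length of the exponent rather than with its value.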
float arraySumVector(float *values, int N) {
    // TODO: Implement your vectorized version here

    __cmu418_vec_float x, result = _cmu418_vset_float(0.f);
    __cmu418_mask maskAll = _cmu418_init_ones();
    int cnt = VECTOR_WIDTH;

    for (int i = 0; i < N; i += VECTOR_WIDTH) {
        _cmu418_vload_float(x, values + i, maskAll);      // x = values[i];
        _cmu418_vadd_float(result, result, x, maskAll);   // result += x;
    }
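The loop above leaves one partial sum in each lane, and those still have to be combined into a single float. Here is a plain C++ sketch of the two phases, with an array standing in for the vector register. The width `kWidth`, the name `arraySumLanes`, and the folding pattern are assumptions: this is a generic tree reduction, not necessarily the instruction sequence used in the actual solution. Like the loop above, it assumes N is a multiple of the vector width.

```cpp
#include <vector>

constexpr int kWidth = 8;  // hypothetical vector width; must be a power of two

float arraySumLanes(const float* values, int n) {
    float lane[kWidth] = {0.f};

    // Phase 1: lane l accumulates elements l, l + kWidth, l + 2 * kWidth, ...
    // (assumes n % kWidth == 0).
    for (int i = 0; i < n; i += kWidth) {
        for (int l = 0; l < kWidth; l++) {
            lane[l] += values[i + l];
        }
    }

    // Phase 2: fold the upper half onto the lower half, log2(kWidth) times.
    for (int span = kWidth / 2; span >= 1; span /= 2) {
        for (int l = 0; l < span; l++) {
            lane[l] += lane[l + span];
        }
    }
    return lane[0];
}
```

Phase 1 takes N / kWidth vector adds and phase 2 takes log2(kWidth) folding steps. Note that the floating-point summation order differs from a left-to-right serial sum, so results may differ slightly for non-integer data.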
Problem 4, Part 2. Combining instruction-level and SIMD parallelism
Q&A
How much speedup does this two-way parallelism give over the regular ISPC version? Does it vary across different inputs (i.e., different --views)? When is it worth the effort?
❯ ./mandelbrot_ispc -v 1
[mandelbrot serial]: [183.684] ms
Wrote image file mandelbrot-1-serial.ppm
[mandelbrot ispc]: [44.045] ms
Wrote image file mandelbrot-1-ispc.ppm
[mandelbrot ispc parallel]: [66.698] ms
Wrote image file mandelbrot-1-ispc-par.ppm
(4.17x speedup from ISPC)
(2.75x speedup from ISPC+parallelism)

❯ ./mandelbrot_ispc -v 2
[mandelbrot serial]: [107.956] ms
Wrote image file mandelbrot-2-serial.ppm
[mandelbrot ispc]: [30.914] ms
Wrote image file mandelbrot-2-ispc.ppm
[mandelbrot ispc parallel]: [43.129] ms
Wrote image file mandelbrot-2-ispc-par.ppm
(3.49x speedup from ISPC)
(2.50x speedup from ISPC+parallelism)

❯ ./mandelbrot_ispc -v 3
[mandelbrot serial]: [260.036] ms
Wrote image file mandelbrot-3-serial.ppm
[mandelbrot ispc]: [65.158] ms
Wrote image file mandelbrot-3-ispc.ppm
[mandelbrot ispc parallel]: [89.083] ms
Wrote image file mandelbrot-3-ispc-par.ppm
(3.99x speedup from ISPC)
(2.92x speedup from ISPC+parallelism)

❯ ./mandelbrot_ispc -v 4
[mandelbrot serial]: [257.352] ms
Wrote image file mandelbrot-4-serial.ppm
[mandelbrot ispc]: [64.656] ms
Wrote image file mandelbrot-4-ispc.ppm
[mandelbrot ispc parallel]: [88.640] ms
Wrote image file mandelbrot-4-ispc-par.ppm
(3.98x speedup from ISPC)
(2.90x speedup from ISPC+parallelism)
export void mandelbrot_ispc_par2(uniform float x0, uniform float y0,
                                 uniform float x1, uniform float y1,
                                 uniform int width, uniform int height,
                                 uniform int maxIterations,
                                 uniform int output[]) {
    uniform float dx = (x1 - x0) / width;
    uniform float dy = (y1 - y0) / height;

    // TODO: Write ISPC code that will use function mandel_par2 to process
    // two rows on each pass.
    // You should use the foreach construct.
    // You should handle the case where the height is not a multiple
    // of 2.

    uniform int numRowsPerPass = (height + 1) / 2;

    foreach (i = 0 ... numRowsPerPass, j = 0 ... width) {
        int row0 = i * 2;
        int row1 = min(row0 + 1, height - 1);
❯ ./cuberoot
[cuberoot serial]: [3658.882] ms
[cuberoot ispc]: [884.804] ms
[cuberoot task ispc]: [134.455] ms
(4.14x speedup from ISPC)
(27.21x speedup from task ISPC)
Q&A
Modify the function initGood/initBad() in the file data.cpp to generate data that will yield a very high relative speedup of the ISPC implementations.
Does your modification improve SIMD speedup? Does it improve multi-core speedup? Please explain why.
The best I could reach was about a 40x speedup, with value[i] = 1.9999999; it seems the value can be pushed arbitrarily close to 2.
❯ ./cuberoot -d g
[cuberoot serial]: [7345.921] ms
[cuberoot ispc]: [1190.029] ms
[cuberoot task ispc]: [178.646] ms
(6.17x speedup from ISPC)
(41.12x speedup from task ISPC)
At the other extreme, the speedup drops to about 2.3x, with value[i] = 1.0.
[cuberoot serial]: [80.059] ms
[cuberoot ispc]: [43.914] ms
[cuberoot task ispc]: [34.633] ms
(1.82x speedup from ISPC)
(2.31x speedup from task ISPC)