Final Assignment (Assignment 4)
Objective:
We consider a special case of matrix multiplication:
C := C + A*B
where A, B, and C are n x n matrices. This can be performed using 2n^3 floating-point operations (n^3 adds, n^3 multiplies), as in the following pseudocode:
for i = 1 to n
    for j = 1 to n
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
In this assignment, we will design and implement a C program that performs computations on large matrices. The matrices are large enough that running the program causes paging. Read pp. 225-228 and pp. 413-417 of the Patterson & Hennessy textbook.
Purpose:
Different ways of copying and accessing the matrix data can have very different impacts on the program's runtime. You are required to observe these impacts and ultimately propose a design that efficiently leverages the mechanisms described below to achieve the best performance.
Instructions:
- You will be paired up in teams of four. Sign up for your team on the Blackboard wiki by April 12.
  o If you do not find a team, the instructor will assign you to one randomly.
- After teams are decided, you need to sign up for a presentation slot via the Blackboard wiki.
  o There are 4 groups of at most 6 teams each.
  o Each team will sign up for a 10-minute time slot.
  o Each team must arrive before its group begins and stay until the last team of the group finishes.
- Each team must submit the following by 12:00 pm, April 30:
  o Presentation slides, to the Blackboard assignment link
  o Report (write-up), to the Turnitin link, including:
    - The names of the people on your team
    - The optimizations used or attempted
    - The reason for any odd behavior in performance, if any
    - A description of which mechanisms are used in your implementation and how much performance improvement is achieved, stated in terms of runtime
    - How the performance changed when running your optimized code on a different machine (you can run the same program on each member's machine, since they have different hardware specs)
  o Source code, to the Blackboard assignment link
Requirements:
1. Complete the given C program and use multiple optimization mechanisms to improve its runtime.
2. You are required to use all the mechanisms discussed in class and recitation.
3. You are only allowed to use the standard libraries and the intrinsics library to implement your program.
4. You are not allowed to create new threads in your implementation.
5. When compiling your code with gcc, you are allowed to use optimization level 3 (-O3).
Mechanisms:
1. Caching: You are required to try different cache block sizes in your code and use the block size that gives you the minimum runtime when integrated with the other techniques (SIMD and superscalar mechanisms).
2. SIMD: You are required to make use of Single Instruction Multiple Data (SIMD). It means
performing a single operation on multiple data points simultaneously.
3. Superscalar mechanisms: A superscalar processor executes more than one instruction during a
clock cycle by simultaneously dispatching multiple instructions to different components on the
processor.
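Superscalar hardware can only overlap instructions that are independent of each other. One way to expose that independence in the DGEMM inner loop is to split the single accumulator into several. The sketch below is ours (the name `dot_ilp` and the four-way split are illustrative assumptions, not part of the handout); it shows the idea on the dot product that dgemm's inner loop computes:

```c
#include <stddef.h>

/* Illustrative sketch: the inner dot product of dgemm with four independent
 * accumulators. Because s0..s3 do not depend on one another, a superscalar
 * core can dispatch their multiply-adds in parallel instead of serializing
 * on one long add chain. Assumes n is a multiple of 4.
 * a_row: one row of A (contiguous); b_col: one column of B, whose elements
 * are `stride` doubles apart (stride = n for a row-major n x n matrix). */
static double dot_ilp(const double *a_row, const double *b_col,
                      int n, int stride)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int k = 0; k < n; k += 4) {
        s0 += a_row[k]     * b_col[(k)     * stride];
        s1 += a_row[k + 1] * b_col[(k + 1) * stride];
        s2 += a_row[k + 2] * b_col[(k + 2) * stride];
        s3 += a_row[k + 3] * b_col[(k + 3) * stride];
    }
    /* pairwise combine at the end; only these two adds are serialized */
    return (s0 + s1) + (s2 + s3);
}
```

Note that this is the same transformation the loop-unrolling example later in this handout performs with its s1..s4 temporaries; splitting the accumulator is what turns plain unrolling into an instruction-level-parallelism win.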
Experiments:
- For each attempted optimization, you can vary the matrix size (for example, n = 128, 256, 512, 1024, 2048), the block size (m = 2^x), the number of unrolled iterations, etc.
- You need to measure the runtime of each experiment.
- You need to verify the correctness of each attempted optimization.
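One convenient way to run the block-size experiments is to make the block size a runtime parameter instead of a compile-time macro, so a single binary can sweep all candidate sizes. The sketch below is our own (the names `dgemm_block_m` and `sweep_block_sizes` are assumptions, not part of the given code) and reuses the handout's timing approach:

```c
#include <stdio.h>
#include <sys/time.h>

#define N 1024
static double a[N * N], b[N * N], c[N * N];

/* Blocked DGEMM with the block size m as a runtime parameter, so one binary
 * can sweep m = 2, 4, 8, ... without recompiling. Assumes m divides n and
 * that c is zero-initialized (each block accumulates into it). */
static void dgemm_block_m(int n, int m,
                          const double *a, const double *b, double *c)
{
    for (int si = 0; si < n; si += m)
        for (int sj = 0; sj < n; sj += m)
            for (int sk = 0; sk < n; sk += m)
                for (int i = si; i < si + m; i++)
                    for (int j = sj; j < sj + m; j++) {
                        double cij = c[i * n + j];
                        for (int k = sk; k < sk + m; k++)
                            cij += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = cij;
                    }
}

/* Time one full N x N multiply per candidate block size and print it. */
static void sweep_block_sizes(void)
{
    for (int m = 2; m <= 256; m *= 2) {
        for (int i = 0; i < N * N; i++) c[i] = 0.0;
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        dgemm_block_m(N, m, a, b, c);
        gettimeofday(&t1, NULL);
        double ms = (t1.tv_sec - t0.tv_sec) * 1000.0
                  + (t1.tv_usec - t0.tv_usec) / 1000.0;
        printf("block size m = %3d: %.1f ms\n", m, ms);
    }
}
```

For the final submission you would fix BLOCK_SIZE to the winning value, since a compile-time constant lets the compiler unroll and schedule the block loops more aggressively.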
Reference:
- Matrix Multiplication
- Intel Intrinsics Guide
- GCC documentation
- Streaming SIMD Extensions
Grading Rubrics:
1. If your code does not run, you will not receive any credit.
2. You are required to combine at least two optimization techniques; otherwise you will not receive any credit.
3. Combining two methods with proper experiments can earn you at most 50%.
4. Combining all 3 techniques can earn you an additional (bonus) 20%.
5. 50% of the points are reserved for the quality of your presentation and report (25% each); i.e., how you tried fine-tuning different parameters and combined several different approaches to reach your result.
Sample Code for DGEMM
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <string.h>
#include <math.h>
#include <immintrin.h>
#include <x86intrin.h>
#define ALIGN __attribute__ ((aligned (32)))
#define SIZE 1024
double ALIGN a[SIZE * SIZE];
double ALIGN b[SIZE * SIZE];
double ALIGN c[SIZE * SIZE];
double ALIGN c1[SIZE * SIZE];
// naive matrix multiplication (result stored in c1 for correctness checking)
void dgemm(int n)
{
int i,j,k;
for(i=0; i<n; i++)
{
for(j=0; j<n; j++)
{
double cij = 0;
for(k=0; k<n; k++)
cij = cij + a[i*n+k] * b[k*n+j];
c1[i*n+j] = cij;
}
}
}
/* Implement this function with multiple optimization techniques. */
void optimized_dgemm(int n)
{
// Call your optimization attempts from here.
}
int main(int argc, char** argv)
{
int i, j;
time_t t;
struct timeval start, end;
double elapsed_time;
int check_correctness = 0;
int correct = 1;
if(argc > 1)
{
if(strcmp(argv[1], "corr") == 0)
{
check_correctness = 1;
}
}
/* Initialize random number generator */
srand((unsigned) time(&t));
/* Populate the arrays with random values */
for(i=0; i< SIZE; i++)
{
for(j=0; j< SIZE; j++)
{
a[i* SIZE +j] = (double)rand() / (RAND_MAX + 1.0);
b[i* SIZE +j] = (double)rand() / (RAND_MAX + 1.0);
c[i* SIZE +j] = 0.0;
c1[i* SIZE +j] = 0.0;
}
}
gettimeofday(&start, NULL);
/* Call your optimized function optimized_dgemm */
optimized_dgemm(SIZE);
gettimeofday(&end, NULL);
/* For TA use only */
if(check_correctness)
{
dgemm(SIZE);
for(i=0; (i< SIZE) && correct ; i++)
{
for(j=0; (j< SIZE) && correct; j++)
{
if(fabs(c[i* SIZE +j]-c1[i* SIZE +j]) >= 0.0000001)
{
printf("%f != %f\n", c[i* SIZE +j], c1[i* SIZE +j]);
correct = 0;
}
}
}
if(correct)
printf("Result is correct!\n");
else
printf("Result is incorrect!\n");
}
elapsed_time = (end.tv_sec - start.tv_sec) * 1000.0;
elapsed_time += (end.tv_usec - start.tv_usec) / 1000.0;
printf("optimized_dgemm finished in %f milliseconds.\n", elapsed_time);
}
Loop Unrolling Mechanism Example
void dgemm_unrolling(int n)
{
int i,j,k;
for(i=0; i<n; i++)
{
for(j=0; j<n; j++)
{
double cij = 0;
for(k=0; k<n; k+=4) /* assumes n is a multiple of 4 */
{
double s1 = a[i*n+k] * b[k*n+j];
double s2 = a[i*n+(k+1)] * b[(k+1)*n+j];
double s3 = a[i*n+(k+2)] * b[(k+2)*n+j];
double s4 = a[i*n+(k+3)] * b[(k+3)*n+j];
cij += s1 + s2 + s3 + s4;
}
c[i*n+j] = cij;
}
}
}
Cache Blocking Mechanism Example
#define BLOCK_SIZE 4
void do_block(int n, int si, int sj, int sk, double *a, double *b, double *c)
{
int i, j, k;
for (i=si; i<si+BLOCK_SIZE; i++)
for (j=sj; j<sj+BLOCK_SIZE; j++) {
double cij = c[i*n+j];
for (k=sk; k<sk+BLOCK_SIZE; k++)
cij += a[i*n+k] * b[k*n+j];
c[i*n+j] = cij;
}
}
void dgemm_blocking(int n)
{
int i, j, k;
for(i=0; i<n; i+=BLOCK_SIZE)
for(j=0; j<n; j+=BLOCK_SIZE) {
/* c must be zero-initialized before the call; do_block accumulates into it */
for(k=0; k<n; k+=BLOCK_SIZE)
do_block(n, i, j, k, a, b, c);
}
}
SIMD Mechanism Example
void dgemm_intrin(int n)
{
int i,j,k;
/* assumes n is a multiple of 4, and a, b, c are 32-byte aligned (see ALIGN) */
for(i=0; i<n; i+=1)
{
for(j=0; j<n; j+=4)
{
__m256d c4 = _mm256_load_pd(&c[i * n+j]);
for(k=0; k<n; k++)
{
__m256d a4 = _mm256_broadcast_sd(&a[i*n+k]);
__m256d b4 = _mm256_load_pd(&b[k*n+j]);
c4 = _mm256_add_pd(c4, _mm256_mul_pd(a4, b4));
}
_mm256_store_pd(&c[i*n+j], c4);
}
}
}
Example Report from the past semester
After implementing the 3 given mechanisms, I tried many other things.
Things that sped up the runtime but are not detailed here because the boosts were minuscule:
- Reordering the variables of sequential operations
- Precomputing repeatedly used arithmetic by hand to relieve the CPU
- Loop tiling + loop jamming to reduce branch checks and thus stalls
- Using the load intrinsic directly within the function
- Deleting declarations of objects that were no longer needed
- Performing all loads first, before executing operations (fewer hazards, fewer stalls)
Things that I tried but that did not decrease the runtime, so I dropped them:
- Using block sizes greater than xx
- Unrolling more than the optimal amount to see the results
- Using the streaming-store intrinsic (_mm_stream_pd) to initialize array c with 0's
- Using memcmp or memset (which the compiler can vectorize with SSE4.1/SSE4.2) to zero out the array
- Manually precomputing some repetitive arithmetic (it actually slowed things down)
- Using a logical shift left by 10 instead of multiplying by 1024 (it slowed things down)
- Removing the pointer parameters (unstable and spiky, but best result 104 ms)
- Initializing the global array with c = { 0 } so it is zeroed from the beginning
Things that I wanted to try but could not because they were not released yet:
- The AVX-512 intrinsics, since they include functions that could perform twice as fast as the current ones
[Chart: Average runtime for 1024 x 1024 (ms), per variant: Blocking; Loop Unrolling + Blocking; Intrinsic + Blocking + Unrolling; the previous 3 + SLL multiply; the previous 3 + FMA intrinsic; + sequential locality; + temporal & spatial locality; + unroll do_block; + loop interchange & optimization.]