Final Assignment (Assignment 4)
Objective:
We consider a special case of matrix multiplication:
C := C + A*B
where A, B, and C are n x n matrices. This can be performed using 2n^3 floating-point operations (n^3 adds, n^3 multiplies), as in the following pseudocode:
for i = 1 to n
    for j = 1 to n
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
In this assignment, we will design and implement a C program that performs computations on large matrices. The matrices are large enough that running the program causes paging. Read pp. 225-228 and pp. 413-417 of the Patterson & Hennessy textbook.
Purpose:
Different ways of copying and accessing the matrix data can have very different impacts on the program's runtime. You are required to observe these impacts and ultimately propose a design that efficiently leverages the mechanisms described below to achieve the best performance.
Instructions:
- You will be paired up in teams of four. Sign up for your team on the Blackboard wiki by April 12.
  o If you do not find a team, the instructor will assign you to one randomly.
- After teams are decided, you need to sign up for a presentation slot via the Blackboard wiki.
  o There are 4 groups of at most 6 teams each.
  o Each team will sign up for a 10-minute time slot.
  o Each team must arrive before its group begins and stay until the last team of the group finishes.
- Each team must submit the following by 12:00 pm, April 30:
  o Presentation slides, to the Blackboard assignment link
  o Report (write-up), to the Turnitin link, including:
    - The names of the people on your team
    - The optimizations used or attempted
    - The reason for any odd behavior in performance, if any
    - A description of which mechanisms are used in your implementation and how much performance improvement is achieved, stated in terms of runtime
    - How the performance changed when running your optimized code on a different machine (you can run the same program on each member's machine, since they have different hardware specs)
  o Source code, to the Blackboard assignment link
Requirements:
1. Complete the given C program and use multiple optimization mechanisms to improve its runtime.
2. You are required to use all the mechanisms discussed in class and recitation.
3. You are only allowed to use the standard libraries and the intrinsics library to implement your program.
4. You are not allowed to create new threads in your implementation.
5. When compiling your code with gcc, you are allowed to use optimization level 3 (-O3).
Mechanisms:
1. Caching: You are required to try different cache block sizes in your code and use the block size that gives you the minimum runtime when integrated with the other techniques (SIMD and superscalar mechanisms).
2. SIMD: You are required to make use of Single Instruction Multiple Data (SIMD). It means
performing a single operation on multiple data points simultaneously.
3. Superscalar mechanisms: A superscalar processor executes more than one instruction during a
clock cycle by simultaneously dispatching multiple instructions to different components on the
processor.
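Superscalar hardware can only overlap instructions that are independent of each other. One way to expose that independence in the DGEMM inner loop is to split the single accumulator into several. The sketch below is ours (the name `dot_ilp` and the four-way split are illustrative assumptions, not part of the handout); it shows the idea on the dot product that dgemm's inner loop computes:

```c
#include <stddef.h>

/* Illustrative sketch: the inner dot product of dgemm with four independent
 * accumulators. Because s0..s3 do not depend on one another, a superscalar
 * core can dispatch their multiply-adds in parallel instead of serializing
 * on one long add chain. Assumes n is a multiple of 4.
 * a_row: one row of A (contiguous); b_col: one column of B, whose elements
 * are `stride` doubles apart (stride = n for a row-major n x n matrix). */
static double dot_ilp(const double *a_row, const double *b_col,
                      int n, int stride)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int k = 0; k < n; k += 4) {
        s0 += a_row[k]     * b_col[(k)     * stride];
        s1 += a_row[k + 1] * b_col[(k + 1) * stride];
        s2 += a_row[k + 2] * b_col[(k + 2) * stride];
        s3 += a_row[k + 3] * b_col[(k + 3) * stride];
    }
    /* pairwise combine at the end; only these two adds are serialized */
    return (s0 + s1) + (s2 + s3);
}
```

Note that this is the same transformation the loop-unrolling example later in this handout performs with its s1..s4 temporaries; splitting the accumulator is what turns plain unrolling into an instruction-level-parallelism win.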
Experiments:
- For each attempted optimization, you can vary the matrix size (for example, n = 128, 256, 512, 1024, 2048), the block size (m = 2^x), the number of unrolled iterations, etc.
- You need to measure the runtime of each experiment.
- You need to verify the correctness of each attempted optimization.
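One convenient way to run the block-size experiments is to make the block size a runtime parameter instead of a compile-time macro, so a single binary can sweep all candidate sizes. The sketch below is our own (the names `dgemm_block_m` and `sweep_block_sizes` are assumptions, not part of the given code) and reuses the handout's timing approach:

```c
#include <stdio.h>
#include <sys/time.h>

#define N 1024
static double a[N * N], b[N * N], c[N * N];

/* Blocked DGEMM with the block size m as a runtime parameter, so one binary
 * can sweep m = 2, 4, 8, ... without recompiling. Assumes m divides n and
 * that c is zero-initialized (each block accumulates into it). */
static void dgemm_block_m(int n, int m,
                          const double *a, const double *b, double *c)
{
    for (int si = 0; si < n; si += m)
        for (int sj = 0; sj < n; sj += m)
            for (int sk = 0; sk < n; sk += m)
                for (int i = si; i < si + m; i++)
                    for (int j = sj; j < sj + m; j++) {
                        double cij = c[i * n + j];
                        for (int k = sk; k < sk + m; k++)
                            cij += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = cij;
                    }
}

/* Time one full N x N multiply per candidate block size and print it. */
static void sweep_block_sizes(void)
{
    for (int m = 2; m <= 256; m *= 2) {
        for (int i = 0; i < N * N; i++) c[i] = 0.0;
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        dgemm_block_m(N, m, a, b, c);
        gettimeofday(&t1, NULL);
        double ms = (t1.tv_sec - t0.tv_sec) * 1000.0
                  + (t1.tv_usec - t0.tv_usec) / 1000.0;
        printf("block size m = %3d: %.1f ms\n", m, ms);
    }
}
```

For the final submission you would fix BLOCK_SIZE to the winning value, since a compile-time constant lets the compiler unroll and schedule the block loops more aggressively.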
Reference:
- Matrix Multiplication
- Intel Intrinsics Guide
- GCC documentation
- Streaming SIMD Extensions
Grading Rubrics:
1. If your code does not run, you will not receive any credit.
2. You are required to combine at least two optimization techniques; otherwise you will not receive any credit.
3. Combining two methods with proper experiments can earn you at most 50%.
4. Combining all 3 techniques can earn you an additional (bonus) 20%.
5. 50% of the points are reserved for the quality of your presentation and report (25% each); i.e., how you tried fine-tuning different parameters and combined several different approaches to reach your result.
Sample Code for DGEMM
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <string.h>
#include <math.h>
#include <immintrin.h>
#include <x86intrin.h>
#define ALIGN __attribute__ ((aligned (32)))
#define SIZE 1024
double ALIGN a[SIZE * SIZE];
double ALIGN b[SIZE * SIZE];
double ALIGN c[SIZE * SIZE];
double ALIGN c1[SIZE * SIZE];
// naive matrix multiplication (result stored in c1 for correctness checking)
void dgemm(int n)
{
int i,j,k;
for(i=0; i<n; i++)
{
for(j=0; j<n; j++)
{
double cij = 0;
for(k=0; k<n; k++)
cij = cij + a[i*n+k] * b[k*n+j];
c1[i*n+j] = cij;
}
}
}
/* Implement this function with multiple optimization techniques. */
void optimized_dgemm(int n)
{
// Call your optimization attempts from here.
}
int main(int argc, char** argv)
{
int i, j;
time_t t;
struct timeval start, end;
double elapsed_time;
int check_correctness = 0;
int correct = 1;
if(argc > 1)
{
if(strcmp(argv[1], "corr") == 0)
{
check_correctness = 1;
}
}
/* Initialize random number generator */
srand((unsigned) time(&t));
/* Populate the arrays with random values */
for(i=0; i< SIZE; i++)
{
for(j=0; j< SIZE; j++)
{
a[i* SIZE +j] = (double)rand() / (RAND_MAX + 1.0);
b[i* SIZE +j] = (double)rand() / (RAND_MAX + 1.0);
c[i* SIZE +j] = 0.0;
c1[i* SIZE +j] = 0.0;
}
}
gettimeofday(&start, NULL);
/* Call your optimized function optimized_dgemm */
optimized_dgemm(SIZE);
gettimeofday(&end, NULL);
/* For TA use only */
if(check_correctness)
{
dgemm(SIZE);
for(i=0; (i< SIZE) && correct ; i++)
{
for(j=0; (j< SIZE) && correct; j++)
{
if(fabs(c[i* SIZE +j]-c1[i* SIZE +j]) >= 0.0000001)
{
printf("%f != %f\n", c[i* SIZE +j], c1[i* SIZE +j]);
correct = 0;
}
}
}
if(correct)
printf("Result is correct!\n");
else
printf("Result is incorrect!\n");
}
elapsed_time = (end.tv_sec - start.tv_sec) * 1000.0;
elapsed_time += (end.tv_usec - start.tv_usec) / 1000.0;
printf("optimized_dgemm finished in %f milliseconds.\n", elapsed_time);
}
Loop Unrolling Mechanism Example
void dgemm_unrolling(int n)
{
int i,j,k;
for(i=0; i<n; i++)
{
for(j=0; j<n; j++)
{
double cij = 0;
for(k=0; k<n; k+=4) /* assumes n is a multiple of 4 */
{
double s1 = a[i*n+k] * b[k*n+j];
double s2 = a[i*n+(k+1)] * b[(k+1)*n+j];
double s3 = a[i*n+(k+2)] * b[(k+2)*n+j];
double s4 = a[i*n+(k+3)] * b[(k+3)*n+j];
cij += s1 + s2 + s3 + s4;
}
c[i*n+j] = cij;
}
}
}
Cache Blocking Mechanism Example
#define BLOCK_SIZE 4
void do_block(int n, int si, int sj, int sk, double *a, double *b, double *c)
{
int i, j, k;
for (i=si; i<si+BLOCK_SIZE; i++)
for (j=sj; j<sj+BLOCK_SIZE; j++) {
double cij = c[i*n+j];
for (k=sk; k<sk+BLOCK_SIZE; k++)
cij += a[i*n+k] * b[k*n+j];
c[i*n+j] = cij;
}
}
void dgemm_blocking(int n)
{
int i, j, k;
for(i=0; i<n; i+=BLOCK_SIZE)
for(j=0; j<n; j+=BLOCK_SIZE) {
/* c must be zero-initialized before the call; do_block accumulates into it */
for(k=0; k<n; k+=BLOCK_SIZE)
do_block(n, i, j, k, a, b, c);
}
}
SIMD Mechanism Example
void dgemm_intrin(int n)
{
int i,j,k;
/* assumes n is a multiple of 4, and a, b, c are 32-byte aligned (see ALIGN) */
for(i=0; i<n; i+=1)
{
for(j=0; j<n; j+=4)
{
__m256d c4 = _mm256_load_pd(&c[i * n+j]);
for(k=0; k<n; k++)
{
__m256d a4 = _mm256_broadcast_sd(&a[i*n+k]);
__m256d b4 = _mm256_load_pd(&b[k*n+j]);
c4 = _mm256_add_pd(c4, _mm256_mul_pd(a4, b4));
}
_mm256_store_pd(&c[i*n+j], c4);
}
}
}
Example Report from the past semester
After implementing the 3 given mechanisms, I tried many other things.
Things that sped up the runtime but are not detailed here because the boosts were minuscule:
- Reordering the variables of sequential operations
- Precomputing repeatedly used arithmetic by hand to relieve the CPU
- Loop tiling + loop jamming to reduce branch checks and thus stalls
- Using the load intrinsic directly within the function
- Deleting declarations of objects that were no longer needed
- Performing all loads first, before executing operations (fewer hazards, fewer stalls)
Things that I tried but that did not decrease the runtime, so I dropped them:
- Using block sizes greater than xx
- Unrolling more than the optimal amount to see the results
- Using the streaming-store intrinsic (_mm_stream_pd) to initialize array c with 0's
- Using memcmp or memset (which the compiler can vectorize with SSE4.1/SSE4.2) to zero out the array
- Manually precomputing some repetitive arithmetic (it actually slowed things down)
- Using a logical shift left by 10 instead of multiplying by 1024 (it slowed things down)
- Removing the pointer parameters (unstable and spiky, but best result 104 ms)
- Initializing the global array with c = { 0 } so it is zeroed from the beginning
Things that I wanted to try but could not because they were not released yet:
- The AVX-512 intrinsics, since they include functions that could perform twice as fast as the current ones
[Chart: Average runtime for 1024 x 1024 (ms), per variant: Blocking; Loop Unrolling + Blocking; Intrinsic + Blocking + Unrolling; the previous 3 + SLL multiply; the previous 3 + FMA intrinsic; + sequential locality; + temporal & spatial locality; + unroll do_block; + loop interchange & optimization.]