

Assignment 4 (70 Points)

Due Friday October 11 @ 11:59PM

In this assignment, we will improve the parallel solutions of PageRank from Assignment 3. You will implement different task decomposition and mapping strategies, and observe their effect. You can use the same serial versions of the pull-based and push-based PageRank from Assignment 3 to check the correctness of your results. You should use your parallel solutions "page_rank_pull_parallel.cpp" and "page_rank_push_parallel_atomic.cpp" as the starting point of this assignment. You do not need to download a separate tarball for this assignment, since everything you need was provided in Assignment 3.

You should have completed the Slurm Tutorial (https://canvas.sfu.ca/courses/84236/pages/slurm-tutorial), which walks you through how to use our servers for your code testing. If you still have trouble with Slurm, please post questions on the Canvas discussion board. It is strongly recommended to use Slurm for this assignment instead of the CSIL machines, since the graphs you need for checking your programs are on the /scratch/ drive on the cluster's compute nodes.

General Instructions

1. You are provided with the serial versions and some other tools in the same tarball from Assignment 3, here (https://canvas.sfu.ca/courses/84236/files/24448335?wrap=1). There is a much larger file here (https://canvas.sfu.ca/courses/84236/files/24448340?wrap=1) that includes two input graphs as well as the serial versions. To run a program (e.g., page_rank_pull_parallel.cpp), follow the steps below:

   Run make page_rank_pull_parallel . This creates a binary file called page_rank_pull_parallel .

   Create a Slurm job to run the binary file using the following command:
   ./page_rank_pull_parallel --nThreads 4 --nIterations 10 --inputFile absolute_path_of_input_graph --strategy 1

   Detailed descriptions of the command-line arguments are provided below.

2. Please read the general instructions from Assignment 3 (https://canvas.sfu.ca/courses/84236/assignments/1006850) since they also apply to this assignment.


3. You will be asked to print the time spent by different threads on specific code regions. The time spent by any code region can be computed as follows:

    timer t1;
    t1.start();
    /* ---- Code region whose time is to be measured ---- */
    double time_taken = t1.stop();

4. If you need to time a sub-section inside a loop, you can do that as follows:

    double time_taken = 0.0;
    timer t1;
    while (true) {
        /* ---- Code region whose time should not be measured ---- */
        t1.start();
        /* ---- Code region whose time is to be measured ---- */
        time_taken += t1.stop();
        /* ---- Code region whose time should not be measured ---- */
    }
    std::cout << "Time spent on required code region : " << time_taken << "\n";

5. The programs operate on graph datasets. Sample input graphs are available at /scratch/input_graphs/ on the compute nodes (note that they are present on the compute nodes only, and hence you can access them via Slurm only).

    $ ls /scratch/input_graphs/*.cs*
    lj.csc  rmat.csc  roadNet-CA.csc  test_25M_50M.csc  web-Google.csc
    lj.csr  rmat.csr  roadNet-CA.csr  test_25M_50M.csr  web-Google.csr

Please read Assignment 3's general instructions if you want to generate more test graphs.

1. Add a "strategy" Command Line Parameter

For your parallel programs " page_rank_pull_parallel.cpp " and " page_rank_push_parallel_atomic.cpp "

from Assignment 3, you need to add a command line parameter "strategy" which is used to define the

program's task decomposition and mapping strategy. If the parameter is set to 1, i.e., --strategy 1 ,

your parallel program should run the same exact way as specified in Assignment 3. "strategy" values

correspond to the strategies described in the following sections.
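How you parse the new parameters depends on the command-line helper you used in Assignment 3; whatever you use, make sure the validation rules listed later under "Input and Output Formats" are enforced. As a hedged illustration only (the struct and function names are invented for this sketch), a plain argv scan for the two new parameters could look like this:

    #include <cstdlib>
    #include <iostream>
    #include <string>

    // Illustrative sketch: read --strategy and --granularity and validate them.
    // Defaults follow the spec below (strategy 1, granularity 1).
    struct StrategyArgs {
      int strategy = 1;
      long granularity = 1;
    };

    StrategyArgs parseStrategyArgs(int argc, char *argv[]) {
      StrategyArgs args;
      for (int i = 1; i < argc - 1; i++) {
        std::string flag = argv[i];
        if (flag == "--strategy") {
          args.strategy = std::atoi(argv[++i]);
        } else if (flag == "--granularity") {
          char *end = nullptr;
          args.granularity = std::strtol(argv[++i], &end, 10);
          // Reject non-integer, zero, or negative granularity values.
          if (end == nullptr || *end != '\0' || args.granularity <= 0) {
            std::cerr << "Invalid granularity\n";
            std::exit(1);
          }
        }
      }
      if (args.strategy < 1 || args.strategy > 4) {
        std::cerr << "Invalid strategy\n";
        std::exit(1);
      }
      return args;
    }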

Strategy 1: Vertex-based decomposition and static mapping, the same strategy as your Assignment 3 programs. Vertices are statically distributed among threads such that each thread performs all computations on n/T vertices.

2. Edge-based Decomposition for PageRank

Strategy 2: To achieve better load balance among threads, vertices are distributed among threads such that each thread performs computations on approximately m/T edges. Pull-based PageRank distributes vertices based on their in-degrees, while push-based PageRank distributes vertices based on their out-degrees. Both of your parallel programs need to support this strategy by setting the strategy parameter to 2 in the command line ( --strategy 2 ). For example, the pull-based algorithm's pseudo-code is:

    Create T threads
    Distribute work between threads so that each thread gets to compute on approximately m/T edges
    for (i = 0; i < max_iterations; i++) {
        for each thread in parallel {
            for each vertex 'v' allocated to the thread {
                for vertex 'u' in inNeighbor(v)
                    next_page_rank[v] += (current_page_rank[u] / outdegree[u])
            }
        }
        for each thread in parallel {
            for each vertex 'v' allocated to the thread {
                compute the new_pagerank using the accumulated values in next_page_rank[v]
                current_page_rank[v] = new_pagerank
                Reset next_page_rank[v] to 0
            }
        }
    }
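For reference, here is a hedged C++ sketch of how one thread might run the two phases above once the vertex range has been decided. The adjacency representation, the degree array, the C++20 std::barrier (standing in for whatever barrier the core/ utilities provide), and the damping constant are all assumptions made for this sketch, not the required implementation.

    #include <barrier>
    #include <vector>

    // Sketch: one thread's work for the pull-based PageRank above.
    // [start_v, end_v) is the vertex range assigned by the decomposition.
    void pullThread(int start_v, int end_v, int max_iterations,
                    const std::vector<std::vector<int>> &in_neighbors,
                    const std::vector<int> &out_degree,
                    std::vector<double> &curr_pr, std::vector<double> &next_pr,
                    std::barrier<> &bar) {
      const double damping = 0.85;  // typical PageRank damping factor (assumption)
      const int n = (int)curr_pr.size();
      for (int iter = 0; iter < max_iterations; iter++) {
        // Phase 1: pull contributions from in-neighbors of the owned vertices.
        for (int v = start_v; v < end_v; v++)
          for (int u : in_neighbors[v])
            next_pr[v] += curr_pr[u] / out_degree[u];
        bar.arrive_and_wait();  // barrier1: all reads of curr_pr are finished
        // Phase 2: fold the accumulated sums into the new PageRank and reset.
        for (int v = start_v; v < end_v; v++) {
          curr_pr[v] = (1.0 - damping) / n + damping * next_pr[v];
          next_pr[v] = 0.0;
        }
        bar.arrive_and_wait();  // barrier2: new curr_pr is visible to all threads
      }
    }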

As an example of how edge-based task decomposition works, consider a graph with the following vertex distribution <vertex, in-degree, out-degree>:

    v1, 8, 1
    v2, 5, 6
    v3, 8, 3
    v4, 2, 8
    v5, 3, 5
    v6, 1, 4
    v7, 5, 6
    v8, 4, 3

The graph has 36 edges. To distribute the work evenly among 4 threads, each thread needs to compute approximately 36/4 = 9 edges. Each thread gets assigned vertices until the total number of assigned edges is >= (thread_id + 1) * m/T. Examining the in-degrees, the pull-based algorithm has the following vertex distribution:

    Thread 0: v1, v2 (13 edges >= 1x9); Thread 1: v3 (8 edges, so 13+8 >= 2x9); Thread 2: v4, v5, v6 (6 edges, so 13+8+6 >= 3x9); Thread 3: v7, v8 (9 edges)

For the push-based algorithm, we can use the following vertex distribution based on out-degrees:

    Thread 0: v1, v2, v3 (10 edges >= 1x9); Thread 1: v4 (8 edges, so 10+8 >= 2x9); Thread 2: v5, v6 (9 edges, so 10+8+9 >= 3x9); Thread 3: v7, v8 (9 edges)

Note that the pull-based algorithm in the above example should have a similar runtime to the vertex-based decomposition, since the longest thread needs to compute 13 edges in both cases. However, the push-based algorithm has a shorter runtime, since its longest thread computes 10 edges, while the vertex-based decomposition requires 11 edge computations for the longest thread.
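A hedged sketch of how these vertex ranges could be computed before the threads are created, using the rule above that thread t keeps taking vertices until the cumulative degree reaches (t+1) * m/T. Here degree[] would hold in-degrees for the pull-based program and out-degrees for the push-based one; the function name and return convention are invented for the sketch.

    #include <vector>

    // Sketch: split vertices 0..n-1 into T contiguous ranges covering roughly
    // m/T edges each. Returns T+1 boundaries; thread t owns [bounds[t], bounds[t+1]).
    std::vector<int> edgeBasedPartition(const std::vector<long> &degree,
                                        long m, int T) {
      const int n = (int)degree.size();
      std::vector<int> bounds(T + 1, n);
      bounds[0] = 0;
      long assigned = 0;
      int v = 0;
      for (int t = 0; t < T; t++) {
        // Assign vertices until the cumulative edge count reaches (t+1)*m/T,
        // exactly as in the worked example above.
        long target = (long)(t + 1) * m / T;
        while (v < n && assigned < target) {
          assigned += degree[v];
          v++;
        }
        bounds[t + 1] = v;
      }
      bounds[T] = n;  // any leftover vertices go to the last thread
      return bounds;
    }

Running this on the example (in-degrees {8,5,8,2,3,1,5,8,4 truncated to the 8 values above}, m = 36, T = 4) reproduces the pull-based split shown above: {v1,v2}, {v3}, {v4,v5,v6}, {v7,v8}.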


3. Vertex-based Decomposition for PageRank with Dynamic Mapping

Strategy 3: To achieve even better load balance among threads, vertices are mapped to threads dynamically, based on the time at which different threads finish their tasks. Instead of allocating approximately equal numbers of vertices (or edges) to threads, we can dynamically allocate work to each thread whenever it is free. In this strategy, each thread dynamically gets the next vertex to be computed until all the vertices are processed. Both of your parallel programs need to support this strategy by setting the strategy parameter to 3 in the command line ( --strategy 3 ). Below is the pseudo-code showing dynamic task mapping for the push-based PageRank algorithm with vertex-based decomposition:

    Create T threads
    for each thread in parallel {
        for (i = 0; i < max_iterations; i++) {
            while (true) {
                u = getNextVertexToBeProcessed();
                if (u == -1) break;
                edges_processed += outDegree(u)  // used in output validation
                for vertex v in outNeighbor(u)
                    next_page_rank[v] += (current_page_rank[u] / outdegree[u])
            }
            barrier1
            while (true) {
                v = getNextVertexToBeProcessed();
                if (v == -1) break;
                vertices_processed += 1  // used in output validation
                compute the new_pagerank using the accumulated values in next_page_rank[v]
                current_page_rank[v] = new_pagerank
                Reset next_page_rank[v] to 0
            }
            barrier2
        }
    }

You need to implement the above dynamic mapping strategy in your solutions for both "page_rank_pull_parallel.cpp" and "page_rank_push_parallel_atomic.cpp".

4. Vertex-based Decomposition for PageRank with Coarse-Grained Dynamic Mapping

Strategy 4: To reduce the time each thread spends in getNextVertexToBeProcessed() , we will vary the task granularity so that each thread receives multiple vertices to be processed each time it calls getNextVertexToBeProcessed() . Both of your parallel programs need to support this strategy by setting the strategy parameter to 4 in the command line ( --strategy 4 ). You need to update the dynamic load distribution logic as follows:

Each thread processes k vertices and then calls getNextVertexToBeProcessed() . Here, k determines the granularity of the work done by each thread before requesting new work.


For example:

    If k = 1 , the thread calls getNextVertexToBeProcessed() after processing each vertex (exactly the same as strategy 3).
    If k = 1000 , the thread calls getNextVertexToBeProcessed() after processing 1000 vertices.

The getNextVertexToBeProcessed() function should return 0, k, 2k, ... depending on the granularity k. The granularity k should be provided at run time using a command-line parameter, e.g., --granularity 100 .

Below is the pseudo-code showing the logic of our parallel solution:

    k = 1000  // granularity
    Create T threads
    for each thread in parallel {
        for (i = 0; i < max_iterations; i++) {
            while (true) {
                u = getNextVertexToBeProcessed()
                if (u == -1) break;
                for (j = 0; j < k; j++) {
                    edges_processed += outDegree(u)  // used in output validation
                    for vertex v in outNeighbor(u)
                        next_page_rank[v] += (current_page_rank[u] / outdegree[u])
                    u++
                    if (u >= n) break;  // n is the total number of vertices in the graph
                }
            }
            barrier1
            while (true) {
                v = getNextVertexToBeProcessed()
                if (v == -1) break;
                for (j = 0; j < k; j++) {
                    vertices_processed += 1  // used in output validation
                    compute the new_pagerank using the accumulated values in next_page_rank[v]
                    current_page_rank[v] = new_pagerank
                    Reset next_page_rank[v] to 0
                    v++
                    if (v >= n) break;  // n is the total number of vertices in the graph
                }
            }
            barrier2
        }
    }

This strategy should be used with the command-line parameter --granularity to specify the granularity. You need to add support for --granularity in both of your parallel programs, "page_rank_pull_parallel.cpp" and "page_rank_push_parallel_atomic.cpp".

Input and Output Formats:

1. Your program files should be named page_rank_pull_parallel.cpp and page_rank_push_parallel_atomic.cpp and should support the following command-line parameters:

    --nThreads : The number of threads.
    --nIterations : The number of iterations (similar to the serial code). Default is 10.


    --inputFile : The absolute path to the input graph file (similar to the serial code).
    --strategy : A value of 1, 2, 3, or 4, corresponding to the strategies defined above. Any other value should lead to program termination. The default strategy is 1.
    --granularity : A positive integer to be used only in strategy 4. This parameter is ignored for other strategies. You need to check that the granularity is always a positive integer; zero, negative, and non-integer values lead to program termination. The default granularity is 1.

2. Your parallel solution must output the following information:

    Total number of threads used.
    Number of iterations.
    Strategy used.
    Granularity used.
    For each thread:
        Thread id (your threads should be numbered between [0, T)).
        Number of vertices processed (across all iterations). This is the number of vertices processed only in the second for loop, across all iterations. Refer to the pseudo-code of PageRank above.
        Number of edges processed (across all iterations). This is the number of edges processed only in the first for loop, across all iterations. Refer to the pseudo-code of PageRank above.
        Cumulative time spent waiting at barrier1 (in seconds).
        Cumulative time spent waiting at barrier2 (in seconds).
        Cumulative time spent waiting at getNextVertexToBeProcessed() (in seconds).
        Time taken by the thread (in seconds).
    The sum of the PageRank values of all vertices.
    The total time taken for the entire execution, including the time used in allocating vertices or edges to threads.

3. A sample console output is shown below. Please note that you should use exactly this format for all 4 strategies. If a value is not generated (e.g., the getNextVertex time in strategies 1 and 2), print 0.0 instead.

    Using DOUBLE
    Number of Threads : 4
    Strategy : 4
    Granularity : 1
    Iterations : 20
    Reading graph
    Created graph
    thread_id, num_vertices, num_edges, barrier1_time, barrier2_time, getNextVertex_time, total_time
    0, 4576646, 26069763, 0.001790, 0.001063, 4.067461, 10.366047
    1, 4625648, 23284082, 0.001947, 0.001063, 4.107634, 10.366017
    2, 4617670, 26393229, 0.001654, 0.001049, 4.029203, 10.365975
    3, 4508596, 26353706, 0.001668, 0.001059, 4.101155, 10.365950
    Sum of page ranks : 618961.562500
    Time taken (in seconds) : 10.366616
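Because the format is matched literally, it may help to print the per-thread line with a fixed format string. A hedged sketch (the function and variable names are placeholders) using printf-style formatting, which gives the six decimal places shown in the sample; iostream with std::fixed and std::setprecision(6) works equally well:

    #include <cstdio>

    // Sketch: print one per-thread statistics line in the format shown above.
    void printThreadStats(int thread_id, long num_vertices, long num_edges,
                          double barrier1_time, double barrier2_time,
                          double get_next_vertex_time, double total_time) {
      std::printf("%d, %ld, %ld, %lf, %lf, %lf, %lf\n",
                  thread_id, num_vertices, num_edges,
                  barrier1_time, barrier2_time,
                  get_next_vertex_time, total_time);
    }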

Testing

We are not providing a testing script for this assignment. However, you should strictly adhere to the format above, including all the spaces and commas. You can test your program using command lines similar to the test cases of Assignment 3, while adding the strategy parameter. We will not disclose the test cases used for grading before the assignment due date.

Assignment Report

In addition to your parallel code, you need to submit a report (in PDF format) that answers the following two questions:

Q1. Run each of your two parallel programs with strategy 4 above using 8 threads on the test_25M_50M data set with granularities of 10, 100, 1000, and 2000. Each configuration should run 3 times, each run using 20 iterations, on the slow cluster with the default floating-point data type. [Total number of runs is 2 (parallel versions) x 4 (granularities) x 3 (runs per version/granularity combination) = 24 runs.]

Plot a graph with the average execution time on the y-axis and the granularity on the x-axis, where each granularity has 2 bars, one for the average runtime of each of your parallel versions. Your graph should look something like the sample bar chart on the assignment page (obviously with different values).

Q2. Based on the data from the graph in Q1, which of the four granularities is best for the pull-based program, and which is best for the push-based atomic program?

Submission Guidelines

Make sure that your solutions folder has the following files and sub-folders. Let's say your solutions folder is called my_assignment4_solutions . It should contain:

    core/ -- The folder containing all core files. It is already available in the assignment package. Do not modify it or remove any files.
    Makefile -- The Makefile for the project. This file should not be changed.
    page_rank_pull_parallel.cpp


    page_rank_push_parallel_atomic.cpp
    report.pdf -- A PDF file that includes answers to the questions in the previous section.

To create the submission file, follow the steps below:

1. Enter your solutions folder and remove all the object/temporary files.

    $ cd my_assignment4_solutions/
    $ make clean
    $ rm -rf input_graphs      # or: mv input_graphs ../   (to avoid large tarballs)

2. Create the tar.gz file.

    $ tar cvzf assignment4.tar.gz *

This creates a compressed tarball that contains the contents of the folder. Submit it via Canvas by the deadline.


