Summer School

#### s-Science with Many-core CPU/GPU Processors

## Lecture 3: Part 1: CUDA Threads

#### Block IDs and Thread IDs



## Matrix Multiplication Using **Multiple Blocks**

- Break-up Pd into tiles ٠
- Each block calculates one • tile
  - Each thread calculates one element
  - Block size equal tile size



bx

## A Small Example



## A Small Example: Multiplication



## Revised Matrix Multiplication Kernel using Multiple Blocks

```
_global___ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
```

```
// Calculate the row index of the Pd element and M
int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
// Calculate the column idenx of Pd and N
int Col = blockIdx.x*TILE WIDTH + threadIdx.x;
```

```
float Pvalue = 0;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k)
    Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
```

```
Pd[Row*Width+Col] = Pvalue;
```

© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, 2009 Revised Step 5: Kernel Invocation (Host-side Code)

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, 2009

## **CUDA Thread Block**

- All threads in a block execute the same kernel program (SPMD)
- Programmer declares block:
  - Block size 1 to **512** concurrent threads
  - Block shape 1D, 2D, or 3D
  - Block dimensions in threads
- Threads have thread id numbers within block
  - Thread program uses thread id to select work and address shared data
- Threads in the same block share data and synchronize while doing their share of the work
- Threads in different blocks cannot cooperate
  - Each block can execute in any order relative to other blocs!

#### CUDA Thread Block



Courtesy: John Nickolls, NVIDIA

## **Transparent Scalability**

- Hardware is free to assigns blocks to any processor at any time
  - A kernel scales across any number of parallel processors



## Example: Executing Thread Blocks

t0 t1 t2 ... tm

Blocks

t0 t1 t2 ... tm

**SM 0 SM 1** 

Threads are assigned to Streaming Multiprocessors in block granularity

- Up to 8 (?) blocks to each SM as resource allows
- Fermi SM can take up to 1536 threads
  - Could be 256 (threads/block) \* 6 blocks
  - Or 512 (threads/block) \* 3 blocks, etc.
- Threads run concurrently

Blocks

- SM maintains thread/block id #s
- SM manages/schedules thread execution

## **Example: Thread Scheduling**

- Each Block is executed as 32thread Warps
  - An implementation decision, not part of the CUDA programming model
  - Warps are scheduling units in SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in an SM?
  - Each Block is divided into 256/32 = 8 Warps
  - There are 8 \* 3 = 24 Warps





## Example: Thread Scheduling (Cont.)

- SM implements zero-overhead warp scheduling
  - At any time, 1 or 2 of the warps is executed by SM
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible Warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected



#### **Block Granularity Considerations**

- For Matrix Multiplication using multiple blocks, should I use 8X8, 16X16 or 32X32 blocks?
  - For 8X8, we have 64 threads per Block. Since each SM can take up to 1536 threads, there are 24 Blocks. However, each SM can only take up to 8 Blocks, only 512 threads will go into each SM!
  - For 16X16, we have 256 threads per Block. Since each SM can take up to 1536 threads, it can take up to 6 Blocks and achieve full capacity unless other resource considerations overrule.
  - For 32X32, we have 1024 threads per Block. Only one can fit into an SM! And, some capacity is wasted.

#### Some Additional API Features

## **Application Programming Interface**

- The API is an extension to the C programming language
- It consists of:
  - Language extensions
    - To target portions of the code for execution on the device
  - A runtime library split into:
    - A common component providing built-in vector types and a subset of the C runtime library in both host and device codes
    - A host component to control and access one or more devices from the host
    - A device component providing device-specific functions

## Language Extensions: Built-in Variables

- dim3 gridDim;
  - Dimensions of the grid in blocks (gridDim.z unused)
- dim3 blockDim;
  - Dimensions of the block in threads
- dim3 blockIdx;
  - Block index within the grid
- dim3 threadIdx;
  - Thread index within the block

## Common Runtime Component: Mathematical Functions

- pow, sqrt, cbrt, hypot
- exp, exp2, expm1
- log, log2, log10, log1p
- sin, cos, tan, asin, acos, atan, atan2
- sinh, cosh, tanh, asinh, acosh, atanh
- ceil, floor, trunc, round
- Etc.
  - When executed on the host, a given function uses the C runtime implementation if available
  - These functions are only supported for scalar types, not vector types

## Device Runtime Component: Mathematical Functions

 Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only version (e.g. sin(x))

## Host Runtime Component

- Provides functions to deal with:
  - Device management (including multi-device systems)
  - Memory management
  - Error handling
- Initializes the first time a runtime function is called
- A host thread can invoke device code on only one device
  - Multiple host threads required to run on multiple devices

Device Runtime Component: Synchronization Function

- void \_\_syncthreads();
- Synchronizes all threads in a block
- Once all threads have reached this point, execution resumes normally
- Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory
- Allowed in conditional constructs only if the conditional is uniform across the entire thread block

Summer School

#### s-Science with Many-core CPU/GPU Processors

## Lecture 3: Part 2: CUDA Memories

© David Kirk/NVIDIA and Wen-mei W. Hwu Braga, Portugal, June 14-18, 2010

21

#### Implementation of CUDA Memories

#### • Each thread can:

- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block
   shared memory
- Read/write per-grid global memory
- Read/only per-grid constant memory



## **CUDA Variable Type Qualifiers**

| Variable dec   | Memory                      | Scope    | Lifetime |             |
|----------------|-----------------------------|----------|----------|-------------|
|                | <pre>int LocalVar;</pre>    | local    | thread   | thread      |
| deviceshared   | <pre>int SharedVar;</pre>   | shared   | block    | block       |
| device         | <pre>int GlobalVar;</pre>   | global   | grid     | application |
| deviceconstant | <pre>int ConstantVar;</pre> | constant | grid     | application |

- <u>device</u> is optional when used with <u>shared</u>, or <u>constant</u>
- Automatic variables without any qualifier reside in a register
  - Except arrays that reside in local memory
- 23 © David Kirk/NVIDIA and Wen-mei W. Hwu Braga, Portugal, June 14-18, 2010



# A Common Programming Strategy

- Global memory resides in device memory (DRAM)
   much slower access than shared memory
- So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
  - Partition data into subsets that fit into shared memory
  - Handle each data subset with one thread block by:
    - Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    - Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
    - Copying results from shared memory to global memory
- 25 © David Kirk/NVIDIA and Wen-mei W. Hwu Braga, Portugal, June 14-18, 2010

## A Common Programming Strategy (Cont.)

- Constant memory also resides in device memory (DRAM) - much slower access than shared memory
  - But... cached!
  - Highly efficient access for read-only data
- Carefully divide data according to access patterns
  - − R/Only  $\rightarrow$  constant memory (very fast if in cache)
  - R/W shared within Block  $\rightarrow$  shared memory (very fast)
  - R/W within each thread  $\rightarrow$  registers (very fast)
  - R/W inputs/results → global memory (very slow)

For texture memory usage, see courses.ece.uiuc.edu/ece498/al.

## **GPU Atomic Integer Operations**

- Atomic operations on integers in global memory:
  - Associative operations on signed/unsigned ints
  - add, sub, min, max, ...
  - and, or, xor
  - Increment, decrement
  - Exchange, compare and swap
- Requires hardware with compute capability 1.1

## Matrix Multiplication using Shared Memory

## Revised Matrix Multiplication Kernel using Multiple Blocks

```
_global___ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
```

```
// Calculate the row index of the Pd element and M
int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
// Calculate the column idenx of Pd and N
int Col = blockIdx.x*TILE WIDTH + threadIdx.x;
```

```
Pd[Row][Col] = Pvalue;
```

#### How about performance on G80?

- All threads access global memory for their input matrix elements
  - Two memory accesses (8 bytes) per floating point multiply add
  - 4B/s of memory bandwidth FLOPS
  - 4\*346.5 = 1386 GB/s required to achieve peak FLOP rating
  - 86.4 GB/s limits the code at 21.6 GFLOPS
- The actual code runs at about 15 GFLOPS
- Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS





# Idea: Use Shared Memory to reuse global memory data

- Each input element is read by WIDTH threads.
- Load each element into Shared Memory and have several threads use the local version <sup>M</sup> reduce the memory bandwidth
  - Tiled algorithms



## **Tiled Multiply**

 Break up the execution of the kernel into phases so that the data accesses in each phase is focused on one subset (tile) of Md and Nd





E WIDT

bv

32





#### Every M and N Element is used exactly twice in generating a 2X2 tile of P $\mathsf{P}_{0,0}$ P<sub>1,1</sub> P<sub>1,0</sub> P<sub>0.1</sub> thread<sub>0,0</sub> thread<sub>1.0</sub> thread<sub>0.1</sub> thread<sub>11</sub> $M_{0.0} * N_{0,0}$ M<sub>0.0</sub> \* M<sub>0.1</sub> \* $M_{0.1} * N_{0.0}$ N<sub>0,1</sub> M<sub>1.1</sub> \* N<sub>0,1</sub> Access M<sub>1.1</sub> \* N<sub>1.1</sub> order M<sub>2.0</sub> \* N<sub>0,2</sub> M<sub>2,0</sub> M<sub>2,1</sub> \* N<sub>0,2</sub> M<sub>2,1</sub> \* N<sub>1,2</sub> \* N<sub>1.2</sub> M<sub>3,1</sub> \* N<sub>0,3</sub> M<sub>3,1</sub> \* N<sub>1.3</sub> M<sub>3.0</sub> \* N<sub>0,3</sub> $M_{3,0} * N_{1,3}$

#### Breaking Md and Nd into Tiles



# Each phase uses one tile from Md and one from Nd

|                  | Phase 1                                      |                                                                   |                                                                                                                | Phase 2                                      |                                              |                                                                                                                |  |  |
|------------------|----------------------------------------------|-------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|----------------------------------------------|----------------------------------------------|----------------------------------------------------------------------------------------------------------------|--|--|
| T <sub>0,0</sub> | Md <sub>0,0</sub>                            | Nd <sub>0,0</sub>                                                 | PValue <sub>0,0</sub> +=                                                                                       | Md <sub>2,0</sub>                            | Nd <sub>0,2</sub>                            | PValue <sub>0,0</sub> +=                                                                                       |  |  |
|                  | ↓                                            | ↓                                                                 | Mds <sub>0,0</sub> *Nds <sub>0,0</sub> +                                                                       | ↓                                            | ↓                                            | Mds <sub>0,0</sub> *Nds <sub>0,0</sub> +                                                                       |  |  |
|                  | Mds <sub>0,0</sub>                           | Nds <sub>0,0</sub>                                                | Mds <sub>1,0</sub> *Nds <sub>0,1</sub>                                                                         | Mds <sub>0,0</sub>                           | Nds <sub>0,0</sub>                           | Mds <sub>1,0</sub> *Nds <sub>0,1</sub>                                                                         |  |  |
| T <sub>1,0</sub> | Md <sub>1,0</sub><br>↓<br>Mds <sub>1,0</sub> | $\begin{matrix} Nd_{1,0} \\ \downarrow \\ Nds_{1,0} \end{matrix}$ | PValue <sub>1,0</sub> +=<br>Mds <sub>0,0</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,0</sub> *Nds <sub>1,1</sub> | Md <sub>3,0</sub><br>↓<br>Mds <sub>1,0</sub> | Nd <sub>1,2</sub><br>↓<br>Nds <sub>1,0</sub> | PValue <sub>1,0</sub> +=<br>Mds <sub>0,0</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,0</sub> *Nds <sub>1,1</sub> |  |  |
| T <sub>0,1</sub> | Md <sub>0,1</sub>                            | Nd <sub>0,1</sub>                                                 | PdValue <sub>0,1</sub> +=                                                                                      | Md <sub>2,1</sub>                            | Nd <sub>0,3</sub>                            | PdValue <sub>0,1</sub> +=                                                                                      |  |  |
|                  | ↓                                            | ↓                                                                 | Mds <sub>1,1</sub> *Nds <sub>0,0</sub> +                                                                       | ↓                                            | ↓                                            | Mds <sub>0,1</sub> *Nds <sub>0,0</sub> +                                                                       |  |  |
|                  | Mds <sub>0,1</sub>                           | Nds <sub>0,1</sub>                                                | Mds <sub>1,1</sub> *Nds <sub>0,1</sub>                                                                         | Mds <sub>0,1</sub>                           | Nds <sub>0,1</sub>                           | Mds <sub>1,1</sub> *Nds <sub>0,1</sub>                                                                         |  |  |
| T <sub>1,1</sub> | Md <sub>1,1</sub>                            | Nd <sub>1,1</sub>                                                 | PdValue <sub>111</sub> +=                                                                                      | Md <sub>3,1</sub>                            | Nd <sub>1,3</sub>                            | PdValue <sub>1,1</sub> +=                                                                                      |  |  |
|                  | ↓                                            | ↓                                                                 | Mds <sub>0,1</sub> *Nds <sub>1,0</sub> +                                                                       | ↓                                            | ↓                                            | Mds <sub>0,1</sub> *Nds <sub>1,0</sub> +                                                                       |  |  |
|                  | Mds <sub>1,1</sub>                           | Nds <sub>1,1</sub>                                                | Mds <sub>1,1</sub> *Nds <sub>1,1</sub>                                                                         | Mds <sub>1,1</sub>                           | Nds <sub>1,1</sub>                           | Mds <sub>1,1</sub> *Nds <sub>1,1</sub>                                                                         |  |  |
| time             |                                              |                                                                   |                                                                                                                |                                              |                                              |                                                                                                                |  |  |

#### **First-order Size Considerations**

- Each thread block should have many threads
   TILE\_WIDTH of 16 gives 16\*16 = 256 threads
- There should be many thread blocks
   A 1024\*1024 Pd gives 64\*64 = 4096 Thread Blocks
- Each thread block perform 2\*256 = 512 float loads from global memory for 256 \* (2\*16) = 8,192 mul/add operations.
  - Memory bandwidth no longer a limiting factor

## CUDA Code – Kernel Execution Configuration

```
// Setup the execution configuration
```

dim3 dimBlock (TILE\_WIDTH, TILE\_WIDTH);

```
dim3 dimGrid (Width / TILE_WIDTH,
```

Width / TILE\_WIDTH);

## **Tiled Matrix Multiplication Kernel**

```
global void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
     shared float Mds[TILE WIDTH][TILE WIDTH];
     shared float Nds[TILE WIDTH][TILE WIDTH];
    int bx = blockIdx.x; int by = blockIdx.y;
3.
    int tx = threadIdx.x; int ty = threadIdx.v;
  Identify the row and column of the Pd element to work on
11
  int Row = by * TILE WIDTH + ty;
    int Col = bx * TILE WIDTH + tx;
    float Pvalue = 0;
7.
// Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE WIDTH; ++m) {</pre>
8.
// Coolaborative loading of Md and Nd tiles into shared memory
       Mds[tx][ty] = Md[(m*TILE WIDTH + tx)*Width+Row];
9.
10.
       Nds[tx][ty] = Nd[Col*Width+(m*TILE WIDTH + ty)];
11.
      syncthreads();
12.
      for (int k = 0; k < TILE WIDTH; ++k)
13.
    Pvalue += Mds[tx][k] * Nds[k][ty];
        synchthreads();
14.
15.}
16.
      Pd[Row*Width+Col] = Pvalue;
    © David Kirk/NVIDIA and Wen-mei W. Hwu
39
```

Braga, Portugal, June 14-18, 2010

## **Tiled Multiply**

bx 2 012 TILE WIDTH-1 m bx k



 Each thread computes one element of Pd<sub>sub</sub>



#### Shared Memory and Threading

- Each SM in Fermi has 16KB or 48KB shared memory\*
  - SM size is implementation dependent!
  - For TILE\_WIDTH = 16, each thread block uses 2\*256\*4B = 2KB of shared memory.
  - Can potentially have up to 8 Thread Blocks actively executing
    - This allows up to 8\*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
  - The next TILE\_WIDTH 32 would lead to 2\*32\*32\*4B= 8KB shared memory usage per thread block, allowing 2 or 6 thread blocks active at the same time
- Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16
  - The 86.4B/s bandwidth can now support (86.4/4)\*16 = 347.6
     GFLOPS!

#### \*Configurable vs L1, total 64KB

## Tiling Size Effects (G80)



# Summary- Typical Structure of a CUDA Program

Global variables declaration

– \_\_host\_\_

- \_\_device\_\_... \_global\_\_, \_\_constant\_\_, \_\_texture\_\_
- Function prototypes
  - \_\_global\_\_ void kernelOne(...)
  - float handyFunction(...)
- Main ()
  - allocate memory space on the device cudaMalloc(&d\_GlbIVarPtr, bytes )

repea

neede

as

d

- transfer data from host to device cudaMemCpy(d\_GlblVarPtr, h\_Gl...)
- execution configuration setup
- kernel call kernelOne<<<execution configuration>>>( args... );
- transfer results from device to host cudaMemCpy(h\_GlbIVarPtr,...)
- optional: compare against golden (host computed) solution
- Kernel void kernelOne(type args,...)
  - variables declaration local, \_\_shared\_\_
    - automatic variables transparently assigned to registers or local memory
  - syncthreads()...
- Other functions
  - float handyFunction(int inVar...);
- 43 © David Kirk/NVIDIA and Wen-mei W. Hwu Braga, Portugal, June 14-18, 2010