Efficient CUDA Debugging: Memory Initialization and Thread Synchronization with NVIDIA Compute Sanitizer

NVIDIA Compute Sanitizer is a powerful tool that can save you time and effort while improving the reliability and performance of your CUDA applications. 

In our previous post, Efficient CUDA Debugging: How to Hunt Bugs with NVIDIA Compute Sanitizer, we explored efficient debugging in the realm of parallel programming. We discussed how debugging code in the CUDA environment can be both challenging and time-consuming, especially when dealing with thousands of threads, and how Compute Sanitizer can help with this process.

This post continues our exploration of efficient CUDA debugging. It highlights a few more Compute Sanitizer tools and walks through several examples. 

NVIDIA Compute Sanitizer

Compute Sanitizer is a suite of tools that can perform different types of checks on the functional correctness of your code. There are four main tools:

  • memcheck: Memory access error and leak detection
  • racecheck: Shared memory data access hazard detection
  • initcheck: Uninitialized device global memory access detection
  • synccheck: Thread synchronization hazard detection

This post focuses on debugging code and catching bugs related to uninitialized device arrays using initcheck, and synchronization using synccheck. For more information about using memcheck for discovering memory leaks and racecheck for finding race conditions, see Efficient CUDA Debugging: How to Hunt Bugs with NVIDIA Compute Sanitizer.

Initialization checking

initcheck helps you identify and resolve uninitialized memory access errors in CUDA code. Uninitialized memory access can lead to unpredictable behavior and incorrect results in CUDA applications. 

initcheck can detect uninitialized memory access to global memory in device code. It provides detailed information about the location and timing of the access, as well as the stack trace of the accessing thread. This helps to reveal the root cause of the issue and resolve the problem.

As an example, the following code benefits from initialization checking:

#include <stdio.h>

#define THREADS 32
#define BLOCKS 2

__global__ void addToVector(float *v) {
  int tx = threadIdx.x + blockDim.x * blockIdx.x;
  v[tx] += tx;
}

int main(int argc, char **argv) {
  float *d_vec = NULL;
  float *h_vec = NULL;

  h_vec = (float *)malloc(BLOCKS*THREADS * sizeof(float));
  cudaMalloc((void**)&d_vec, sizeof(float) * BLOCKS * THREADS);
  cudaMemset(d_vec, 0, BLOCKS * THREADS); // Zero the array

  addToVector<<<BLOCKS, THREADS>>>(d_vec);
  cudaMemcpy(h_vec, d_vec, BLOCKS*THREADS * sizeof(float), cudaMemcpyDeviceToHost);
  cudaDeviceSynchronize();
  printf("After : Vector 0, 1 .. N-1: %f %f .. %f\n", h_vec[0], h_vec[1], h_vec[BLOCKS*THREADS-1]);

  cudaFree(d_vec);
  free(h_vec);
  exit(0);
}

This code contains a CUDA kernel called addToVector that performs a simple add of a value to each element in a vector, with the results written back to the same element. At first glance, it looks fine: allocate the vector on the device with cudaMalloc, then zero it with cudaMemset, then perform calculations in the kernel. It even prints out the correct answer:

$ nvcc -lineinfo initcheck_example.cu -o initcheck_example
$ ./initcheck_example
After : Vector 0, 1 .. N-1: 0.000000 1.000000 .. 63.000000

But the code contains a small mistake. (Twenty points if you can spot it.)

Use initcheck to check whether any of the accesses to the vector in global memory on the device are trying to read uninitialized values.

$ compute-sanitizer --tool initcheck ./initcheck_example
========= COMPUTE-SANITIZER
========= Uninitialized __global__ memory read of size 4 bytes
=========     at 0x70 in /home/pgraham/Code/BlogExamples/initcheck_example.cu:8:addToVector(float *)
=========     by thread (16,0,0) in block (0,0,0)

. . .

========= Uninitialized __global__ memory read of size 4 bytes
=========     at 0x70 in /home/pgraham/Code/BlogExamples/initcheck_example.cu:8:addToVector(float *)
=========     by thread (17,0,0) in block (0,0,0)
. . . 
=========
After : Vector 0, 1 .. N-1: 0.000000 1.000000 .. 63.000000
========= ERROR SUMMARY: 48 errors

This prints a lot of information (the output shown here has been edited for brevity), and it confirms that something is not right. Much of the output is backtrace information, which can be hidden using the --show-backtrace no option:

$ compute-sanitizer --tool initcheck --show-backtrace no ./initcheck_example

Looking at the output, you can see 48 errors in total. The report shows that they are all of the type Uninitialized __global__ memory read of size 4 bytes.

Each message refers to an attempt to read something from global device memory, and that something had a size of 4 bytes. A reasonable guess would be that the errors refer to attempts to access elements of the vector, made up of floats that are 4 bytes each.

Looking at the first error, the next part of the message indicates which thread, and which thread block, caused the error. In this case, it was thread 16 in block 0. As the kernel is set up so that each thread accesses a different element of the vector, element 17 of the vector, d_vec[16], was uninitialized.

In your output, you may see a different thread as the first one causing an error. The GPU can schedule warps (groups of 32 threads) in whatever order it sees fit. But check through the rest of the output, and convince yourself that the lowest element in the vector causing an error was element 17 (thread 16 from block 0).

Next, look at the line of code that initialized (or should have initialized) the array:

cudaMemset(d_vec, 0, BLOCKS * THREADS); // Zero the array

Checking the definition of cudaMemset, it takes three arguments:

  • The pointer to the device memory you want to set (d_vec in this case)
  • The value to which each byte in that memory region should be set (0 in this case)
  • The number of bytes to set (BLOCKS * THREADS)

Now the problem begins to become more clear. The vector contains 64 elements determined by BLOCKS * THREADS, but each element is a float, so the entire vector is 256 bytes long. cudaMemset was initializing only the first 64 bytes (the first 16 elements), which means the remaining 192 bytes (equivalent to 48 elements) are uninitialized. These 48 elements correspond to the 48 errors.

This ties in with the observation that element 17 (thread 16, block 0) was the first to cause an error. Bingo, problem found. 

To fix the problem, change the cudaMemset call: 

cudaMemset(d_vec, 0, sizeof(float) * BLOCKS * THREADS);

And check to make sure that the sanitizer is happy.
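
For example, rebuild and rerun the same check; with the corrected cudaMemset, the error summary should now report zero errors:

$ nvcc -lineinfo initcheck_example.cu -o initcheck_example
$ compute-sanitizer --tool initcheck ./initcheck_example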

Checking unused memory

Another feature of the initcheck tool is identifying allocated device memory that hasn’t been accessed by the end of the application. In some programs, this may be deliberate: a large static buffer sized to handle a range of potential problem sizes, for example. But when unused memory is more likely an error causing a bug, initcheck can find it, as shown in the following code example:

#include <stdio.h>

#define N 10

__global__ void initArray(float* array, float value) {
  int threadGlobalID = threadIdx.x + blockIdx.x * blockDim.x;
  if (threadGlobalID < N)
    array[threadGlobalID] = value;
  return;
}

int main() {
  float* array;

  const int numThreadsPerBlock = 4;
  const int numBlocks = 2;

  cudaMalloc((void**)&array, sizeof(float) * N);

  initArray<<<numBlocks, numThreadsPerBlock>>>(array, 3.0);
  cudaDeviceSynchronize();

  cudaFree(array);
  exit(0);
}

This basic code example contains the potential error. It initializes an array, but the number of threads and the number of blocks are hard-coded. The execution configuration <<< … >>> launches numBlocks * numThreadsPerBlock = 2 x 4 = 8 threads, while the dataset has 10 elements, so the last two elements go unused.

Check this using the track-unused-memory option. The required syntax depends on the CUDA version in use. For versions before 12.3, supply the argument yes:

--track-unused-memory yes

Beginning with version 12.3, you don’t have to supply an argument:

$ nvcc -o unused -lineinfo unused.cu
$ compute-sanitizer --tool initcheck --track-unused-memory ./unused
========= COMPUTE-SANITIZER
=========  Unused memory in allocation 0x7fe0a7200000 of size 40 bytes
=========     Not written 8 bytes at offset 0x20 (0x7fe0a7200020)
=========     20% of allocation were unused.
=========
========= ERROR SUMMARY: 1 error

Clearly, track-unused-memory indicates that the array of 40 bytes (10 x 4-byte floats) includes 8 bytes that were not written to. Use the array address (the first long 0x… number) and the offset (0x20, which is 32 in decimal, so 32 bytes or 8 floats along) to see which bytes were unused. As expected, floats 9 and 10 in the array were not used.

To fix this, use N to define numBlocks:

const int numBlocks = (N + numThreadsPerBlock - 1) / numThreadsPerBlock;
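
For N = 10 and numThreadsPerBlock = 4, this rounds up to three blocks (12 threads), and the threadGlobalID < N guard in the kernel stops the extra threads from writing past the end of the array.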

The --track-unused-memory option is designed to work for device memory assigned with cudaMalloc. The feature doesn’t work for unified memory (cudaMallocManaged allocated memory, for example).

Synchronization checking

The capability to synchronize threads at a variety of levels beyond just block and warp is a powerful CUDA feature, enabled by the Cooperative Groups programming model. Cooperative Groups has device code APIs for defining, partitioning, and synchronizing groups of threads, giving much more flexibility and control compared to the standard __syncthreads function, which synchronizes all the threads in a block. For more information, see Cooperative Groups: Flexible CUDA Thread Programming.

Yet, this capability comes with greater opportunities to introduce bugs. This is where synccheck can help to identify and resolve synchronization errors in CUDA code. synccheck can identify whether a CUDA application is correctly using synchronization primitives and their Cooperative Groups API counterparts.

One interesting use of synchronization is applying a mask to a warp of threads: set up the warp so that some threads are true and others are false, enabling each thread to individually perform different operations depending on that property. For more information, see Using CUDA Warp-Level Primitives.

A useful function to help with this is __ballot_sync, defined as follows:

unsigned int __ballot_sync(unsigned int mask, int predicate);
  • mask is an initial mask, typically created with all bits set to 1, representing that all threads in the warp are initially active.
  • predicate is a condition evaluated by each thread, where predicate evaluates to either true (non-zero) or false (zero) for each thread.

The ballot function evaluates the predicate for each thread in the warp and returns a mask whose Nth bit is set when the predicate is true (non-zero) for the Nth thread. It also provides a synchronization point: all threads in the warp must reach this __ballot_sync before any of them can proceed further.

For example, set up a mask where even threads in the warp are true and odd threads are false:

__ballot_sync(0xffffffff, threadID % 2 == 0);

The initial mask 0xffffffff is a hexadecimal representation of 11111111111111111111111111111111 in binary. This ensures that all 32 threads are involved in the ballot.

The outcome of the ballot is the mask 0x55555555, which in binary is 01010101010101010101010101010101. Bit N corresponds to thread N, so the even threads (thread ID 0, 2, 4, …) are set to true and the odd threads are set to false.

The ballot is often used in conjunction with __syncwarp, which can synchronize threads in a warp based on the mask provided.
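
For reference, __syncwarp takes the mask of participating threads as its only argument, defaulting to all threads in the warp:

void __syncwarp(unsigned mask = 0xffffffff);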

The following example uses both __ballot_sync and __syncwarp:

#include <stdio.h>

static constexpr int NumThreads = 32;

__shared__ int smem[NumThreads];

__global__ void sumValues(int *sum_out) {
    int threadID = threadIdx.x;

    unsigned int mask = __ballot_sync(0xffffffff, threadID < (NumThreads / 2));

    if (threadID <= (NumThreads / 2)) {
        smem[threadID] = threadID;

        __syncwarp(mask);

        if (threadID == 0) {
          *sum_out = 0;
          for (int i = 0; i < (NumThreads / 2); ++i)
            *sum_out += smem[i];
        }
    }

    __syncthreads();
}

int main(){
    int *sum_out = nullptr;

    cudaMallocManaged((void**)&sum_out, sizeof(int));

    sumValues<<<1, NumThreads>>>(sum_out);
    cudaDeviceSynchronize();
    
    printf("Sum out = %d\n", *sum_out);
    cudaFree(sum_out);
    return 0;
}

Before reading further, take a look at the code and try and work out what it is doing given your understanding of the ballot and syncwarp functionality. See if you can spot what’s wrong. (Fifty points for this one—it’s more challenging.)

The purpose of this code is for each thread to assign a single value to shared memory, and then sum up all the values to get one answer. However, this is applied to only half the available threads. A single warp of 32 threads is set up through the execution configuration <<<1, NumThreads>>> to execute the kernel sumValues.

In that kernel, a mask is created using __ballot_sync with threadID < (NumThreads / 2) as the predicate, which evaluates to true for the first half of the warp, where threadID < 16 (threads 0, 1, …, 15).

Those 16 threads each assign a value (threadID) to shared memory, then perform a __syncwarp(mask) synchronization to ensure that they have all written to shared memory. Thread 0 then updates sum_out with the sum of those values.

Next, try compiling and running the code:

$ nvcc -o ballot_example -lineinfo ballot_example.cu
$ ./ballot_example
Sum out = 0

The answer, zero, is not correct. It should be 120 (15 + 14 + 13 + … + 2 + 1 + 0).

Did you spot the mistake? The conditional section of code is guarded by if (threadID <= (NumThreads / 2)). This uses <= rather than < as the comparator, meaning that the first 17 threads (threadID 0 to 16) execute it.

What happens when the 17th thread (threadID 16) tries to call __syncwarp when it is not included as true in the mask? It causes the whole kernel to stop running, so the sum calculation is never reached. Hence, the output is zero.

All this fails silently, and only the incorrect output indicates a problem. In more complicated code, this could be a nightmare to track down.

Using synccheck provides the following:

$ compute-sanitizer --tool synccheck --show-backtrace no ./ballot_example
========= COMPUTE-SANITIZER
========= Barrier error detected. Invalid arguments
=========     at 0x220 in /home/pgraham/Code/devblog/NCS_Part2/ballot_example.cu:32:sumValues(int *)
=========     by thread (0,0,0) in block (0,0,0)
=========

. . .

========= Barrier error detected. Invalid arguments
=========     at 0x220 in /home/pgraham/Code/devblog/NCS_Part2/ballot_example.cu:32:sumValues(int *)
=========     by thread (16,0,0) in block (0,0,0)
=========
Sum out = 0
========= ERROR SUMMARY: 17 errors

Regarding these 17 “invalid arguments” errors, the synccheck documentation states that the invalid argument can occur if not all threads reaching __syncwarp declare themselves in the mask parameter.

In this case, the 17th thread, (16,0,0), is not active in the mask, so it shouldn’t call __syncwarp. This causes all the other threads calling __syncwarp to register an error as well: each thread calls __syncwarp individually, but because it is a collective operation, one invalid participant causes every other call to fail too, giving 17 errors in total.
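
With the errors identified, the fix mirrors the ballot predicate: use < instead of <= so that only the 16 threads named in the mask enter the guarded section:

if (threadID < (NumThreads / 2)) {

With that change, threads 16 through 31 skip the block entirely, every thread that reaches __syncwarp(mask) is declared in the mask, and the kernel should print the expected sum of 120.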

Conclusion

This post walked you through a few examples of how to debug code and catch bugs using the initcheck and synccheck features in NVIDIA Compute Sanitizer. To get started using Compute Sanitizer, download the CUDA Toolkit.

To learn more, visit /NVIDIA/compute-sanitizer-samples on GitHub, and read the Compute Sanitizer documentation. Join the conversation in the NVIDIA Developer Forum dedicated to sanitizer tools.

Good luck on your bug hunt!
