Tutorial

This is a tutorial on how to get started with BONSAI. First, BONSAI requires:

  • Python 2.7+ (Python 3 is not supported)
  • GCC 4.4 (or similar) or later + make

There are three primary steps in an auto-tuning workflow,

  • Generating and pruning the Search Space
  • Tuning and selecting the proper kernel configuration
  • Analyzing the generated profiles to inform the next tuning run

Variables in the Makefile are defined in make.inc for easy customization.

There are two top level make targets, make space and make tune. The first generates the search space, and the second performs a tuning sweep over the search space with the provided kernel.

Working examples are provided in the examples folder, one CPU and one GPU, of varying complexity. The CPU example is simple with one iterator for tutorial purposes. The GPU example requires CUDA and the CUDA toolkit be installed. The GPU example is rather complex, showing how BONSAI can be incorporated into complex kernels.

Space Generation and Pruning

The first step in the autotuning workflow is to generate the search space for the parameters. (make space) This is done using a LANAI iterator file. For this example, we will use the following (admittedly toy) iterator.

from lanai import *

max_sz = 32
# -----------------
dim_m = range(1,max_sz)
dim_n = range(1,max_sz)

@iterator
def blk_m(dim_m):
    return range(dim_m, max_sz, dim_m)

@iterator
def blk_n(dim_n):
    return range(dim_n, max_sz, dim_n)
#------------------
@condition
def only_even(dim_m, dim_n):
    return( (dim_m % 2) == 1 or (dim_n % 2) == 1)

dim_m and dim_n can be viewed as the total size of a matrix panel, and blk_m / blk_n the size of the blocks within the panel. In LANAI terms, m and n are called ‘expression’ or ‘global’ iterators; blk_m and blk_n are called ‘deferred’ or ‘local’ iterators since they depend on a global value. We also have one condition for these iterators, in that we only want to consider even numbers for the sizes. An important note, conditions are written as what should fail, not what should be true. (see LANAI for more)

After specifying an iterator file (ITER in make.inc), BONSAI will take it and generate a C source file (by default sharing the same name as the iterator file) that defines the search space. BONSAI then will then compile the generator file using the specified options, and output to the specified CSV file.

Integrating with the Runtime

The runtime is expecting two routines, the kernel to be tested and a driver for the kernel that sets the parameters. The routines can be in the same file. As an example, look at the dgemm C code below (this is from examples/gemm_cpu).

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>
#include <time.h>

#ifndef FLOAT_TYPE
#define FLOAT_TYPE double
#endif

int main(){
  // Here, BLOCK_SIZE is an iterator that is defined within
  // the LANAI iterator file

  struct timeval start;
  struct timeval end;
  unsigned long diff;
  srand((unsigned) time(NULL));

  int i, j, k, i_b, j_b, k_b;
  int n = 1024;
  int blk = BLOCK_SIZE;

  FLOAT_TYPE* A = (FLOAT_TYPE*) malloc(sizeof(FLOAT_TYPE)*n*n);
  FLOAT_TYPE* B = (FLOAT_TYPE*) malloc(sizeof(FLOAT_TYPE)*n*n);
  FLOAT_TYPE* C = (FLOAT_TYPE*) malloc(sizeof(FLOAT_TYPE)*n*n);

  for (i=0; i < n*n; i++)
    A[i] = ((FLOAT_TYPE)rand() / RAND_MAX - 0.5)*2;
  for (i=0; i < n*n; i++)
    B[i] = ((FLOAT_TYPE)rand() / RAND_MAX - 0.5)*2;
  for (i=0; i < n*n; i++)
    C[i] = ((FLOAT_TYPE)rand() / RAND_MAX - 0.5)*2;

  gettimeofday(&start, NULL);

  for (i_b=0; i_b < n/blk; i_b++)
    for (j_b=0; j_b < n/blk; j_b++)
      for (k_b=0; k_b < n/blk; k_b++)
        for (i=i_b*blk; i < (i_b+1)*blk; i++)
          for (j=j_b*blk; j < (j_b+1)*blk; j++)
            for (k=k_b*blk; k < (k_b+1)*blk; k++)
              C[i + j*n] += A[i + k*n] * B[k + j*n];

  gettimeofday(&end, NULL);

  diff = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
  printf("Time: %ld us\n", diff);

  free(A);
  free(B);
  free(C);

  return 0;
}

With the corresponding iterator file

from lanai import *


block_size = range(2, 66)


@condition
def only_even(block_size):
  return block_size % 2 == 1

Note how we have a correspondence between the iterator block_size and the assignment int blk = BLOCK_SIZE. The runtime is expecting the iterator names to be used like #define macros in the kernel/driver files. The runtime will construct the correct compiler statement for each configuration using the -D flag to define the macro values.