Loop Perforation

Wednesday, November 25, 2009

x264 Outputs

x264 encodes video according to the H.264 standard.

We have made videos encoded with x264, both with and without perforation.

These videos are available through the CodePerforation playlist on YouTube.

bodytrack Outputs

bodytrack outputs a series of images representing the position of the body with the tracks believed position of the body overlaid. Here we show images for the last frame of the production sequence with and without perforation.

Reference output (no perforation):

Output with perforation and 15% distortion bound:

At the 15% QoS bound the perforated tracker runs 2.57x faster, while maintaining the track on the largest features of the body - the chest and the legs. The track of smaller parts of the body has been perturbed as a cost of the speedup.

Output with dynamic perforation in the case where 3/8 cores fail:

Here we show the output when dynamic perforation is employed to maintain performance in the case of core failures. The performance of the tracker is preserved despite the loss of 37.5% of the system's compute capacity. In additon the tracker is able to maintain a track on the chest, head, and legs despite this capacity loss.

Output with dynamic perforation in the case where core frequency drops to 60% of the original and then rises again:

For this experiment, the core frequency of the processor is dropped from 2.5 GHz to 1.6 GHz 1/4 of the way through the application. With 1/4 of the frames left to process, frequency is raised back to the original value. The runtime system dynamically perforates bodytrack to maintain performance during the period of low clock speed. When the clock frequency returns, the perforation is turned off and full QoS is restored. Perforation allows bodytrack to maintain the track despite the change in clock frequency.

streamcluster Perforated Loops

This benchmark implements an online approximation of the K-means clustering algorithm~\cite{hartigan1979k}. It partitions a set of points into clusters, with each cluster centered around one of the input points. The algorithm is designed to process sets of points from a space of features, with the clusters corresponding to entities with similar features.

The inner loop in pFL() estimates the cost of opening a new cluster center. Perforating this loop allows the application to make a less accurate estimate of this cost more quickly. The following comment appears above the definition of the loop bound (ITER) in the source code:

/* higher ITER --> more likely to get correct number of centers */
/* higher ITER also scales the running time almost linearly */

This comment reflects the fact that the original version of the loop computes an inexact approximation of the true cost, with the number of iterations controlling a performance vs. quality of service tradeoff. In effect, the perforation space exploration space algorithm rediscovers this tradeoff.

// This function is defined in streamcluster.cpp in the streamcluster source code
float pFL(Points *points, int *feasible, int numfeasible,
   float z, long *k, double cost, long iter, float e,
   int pid, pthread_barrier_t* barrier)
{
#ifdef ENABLE_THREADS
 pthread_barrier_wait(barrier);
#endif
 long i;
 long x;
 double change;
 long numberOfPoints;

 change = cost;
 /* continue until we run iter iterations without improvement */
 /* stop instead if improvement is less than e */
 while (change/cost > 1.0*e) {
   change = 0.0;
   numberOfPoints = points->num;
   /* randomize order in which centers are considered */

   if( pid == 0 ) {
     intshuffle(feasible, numfeasible);
   }
#ifdef ENABLE_THREADS
   pthread_barrier_wait(barrier);
#endif
    // This loop is perforated, changing i++ to i+=2
   for (i=0;i<iter;i++) {
     x = i%numfeasible;
     change += pgain(feasible[x], points, z, k, pid, barrier);
   }

   cost -= change;
#ifdef PRINTINFO
   if( pid == 0 ) {
     fprintf(stderr, "%d centers, cost %lf, total distance %lf\n",
       *k, cost, cost - z*(*k));
   }
#endif
#ifdef ENABLE_THREADS
   pthread_barrier_wait(barrier);
#endif
 }
 return(cost);
}

Our examination of the source code confirms that perforating this loop
(as the comment suggests) leaves the application able to process all inputs.

blackscholes Perforated Loops

The experimental results show that it is possible to perforate the outer loop in main() without changing the output at all. Further investigation reveals that this loop is superfluous --- it simply repeats the same computation multiple times and was apparently added to the benchmark to increase the workload.

// This function is the main function in the threaded version of blackscholes.  
// It is found in blackscholes.cpp in the blackscholes source code
int bs_thread(void *tid_ptr) {
   int i, j, k;
   fptype price;
   fptype priceDelta;
   int tid = *(int *)tid_ptr;
   int start = tid * (numOptions / nThreads);
   int end = start + (numOptions / nThreads);

#ifdef ENABLE_THREADS
   BARRIER(barrier);
#endif

    // This loop is perforated, changing j++ to j+=2
   for (j=0; j<NUM_RUNS; j++) {
       for (i=start; i<end; i++) {
           /* Calling main function to calculate option value based on
            * Black & Sholes's equation.
            */

   if(i%25000 == 0)
     heartbeat(&heart);
  
   price = BlkSchlsEqEuroNoDiv( sptprice[i], strike[i],
              rate[i], volatility[i], otime[i],
              otype[i], 0);

   prices[i] = price;
          
#ifdef ERR_CHK  
           priceDelta = data[i].DGrefval - price;
           if( fabs(priceDelta) >= 1e-5 ){
               printf("Error on %d. Computed=%.5f, Ref=%.5f, Delta=%.5f\n",
                      i, price, data[i].DGrefval, priceDelta);
               numError ++;
           }
#endif
       }
   }

   return 0;
}


int main (int argc, char **argv)
{
#ifdef PARSEC_VERSION
#define __PARSEC_STRING(x) #x
#define __PARSEC_XSTRING(x) __PARSEC_STRING(x)
       printf("PARSEC Benchmark Suite Version "__PARSEC_XSTRING(PARSEC_VERSION)"\n");
 fflush(NULL);
#else
       printf("PARSEC Benchmark Suite\n");
 fflush(NULL);
#endif //PARSEC_VERSION
#ifdef ENABLE_PARSEC_HOOKS
  __parsec_bench_begin(__parsec_blackscholes);
#endif

  if (argc != 3)
       {
               printf("Usage:\n\t%s <nthreads> <numOptions>\n", argv[0]);
               exit(1);
       }
   nThreads = atoi(argv[1]);
#ifndef ENABLE_THREADS  
   if(nThreads != 1) {
       printf("Error: <nthreads> must be 1 (serial version)\n");
       exit(1);
   }
#endif

   int i, j;
   init_heartbeat(&heart, 0, 100, 20, "heartbeat.log");
   numOptions = atoi(argv[2]);
   fptype * buffer;
   int * buffer2;
   // alloc spaces for the option data
   data = new OptionData[numOptions];
   prices = new fptype[numOptions];
   FILE* output=fopen("prices.txt", "w");

   // initialize the data array
   int initOptionNum =  ( (sizeof(data_init)) / sizeof(OptionData) );
   for ( int loopnum = 0; loopnum < numOptions; ++ loopnum )  
   {
 OptionData *temp = data_init + loopnum%initOptionNum;
 data[loopnum].OptionType = new char[strlen(temp->OptionType)];
       strcpy(data[loopnum].OptionType, temp->OptionType);
 data[loopnum].s = temp->s;
 data[loopnum].strike = temp->strike;
 data[loopnum].r = temp->r;
 data[loopnum].divq = temp->divq;
 data[loopnum].v = temp->v;
 data[loopnum].t = temp->t;
 data[loopnum].divs = temp->divs;
 data[loopnum].DGrefval = temp->DGrefval;
   }
#ifdef ENABLE_THREADS
   MAIN_INITENV(,8000000,nThreads);
   BARINIT(barrier);
#endif
   printf("Num of Options: %d\n", numOptions);
   printf("Num of Runs: %d\n", NUM_RUNS);

#define PAD 256
#define LINESIZE 64
 
   buffer = (fptype *) malloc(5 * numOptions * sizeof(fptype) + PAD);
   sptprice = (fptype *) (((unsigned long long)buffer + PAD) & ~(LINESIZE - 1));
   strike = sptprice + numOptions;
   rate = strike + numOptions;
   volatility = rate + numOptions;
   otime = volatility + numOptions;
  
   buffer2 = (int *) malloc(numOptions * sizeof(fptype) + PAD);
   otype = (int *) (((unsigned long long)buffer2 + PAD) & ~(LINESIZE - 1));
  
   for (i=0; i<numOptions; i++) {
       otype[i]      = (data[i].OptionType == "P") ? 1 : 0;
 //  printf("Option %d, type = %s, otype = %d\n", i, data[i].OptionType, otype[i]);
       sptprice[i]   = data[i].s;
       strike[i]     = data[i].strike;
       rate[i]       = data[i].r;
       volatility[i] = data[i].v;   
       otime[i]      = data[i].t;
   }
  
   printf("Size of data: %d\n", sizeof(data) + sizeof(otype));

#ifdef ENABLE_PARSEC_HOOKS
   __parsec_roi_begin();
#endif
#ifdef ENABLE_THREADS
   int tids[nThreads];
   for(i=0; i<nThreads; i++) {
       tids[i]=i;
       CREATE_WITH_ARG(bs_thread, &tids[i]);
   }
   WAIT_FOR_END(nThreads);
#else
   int tid=0;
   bs_thread(&tid);
#endif //ENABLE_THREADS
#ifdef ENABLE_PARSEC_HOOKS
   __parsec_roi_end();
#endif

#ifdef ERR_CHK
   printf("Num Errors: %d\n", numError);
#endif
   for( int noptions = 0; noptions < numOptions; noptions++) {
     fprintf(output, "%f\n", prices[noptions]);
   }
  
   for (int loopnum = 0; loopnum < numOptions; ++ loopnum)
   {
 delete []data[loopnum].OptionType;
   }
   delete []data;

#ifdef ENABLE_PARSEC_HOOKS
   __parsec_bench_end();
#endif
 finalize_heartbeat(&heart);

   return 1;
}

canneal Perforated Loops

We find no perforatable loops for canneal. Most (6 of 8) were filtered out because their perforation does not increase the performance. We attribute this lack of performance gain to the fact that canneal performs annealing steps until it converges. While the perforations may affect the execution of the individual annealing steps, they do not cause canneal to converge faster and therefore do not improve the overall performance.

canneal Perforated Loops

ferret Perforated Loops

This benchmark performs a content-based similarity search based on a series of query images. For each query image, it returns the $N$ images in its database determined to be most similar. It ferret first decomposes the query image into a set of segments, extracts a feature vector from each segment, indexes an image database to find candidate similar images, ranks the candidate images by measuring the distance between the query image and the candidate images, then returns the highest ranked images.

The loop in the function sdist_emd() is inlined from the function emd(). This function measures the distance between the query image and the candidate image. Perforating this loop creates a new, more efficient, but less accurate distance function.

// This function is defined in src/emd.c in the ferret benchmark.
// The function is then inlined into the function sdist_emd()
float emd(signature_t *Signature1, signature_t *Signature2, float (*Dist)(cass_size_t, feature_t, feature_t, void *), cass_size_t dim, void *param, flow_t *Flow, int*FlowSize)
{
struct emd_state_t emd_state, *state = &emd_state;
int itr;
double totalCost;
float w;
emd_node2_t *XP;
flow_t *FlowP;
emd_node1_t U[MAX_SIG_SIZE1], V[MAX_SIG_SIZE1];

if (C == NULL) C = type_matrix_alloc(double, MAX_SIG_SIZE1, MAX_SIG_SIZE1);

bzero(state, sizeof *state);

state->C = C;

w = emdinit(state, Signature1, Signature2, Dist, dim, param);

if (w == EMD_INFINITY) {
    // init failed, due to nr_reg too high, ignore this seg.
    return EMD_INFINITY;
}

#if DEBUG_LEVEL > 1
print("\nINITIAL SOLUTION:\n");
printSolution(state);
#endif

if (state->n1 > 1 && state->n2 > 1)  /* IF _n1 = 1 OR _n2 = 1 THEN WE ARE DONE */
  {
      // This loop is perforated, changing itr++ to itr+=2
    for (itr = 1; itr < MAX_ITERATIONS; /*itr++*/ itr+=2)
    {
       /* FIND BASIC VARIABLES */
       findBasicVariables(state, U, V);

       /* CHECK FOR OPTIMALITY */
       if (isOptimal(state, U, V))
         break;

       /* IMPROVE SOLUTION */
       newSol(state);

#if DEBUG_LEVEL > 1
  print("\nITERATION # %d \n", itr);
  printSolution(state);
#endif
}

    if (itr == MAX_ITERATIONS)
      warn("emd: Maximum number of iterations has been reached (%d)\n",
      MAX_ITERATIONS);
    }

/* COMPUTE THE TOTAL FLOW */
totalCost = 0;
FlowP = Flow;
for(XP=state->X; XP < state->EndX; XP++)
  {
    if (XP == state->EnterX)  /* _EnterX IS THE EMPTY SLOT */
continue;
    if (XP->i == Signature1->n || XP->j == Signature2->n)  /* DUMMY FEATURE */
continue;
  
    if (XP->val == 0)  /* ZERO FLOW */
continue;

    totalCost += (double)XP->val * state->C[XP->i][XP->j];
    if (Flow != NULL)
{
  FlowP->from = XP->i;
  FlowP->to = XP->j;
  FlowP->amount = XP->val;
  FlowP->cost = state->C[XP->i][XP->j];
  state->tot_flow_costs += FlowP->cost;
  state->tot_flow ++;
  FlowP++;
}
  }
if (Flow != NULL)
  *FlowSize = FlowP-Flow;

#if DEBUG_LEVEL > 0
print("\n*** OPTIMAL SOLUTION (%d ITERATIONS): %f ***\n", itr, totalCost);
#endif

/* RETURN THE NORMALIZED COST == EMD */
//  print("avg_flow_costs: %g/%d = %g\n", state->tot_flow_costs,
//  state->tot_flow, state->tot_flow_costs / state->tot_flow);
return (float)(totalCost / w);
}

The loop in the LSH_query_bootstrap() function initializes several data structures required to query the image database; perforating this loop causes ferret to skip some of the initialization steps. Because the application explicitly zeros the memory before the initialization loop executes, the computation always accesses initialized memory and the Valgrind memcheck tool does not report any reads to uninitialized data. The presence of zeroed data structures in image database queries does not materially affect the quality of service of the application.

// This function is defined in src/lsh/LSH_query.c in the ferret benchmark
static void LSH_query_bootstrap (LSH_query_t *query, const float *point)
{
 cass_size_t D = query->lsh->D;
 cass_size_t K = query->K;
 cass_size_t L = query->L;
 LSH_t *lsh = query->lsh;
 ptb_vec_t **score = query->ptb;
 uint32_t **tmp = query->tmp;
 uint32_t *tmp2 = query->tmp2;
 cass_list_entry_t **_topk = query->_topk;
 cass_list_entry_t entry;

 int *C = query->C;
 int *H = query->H;

 int i;

 memset(C, 0, L * sizeof(int));
 memset(H, 0, L * sizeof(int));
 memset(query->S, 0, L * sizeof(float));

 bitmap_clear(query->bitmap);

 LSH_hash_score(query->lsh, L, point, tmp, score);
 LSH_hash2_noperturb(query->lsh, tmp, tmp2, L);

  // This loop is perforated, changing i++ to i+=2
 for (i = 0; i < L; i++)
 {
   memset(_topk[i], 0xff, sizeof (*_topk[i]) * K);
   TOPK_INIT(_topk[i], dist, K, HUGE);
   ARRAY_BEGIN_FOREACH(lsh->hash[i].bucket[tmp2[i]], uint32_t id) {
     if (!bitmap_contain(query->bitmap, id))
     {
       cass_vec_t *vec;
       bitmap_insert(query->bitmap, id);
          vec = DATASET_VEC(query->ds, id);
       entry.id = id;
       entry.dist = dist_L2_float(D, vec->u.float_data, point);
       C[i]++;
       query->CC++;
       TOPK_INSERT_MIN_UNIQ_DO(_topk[i], dist, id, K, entry, H[i]++);
     }
   }
   ARRAY_END_FOREACH;

   ptb_qsort(score[i], lsh->M * 2);

#ifdef  QUERY_DIRECT
   ARRAY_TRUNC(query->heap[i]);
   HEAP_ENQUEUE(query->heap[i], score[i][0], ptb_vec_ge);
#else
   query->ptb_step[i] = 0;
   map_perturb_vector(query->ptb_set, query->ptb_vec[i], score[i], lsh->M, query->T);
#endif
 }

 if (lsh->est != NULL)
 {
   LSH_query_merge(query);
   LSH_query_local(query);
 }

}

The two perforated loops account for a relatively small percentage of the execution time, their perforation does not significantly increase the performance.

Loop Perforation

Wednesday, November 25, 2009

x264 Outputs

bodytrack Outputs

streamcluster Perforated Loops

blackscholes Perforated Loops

canneal Perforated Loops

canneal Perforated Loops

ferret Perforated Loops

Blog Archive

Contributors

Followers