x264 encodes video according to the H.264 standard.
We have made videos encoded with x264, both with and without perforation.
These videos are available through the CodePerforation playlist on YouTube.
Wednesday, November 25, 2009
bodytrack Outputs
bodytrack outputs a series of images representing the position of the body with the tracks believed position of the body overlaid. Here we show images for the last frame of the production sequence with and without perforation.
Reference output (no perforation):

Output with perforation and 15% distortion bound:

At the 15% QoS bound the perforated tracker runs 2.57x faster, while maintaining the track on the largest features of the body - the chest and the legs. The track of smaller parts of the body has been perturbed as a cost of the speedup.
Output with dynamic perforation in the case where 3/8 cores fail:
Here we show the output when dynamic perforation is employed to maintain performance in the case of core failures. The performance of the tracker is preserved despite the loss of 37.5% of the system's compute capacity. In additon the tracker is able to maintain a track on the chest, head, and legs despite this capacity loss.
Output with dynamic perforation in the case where core frequency drops to 60% of the original and then rises again:
For this experiment, the core frequency of the processor is dropped from 2.5 GHz to 1.6 GHz 1/4 of the way through the application. With 1/4 of the frames left to process, frequency is raised back to the original value. The runtime system dynamically perforates bodytrack to maintain performance during the period of low clock speed. When the clock frequency returns, the perforation is turned off and full QoS is restored. Perforation allows bodytrack to maintain the track despite the change in clock frequency.
Reference output (no perforation):

Output with perforation and 15% distortion bound:

At the 15% QoS bound the perforated tracker runs 2.57x faster, while maintaining the track on the largest features of the body - the chest and the legs. The track of smaller parts of the body has been perturbed as a cost of the speedup.
Output with dynamic perforation in the case where 3/8 cores fail:

Output with dynamic perforation in the case where core frequency drops to 60% of the original and then rises again:

streamcluster Perforated Loops
This benchmark implements an online approximation of the K-means clustering algorithm~\cite{hartigan1979k}. It partitions a set of points into clusters, with each cluster centered around one of the input points. The algorithm is designed to process sets of points from a space of features, with the clusters corresponding to entities with similar features.
The inner loop in pFL() estimates the cost of opening a new cluster center. Perforating this loop allows the application to make a less accurate estimate of this cost more quickly. The following comment appears above the definition of the loop bound (ITER) in the source code:
/* higher ITER --> more likely to get correct number of centers */
/* higher ITER also scales the running time almost linearly */
This comment reflects the fact that the original version of the loop computes an inexact approximation of the true cost, with the number of iterations controlling a performance vs. quality of service tradeoff. In effect, the perforation space exploration space algorithm rediscovers this tradeoff.
Our examination of the source code confirms that perforating this loop
(as the comment suggests) leaves the application able to process all inputs.
The inner loop in pFL() estimates the cost of opening a new cluster center. Perforating this loop allows the application to make a less accurate estimate of this cost more quickly. The following comment appears above the definition of the loop bound (ITER) in the source code:
/* higher ITER --> more likely to get correct number of centers */
/* higher ITER also scales the running time almost linearly */
This comment reflects the fact that the original version of the loop computes an inexact approximation of the true cost, with the number of iterations controlling a performance vs. quality of service tradeoff. In effect, the perforation space exploration space algorithm rediscovers this tradeoff.
// This function is defined in streamcluster.cpp in the streamcluster source code
float pFL(Points *points, int *feasible, int numfeasible,
float z, long *k, double cost, long iter, float e,
int pid, pthread_barrier_t* barrier)
{
#ifdef ENABLE_THREADS
pthread_barrier_wait(barrier);
#endif
long i;
long x;
double change;
long numberOfPoints;
change = cost;
/* continue until we run iter iterations without improvement */
/* stop instead if improvement is less than e */
while (change/cost > 1.0*e) {
change = 0.0;
numberOfPoints = points->num;
/* randomize order in which centers are considered */
if( pid == 0 ) {
intshuffle(feasible, numfeasible);
}
#ifdef ENABLE_THREADS
pthread_barrier_wait(barrier);
#endif
// This loop is perforated, changing i++ to i+=2
for (i=0;i<iter;i++) {
x = i%numfeasible;
change += pgain(feasible[x], points, z, k, pid, barrier);
}
cost -= change;
#ifdef PRINTINFO
if( pid == 0 ) {
fprintf(stderr, "%d centers, cost %lf, total distance %lf\n",
*k, cost, cost - z*(*k));
}
#endif
#ifdef ENABLE_THREADS
pthread_barrier_wait(barrier);
#endif
}
return(cost);
}
Our examination of the source code confirms that perforating this loop
(as the comment suggests) leaves the application able to process all inputs.
blackscholes Perforated Loops
The experimental results show that it is possible to perforate the outer loop in main() without changing the output at all. Further investigation reveals that this loop is superfluous --- it simply repeats the same computation multiple times and was apparently added to the benchmark to increase the workload.
// This function is the main function in the threaded version of blackscholes.
// It is found in blackscholes.cpp in the blackscholes source code
int bs_thread(void *tid_ptr) {
int i, j, k;
fptype price;
fptype priceDelta;
int tid = *(int *)tid_ptr;
int start = tid * (numOptions / nThreads);
int end = start + (numOptions / nThreads);
#ifdef ENABLE_THREADS
BARRIER(barrier);
#endif
// This loop is perforated, changing j++ to j+=2
for (j=0; j<NUM_RUNS; j++) {
for (i=start; i<end; i++) {
/* Calling main function to calculate option value based on
* Black & Sholes's equation.
*/
if(i%25000 == 0)
heartbeat(&heart);
price = BlkSchlsEqEuroNoDiv( sptprice[i], strike[i],
rate[i], volatility[i], otime[i],
otype[i], 0);
prices[i] = price;
#ifdef ERR_CHK
priceDelta = data[i].DGrefval - price;
if( fabs(priceDelta) >= 1e-5 ){
printf("Error on %d. Computed=%.5f, Ref=%.5f, Delta=%.5f\n",
i, price, data[i].DGrefval, priceDelta);
numError ++;
}
#endif
}
}
return 0;
}
int main (int argc, char **argv)
{
#ifdef PARSEC_VERSION
#define __PARSEC_STRING(x) #x
#define __PARSEC_XSTRING(x) __PARSEC_STRING(x)
printf("PARSEC Benchmark Suite Version "__PARSEC_XSTRING(PARSEC_VERSION)"\n");
fflush(NULL);
#else
printf("PARSEC Benchmark Suite\n");
fflush(NULL);
#endif //PARSEC_VERSION
#ifdef ENABLE_PARSEC_HOOKS
__parsec_bench_begin(__parsec_blackscholes);
#endif
if (argc != 3)
{
printf("Usage:\n\t%s <nthreads> <numOptions>\n", argv[0]);
exit(1);
}
nThreads = atoi(argv[1]);
#ifndef ENABLE_THREADS
if(nThreads != 1) {
printf("Error: <nthreads> must be 1 (serial version)\n");
exit(1);
}
#endif
int i, j;
init_heartbeat(&heart, 0, 100, 20, "heartbeat.log");
numOptions = atoi(argv[2]);
fptype * buffer;
int * buffer2;
// alloc spaces for the option data
data = new OptionData[numOptions];
prices = new fptype[numOptions];
FILE* output=fopen("prices.txt", "w");
// initialize the data array
int initOptionNum = ( (sizeof(data_init)) / sizeof(OptionData) );
for ( int loopnum = 0; loopnum < numOptions; ++ loopnum )
{
OptionData *temp = data_init + loopnum%initOptionNum;
data[loopnum].OptionType = new char[strlen(temp->OptionType)];
strcpy(data[loopnum].OptionType, temp->OptionType);
data[loopnum].s = temp->s;
data[loopnum].strike = temp->strike;
data[loopnum].r = temp->r;
data[loopnum].divq = temp->divq;
data[loopnum].v = temp->v;
data[loopnum].t = temp->t;
data[loopnum].divs = temp->divs;
data[loopnum].DGrefval = temp->DGrefval;
}
#ifdef ENABLE_THREADS
MAIN_INITENV(,8000000,nThreads);
BARINIT(barrier);
#endif
printf("Num of Options: %d\n", numOptions);
printf("Num of Runs: %d\n", NUM_RUNS);
#define PAD 256
#define LINESIZE 64
buffer = (fptype *) malloc(5 * numOptions * sizeof(fptype) + PAD);
sptprice = (fptype *) (((unsigned long long)buffer + PAD) & ~(LINESIZE - 1));
strike = sptprice + numOptions;
rate = strike + numOptions;
volatility = rate + numOptions;
otime = volatility + numOptions;
buffer2 = (int *) malloc(numOptions * sizeof(fptype) + PAD);
otype = (int *) (((unsigned long long)buffer2 + PAD) & ~(LINESIZE - 1));
for (i=0; i<numOptions; i++) {
otype[i] = (data[i].OptionType == "P") ? 1 : 0;
// printf("Option %d, type = %s, otype = %d\n", i, data[i].OptionType, otype[i]);
sptprice[i] = data[i].s;
strike[i] = data[i].strike;
rate[i] = data[i].r;
volatility[i] = data[i].v;
otime[i] = data[i].t;
}
printf("Size of data: %d\n", sizeof(data) + sizeof(otype));
#ifdef ENABLE_PARSEC_HOOKS
__parsec_roi_begin();
#endif
#ifdef ENABLE_THREADS
int tids[nThreads];
for(i=0; i<nThreads; i++) {
tids[i]=i;
CREATE_WITH_ARG(bs_thread, &tids[i]);
}
WAIT_FOR_END(nThreads);
#else
int tid=0;
bs_thread(&tid);
#endif //ENABLE_THREADS
#ifdef ENABLE_PARSEC_HOOKS
__parsec_roi_end();
#endif
#ifdef ERR_CHK
printf("Num Errors: %d\n", numError);
#endif
for( int noptions = 0; noptions < numOptions; noptions++) {
fprintf(output, "%f\n", prices[noptions]);
}
for (int loopnum = 0; loopnum < numOptions; ++ loopnum)
{
delete []data[loopnum].OptionType;
}
delete []data;
#ifdef ENABLE_PARSEC_HOOKS
__parsec_bench_end();
#endif
finalize_heartbeat(&heart);
return 1;
}
canneal Perforated Loops
We find no perforatable loops for canneal. Most (6 of 8) were filtered out because their perforation does not increase the performance. We attribute this lack of performance gain to the fact that canneal performs annealing steps until it converges. While the perforations may affect the execution of the individual annealing steps, they do not cause canneal to converge faster and therefore do not improve the overall performance.
canneal Perforated Loops
We find no perforatable loops for canneal. Most (6 of 8) were filtered out because their perforation does not increase the performance. We attribute this lack of performance gain to the fact that canneal performs annealing steps until it converges. While the perforations may affect the execution of the individual annealing steps, they do not cause canneal to converge faster and therefore do not improve the overall performance.
ferret Perforated Loops
This benchmark performs a content-based similarity search based on a series of query images. For each query image, it returns the $N$ images in its database determined to be most similar. It ferret first decomposes the query image into a set of segments, extracts a feature vector from each segment, indexes an image database to find candidate similar images, ranks the candidate images by measuring the distance between the query image and the candidate images, then returns the highest ranked images.
The loop in the function sdist_emd() is inlined from the function emd(). This function measures the distance between the query image and the candidate image. Perforating this loop creates a new, more efficient, but less accurate distance function.
The loop in the LSH_query_bootstrap() function initializes several data structures required to query the image database; perforating this loop causes ferret to skip some of the initialization steps. Because the application explicitly zeros the memory before the initialization loop executes, the computation always accesses initialized memory and the Valgrind memcheck tool does not report any reads to uninitialized data. The presence of zeroed data structures in image database queries does not materially affect the quality of service of the application.
The two perforated loops account for a relatively small percentage of the execution time, their perforation does not significantly increase the performance.
The loop in the function sdist_emd() is inlined from the function emd(). This function measures the distance between the query image and the candidate image. Perforating this loop creates a new, more efficient, but less accurate distance function.
// This function is defined in src/emd.c in the ferret benchmark.
// The function is then inlined into the function sdist_emd()
float emd(signature_t *Signature1, signature_t *Signature2, float (*Dist)(cass_size_t, feature_t, feature_t, void *), cass_size_t dim, void *param, flow_t *Flow, int*FlowSize)
{
struct emd_state_t emd_state, *state = &emd_state;
int itr;
double totalCost;
float w;
emd_node2_t *XP;
flow_t *FlowP;
emd_node1_t U[MAX_SIG_SIZE1], V[MAX_SIG_SIZE1];
if (C == NULL) C = type_matrix_alloc(double, MAX_SIG_SIZE1, MAX_SIG_SIZE1);
bzero(state, sizeof *state);
state->C = C;
w = emdinit(state, Signature1, Signature2, Dist, dim, param);
if (w == EMD_INFINITY) {
// init failed, due to nr_reg too high, ignore this seg.
return EMD_INFINITY;
}
#if DEBUG_LEVEL > 1
print("\nINITIAL SOLUTION:\n");
printSolution(state);
#endif
if (state->n1 > 1 && state->n2 > 1) /* IF _n1 = 1 OR _n2 = 1 THEN WE ARE DONE */
{
// This loop is perforated, changing itr++ to itr+=2
for (itr = 1; itr < MAX_ITERATIONS; /*itr++*/ itr+=2)
{
/* FIND BASIC VARIABLES */
findBasicVariables(state, U, V);
/* CHECK FOR OPTIMALITY */
if (isOptimal(state, U, V))
break;
/* IMPROVE SOLUTION */
newSol(state);
#if DEBUG_LEVEL > 1
print("\nITERATION # %d \n", itr);
printSolution(state);
#endif
}
if (itr == MAX_ITERATIONS)
warn("emd: Maximum number of iterations has been reached (%d)\n",
MAX_ITERATIONS);
}
/* COMPUTE THE TOTAL FLOW */
totalCost = 0;
FlowP = Flow;
for(XP=state->X; XP < state->EndX; XP++)
{
if (XP == state->EnterX) /* _EnterX IS THE EMPTY SLOT */
continue;
if (XP->i == Signature1->n || XP->j == Signature2->n) /* DUMMY FEATURE */
continue;
if (XP->val == 0) /* ZERO FLOW */
continue;
totalCost += (double)XP->val * state->C[XP->i][XP->j];
if (Flow != NULL)
{
FlowP->from = XP->i;
FlowP->to = XP->j;
FlowP->amount = XP->val;
FlowP->cost = state->C[XP->i][XP->j];
state->tot_flow_costs += FlowP->cost;
state->tot_flow ++;
FlowP++;
}
}
if (Flow != NULL)
*FlowSize = FlowP-Flow;
#if DEBUG_LEVEL > 0
print("\n*** OPTIMAL SOLUTION (%d ITERATIONS): %f ***\n", itr, totalCost);
#endif
/* RETURN THE NORMALIZED COST == EMD */
// print("avg_flow_costs: %g/%d = %g\n", state->tot_flow_costs,
// state->tot_flow, state->tot_flow_costs / state->tot_flow);
return (float)(totalCost / w);
}
The loop in the LSH_query_bootstrap() function initializes several data structures required to query the image database; perforating this loop causes ferret to skip some of the initialization steps. Because the application explicitly zeros the memory before the initialization loop executes, the computation always accesses initialized memory and the Valgrind memcheck tool does not report any reads to uninitialized data. The presence of zeroed data structures in image database queries does not materially affect the quality of service of the application.
// This function is defined in src/lsh/LSH_query.c in the ferret benchmark
static void LSH_query_bootstrap (LSH_query_t *query, const float *point)
{
cass_size_t D = query->lsh->D;
cass_size_t K = query->K;
cass_size_t L = query->L;
LSH_t *lsh = query->lsh;
ptb_vec_t **score = query->ptb;
uint32_t **tmp = query->tmp;
uint32_t *tmp2 = query->tmp2;
cass_list_entry_t **_topk = query->_topk;
cass_list_entry_t entry;
int *C = query->C;
int *H = query->H;
int i;
memset(C, 0, L * sizeof(int));
memset(H, 0, L * sizeof(int));
memset(query->S, 0, L * sizeof(float));
bitmap_clear(query->bitmap);
LSH_hash_score(query->lsh, L, point, tmp, score);
LSH_hash2_noperturb(query->lsh, tmp, tmp2, L);
// This loop is perforated, changing i++ to i+=2
for (i = 0; i < L; i++)
{
memset(_topk[i], 0xff, sizeof (*_topk[i]) * K);
TOPK_INIT(_topk[i], dist, K, HUGE);
ARRAY_BEGIN_FOREACH(lsh->hash[i].bucket[tmp2[i]], uint32_t id) {
if (!bitmap_contain(query->bitmap, id))
{
cass_vec_t *vec;
bitmap_insert(query->bitmap, id);
vec = DATASET_VEC(query->ds, id);
entry.id = id;
entry.dist = dist_L2_float(D, vec->u.float_data, point);
C[i]++;
query->CC++;
TOPK_INSERT_MIN_UNIQ_DO(_topk[i], dist, id, K, entry, H[i]++);
}
}
ARRAY_END_FOREACH;
ptb_qsort(score[i], lsh->M * 2);
#ifdef QUERY_DIRECT
ARRAY_TRUNC(query->heap[i]);
HEAP_ENQUEUE(query->heap[i], score[i][0], ptb_vec_ge);
#else
query->ptb_step[i] = 0;
map_perturb_vector(query->ptb_set, query->ptb_vec[i], score[i], lsh->M, query->T);
#endif
}
if (lsh->est != NULL)
{
LSH_query_merge(query);
LSH_query_local(query);
}
}
The two perforated loops account for a relatively small percentage of the execution time, their perforation does not significantly increase the performance.
Subscribe to:
Posts (Atom)