Design Space Exploration
CMPEN 431 • Spring 2025

Introduction
Different processors are architected with different design goals: some optimize for performance, while others aim for efficiency, cost, area, or a combination of these factors. Achieving these goals requires exploring a multidimensional space of core microarchitecture, memory hierarchy, and cache configurations. This paper presents a heuristic for finding high-performing configurations under four different optimization targets: energy savings (EDP), energy efficiency per unit area (EDAP), performance-weighted efficiency (ED2P), and a balance of all three factors (ED2AP).
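Throughout, assuming the standard definitions, EDP = E × D, ED2P = E × D², EDAP = E × D × A, and ED2AP = E × D² × A, where E is total energy, D is execution time, and A is chip area; lower is better for all four metrics.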
With 18 dimensions and up to 10 indices per dimension, finding the optimal configuration exhaustively would be near impossible. I therefore designed a search algorithm that adapts to each of the different optimizations. The following sections describe the design space, the development of the heuristic strategies, the results, and a final analysis.
Design Space
To make full use of the limited evaluations, the validation function is crucial: it ensures that only architecturally valid configurations are considered, allowing the algorithm to prioritize meaningful design points. To simplify the implementation, I first decoded the encoded index values into their corresponding microarchitectural parameters and then enforced three classes of constraints: cardinality, structural, and realism checks. Cardinality checks ensure each dimension stays within its allowed minimum-to-maximum range. Structural checks verify compatibility between dependent dimensions, for example that the L2 capacity is at least the combined L1D and L1I capacity and that the L2 block size is at least twice the L1 block size. Realism checks filter out inefficient or unrealistic designs, such as mismatched cache latencies and bottleneck scenarios like width > 2 × fetch speed.
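As a minimal sketch of the structural class of checks (a hypothetical helper for illustration only; the actual checks live in validateConfiguration(), shown in full under Project Code):

```cpp
// Hypothetical helper illustrating the structural checks described above.
// All sizes are in bytes; the real validator decodes them from the 18 indices.
static bool structurallyValid(unsigned l1dSize, unsigned l1iSize,
                              unsigned l1BlockSize,
                              unsigned l2Size, unsigned l2BlockSize) {
    if (l2Size < l1dSize + l1iSize)     // L2 must hold at least both L1 caches
        return false;
    if (l2BlockSize < 2 * l1BlockSize)  // L2 block size >= 2x L1 block size
        return false;
    return true;
}
```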
After filtering out invalid configurations, the next step is to steer the search toward each individual optimization target. Since exhaustively iterating through every possible configuration is infeasible, especially under the 1000-evaluation limit, I narrowed the search with heuristics that exploit architectural characteristics of each optimization: EDP, EDAP, ED2P, and ED2AP.
EDP, which emphasizes energy efficiency, naturally led me to exploit simplicity (narrow, in-order, minimal hardware). As supported by the lectures and the energy tables in the project prompt, narrow in-order processors consume significantly less energy per cycle: the narrowest in-order pipeline consumes only 8 pJ, whereas the widest out-of-order processor consumes 27 pJ. Similarly, smaller hardware is cheaper to access; an 8 KB cache uses only 20 pJ at 0.125 mW per access, while main memory consumes 20 nJ at 512 mW per access.
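Put differently, a single main-memory access (20 nJ) costs roughly as much energy as a thousand of those 20 pJ cache accesses, so configurations that keep accesses within small caches dominate the energy budget.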
EDAP, which adds area to the objective, led me to target low-footprint configurations that still account for energy and delay. I primarily selected a narrow in-order pipeline, since this significantly reduces area overhead. As covered in class, in-order pipelines scale at width²/2 mm², while out-of-order pipelines begin at 4 mm² and scale exponentially.
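Under that model, for example, a 2-wide in-order pipeline occupies only 2²/2 = 2 mm², half the 4 mm² starting point of even the narrowest out-of-order pipeline.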
ED2P, which emphasizes performance, led me to prioritize wide pipelines with fast fetch speeds and out-of-order execution. The performance table shows that in-order processors have a shorter cycle time than dynamically scheduled ones at the same fetch width, but, as covered in class, out-of-order processors exploit ILP more aggressively: although cycle time grows by 5–10 ps, they retire more instructions per cycle by reducing pipeline stalls and hazards. Since ED2P weights delay quadratically, improving IPC is more impactful than saving a few picoseconds of cycle time.
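To see the quadratic effect with illustrative numbers: a change that cuts delay by 20% while raising energy by 10% scales ED2P by 1.10 × 0.80² ≈ 0.70 (a 30% improvement), whereas the same trade under EDP only reaches 1.10 × 0.80 = 0.88.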
ED2AP, which mixes all of the previous objectives, requires a hybrid approach. Since no single component dominates, I explored a wide range of configurations that balance performance, efficiency, and area without sacrificing any single metric.
Iterations of the Heuristic
Iteration 1 – Greedy Search
My first approach was a greedy, sequential search: I initialized all dimensions to index 0, then swept the space one dimension at a time. For example, for the first dimension (width) I tested every index and locked in the one with the best result, then moved on to the next dimension, eventually narrowing down to a single "greedy" configuration. After testing, however, this proved extremely expensive in evaluations, and I realized it left many strong configurations unexplored simply because of an early greedy decision.
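A minimal sketch of that greedy sweep, assuming a hypothetical evaluate() that scores a configuration (lower is better); GLOB_dimensioncardinality comes from the project framework:

```cpp
// Greedy dimension-by-dimension sweep (Iteration 1).
// evaluate() is a hypothetical stand-in for running the benchmarks and
// reading back the metric being optimized; lower is better.
extern int GLOB_dimensioncardinality[18];  // valid indices per dimension (project framework)
double evaluate(const int config[18]);     // hypothetical scoring function

void greedySweep(int config[18]) {
    for (int dim = 0; dim < 18; dim++) {
        int bestIdx = config[dim];
        double bestScore = evaluate(config);
        for (int idx = 0; idx < GLOB_dimensioncardinality[dim]; idx++) {
            config[dim] = idx;
            double score = evaluate(config);
            if (score < bestScore) {
                bestScore = score;
                bestIdx = idx;
            }
        }
        config[dim] = bestIdx;  // locked in and never revisited: the greedy weakness
    }
}
```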
Iteration 2 – Neighbor Search
To fix the problem of bad early-stage decisions and to explore more diversely, I switched to a fairer strategy that gives every dimension a chance. On each proposal, I increment one dimension's index and wrap it around with a modulo over the number of valid choices:
configuration[dimI] = (configuration[dimI] + 1) % validChoices
This generates a new "neighbor" of the previous configuration by changing a single dimension, so over many proposals it improves the diversity of the search and yields better results. However, while running it I noticed that invalid configurations would stall progress, so iteration 3 introduced an embedded validator to fix this.
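In condensed form (the full version, with the optimization-specific biases layered on top, appears under Project Code), the rotation looks like this:

```cpp
// Condensed Iteration 2 proposal step: rotate through the 18 dimensions
// across calls, bumping one index per proposal and wrapping with modulo.
extern int GLOB_dimensioncardinality[18];  // valid indices per dimension (project framework)

void proposeNeighbor(int configuration[18]) {
    static int dimIndex = 0;                    // remembered across calls
    int dimV = dimIndex;
    dimIndex = (dimIndex + 1) % 18;             // next proposal mutates the next dimension
    int validChoices = GLOB_dimensioncardinality[dimV];
    configuration[dimV] = (configuration[dimV] + 1) % validChoices;
}
```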
Iteration 3 – Retreat Configuration
This iteration introduced an embedded validation check to reduce stalls. I invoke validateConfiguration() on the candidate; if it returns 0, the configuration produced by the iteration-2 step has failed, and I retreat to a randomly generated configuration that passes validation, which is far less wasteful for the search. While 431projectUtils.cpp already rejects invalid configurations before finalizing them, my embedded check acts earlier in the process, reducing redundant attempts inside the proposal function itself.
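In condensed form, the fallback behaves like the following helper, equivalent to the check inside the proposal function under Project Code:

```cpp
#include <cstdlib>
#include <string>
#include "431project.h"  // validateConfiguration, compactConfiguration, GLOB_dimensioncardinality

// Iteration 3 fallback: if the mutated candidate fails validation,
// retreat to a random configuration that passes.
std::string retreatIfInvalid(std::string candidate) {
    while (validateConfiguration(candidate) == 0) {
        int retreated[18];
        for (int i = 0; i < 18; i++)
            retreated[i] = rand() % GLOB_dimensioncardinality[i];
        candidate = compactConfiguration(retreated);
    }
    return candidate;
}
```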
Iteration 4 – Optimization-based filtering
To exploit each of the four optimizations individually instead of treating all dimensions equally, I tuned specific dimensions to take advantage of the architectural characteristics aligned with the selected optimization goal.
For ED2P, I biased the configuration toward high throughput: large width, high fetch speed, and out-of-order execution, matching the optimization's emphasis on delay rather than power. For EDP, I biased each dimension toward the lower half of its index space; this heuristic favors simpler, more efficient hardware such as narrow in-order pipelines and minimal structures. For EDAP, I again emphasized narrow in-order processors and hardcoded dimensions such as width, fetch speed, and scheduling; unlike EDP, which biases randomly toward the lower half of each dimension, EDAP selectively targets values that minimize physical footprint. For ED2AP, I did not bias any single metric; instead I left all dimensions tunable, allowing a wide random exploration across performance, area, and energy.
Results and Analysis
EDP

The early spike in the EDP plot reflects the initial phase of exploring inefficient configurations, but as more iterations were evaluated, the heuristic converged toward simpler, lower-energy designs. The best EDP configuration achieved its lowest value on benchmark 1 and its highest on benchmark 4.
ED2P

The early spike in the ED2P plot reflects the initial phase of exploring inefficient configurations, but as more iterations were evaluated, the heuristic converged toward high-throughput designs. The best ED2P configuration, a wider out-of-order processor, achieved its lowest value on benchmark 1 and its highest on benchmark 0.
EDAP

The EDAP normalized geomean had a spiky start, but as more iterations were evaluated, the heuristic converged toward a minimized EDAP of around 1.07 × 10⁻⁸ J·s·mm². The best EDAP configuration used a narrow in-order processor and low-indexed dimensions to minimize physical footprint without a large loss in performance. This configuration achieved its lowest value on benchmark 1 and its highest on benchmark 4.
ED2AP

Similar to the other optimizations, the ED2AP normalized geomean had a spiky start, but as more iterations were evaluated, the heuristic converged toward a minimized ED2AP of around 1.87 × 10⁻⁸ J·s²·mm². The best ED2AP configuration, which factors in energy, performance, and area, was a narrow in-order processor with mixed settings across the remaining dimensions. It achieved its lowest value on benchmark 1 and its highest on benchmark 3.
All of the optimizations were normalized to the baseline configuration (0, 0, 0, 0, 0, 0, 5, 0, 5, 0, 2, 2, 2, 3, 0, 0, 3, 0).
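Assuming the usual convention, each plotted normalized geomean point is the geometric mean across the N benchmarks of the configuration's metric divided by the baseline's metric for that benchmark, i.e. (∏ᵢ metricᵢ / baselineᵢ)^(1/N), so values below 1 indicate an improvement over the baseline.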
Best configurations
Dim | EDP | ED2P | EDAP | ED2AP | Baseline |
---|---|---|---|---|---|
0: Width | 2 | 4 | 1 | 1 | 1 |
1: Fetch speed | 2 | 2 | 2 | 2 | 1 |
2: Scheduling | out of order | out of order | in order | in order | in order |
3: RUU Size | 128 | 64 | 16 | 8 | 4 |
4: LSQ Size | 32 | 16 | 8 | 8 | 4 |
5: Memports | 1 | 2 | 2 | 1 | 1 |
6: L1D sets | 512 | 512 | 512 | 512 | 1024 |
7: L1D ways | 1 | 1 | 2 | 2 | 1 |
8: L1I sets | 1024 | 128 | 512 | 512 | 1024 |
9: L1I ways | 2 | 4 | 4 | 2 | 2 |
10: Unified L2 sets | 2048 | 256 | 512 | 512 | 1024 |
11: Unified L2 block size | 128 | 64 | 128 | 64 | 64 |
12: Unified L2 ways | 2 | 8 | 2 | 4 | 4 |
13: TLB sets | 16 | 8 | 16 | 64 | 32 |
14: L1D latency | 1 | 2 | 2 | 2 | 1 |
15: L1I latency | 4 | 4 | 4 | 2 | 1 |
16: Unified L2 latency | 8 | 8 | 6 | 7 | 8 |
17: Branch predictor | perfect | bimodal | 2-level GAp | 2-level GAp | perfect |
Conclusion
In this project, I explored a large space of processor configurations and designed heuristics for different optimization goals: EDP, EDAP, ED2P, and ED2AP. By exploiting architectural characteristics and tailoring the algorithm to each goal, I was able to find a "best" configuration for each optimization. This project helped me better understand how microarchitectural choices affect performance, energy, and area in real-world systems.
Project Code
The proposal code (YOURCODEHERE-final.cpp) implements the heuristic search, generating candidate processor configurations and calling into 431project.cpp to evaluate them against the benchmarks. It relies on 431projectUtils.cpp to decode indices and enforce architectural constraints, ensuring only valid configurations are tested, while the shared header file (431project.h) ties these components together. The full project also includes a Makefile to compile everything into a single executable and a worker.sh script to automate multiple runs and manage evaluation limits. Only the proposal code is shown here, as it contains the core algorithm driving the exploration; the results it produces, including the best configurations and performance metrics, form the basis for the analysis and discussion presented on this site.
C++ source:
#include <iostream>
#include <sstream>
#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <fstream>
#include <map>
#include <math.h>
#include <fcntl.h>
#include "431project.h"
using namespace std;
/*
* Not all configurations are inherently valid. (For example, L1 > L2).
* Returns 1 if valid, else 0.
*/
int validateConfiguration(std::string configuration){
// is the configuration at least describing 18 integers/indices?
if (isan18dimconfiguration(configuration) != 1)
return 0;
// if it is, let's convert it to an array of ints for use below
int configurationDimsAsInts[18];
int returnValue = 1; // assume true, set 0 if otherwise
extractConfiguration(configuration, configurationDimsAsInts); // Configuration parameters now available in array
//
// Decode the encoded index values into microarchitectural parameters,
// then apply cardinality, structural, and realism checks.
//
int widthV = 1 << configurationDimsAsInts[0];
int fetchspeedV = 1 << configurationDimsAsInts[1];
int schedulingV = configurationDimsAsInts[2];
int ruusizeV = 1 << (configurationDimsAsInts[3] + 2);
int lsqsizeV = 1 << (configurationDimsAsInts[4] + 2);
int memportV = 1 << configurationDimsAsInts[5];
int L1D_setsV = 32 << configurationDimsAsInts[6];
int L1D_waysV = 1 << configurationDimsAsInts[7];
int L1I_setsV = 32 << configurationDimsAsInts[8];
int L1I_waysV = 1 << configurationDimsAsInts[9];
// helper values: L1 block size and derived cache capacities
int L1_blksizeV = widthV * 8;
unsigned L1D_sizeV = L1D_setsV * L1D_waysV * L1_blksizeV;
unsigned L1I_sizeV = L1I_setsV * L1I_waysV * L1_blksizeV;
int L2_setsV = 256 << configurationDimsAsInts[10];
int L2_blksizeV = 16 << configurationDimsAsInts[11]; //bytes
int L2_waysV = 1 << configurationDimsAsInts[12];
unsigned L2_sizeV = L2_setsV * L2_blksizeV * L2_waysV;
int tlb_sets = 4 << configurationDimsAsInts[13];
int L1D_latency = configurationDimsAsInts[14] + 1;
int L1I_latency = configurationDimsAsInts[15] + 1;
int L2_latency = configurationDimsAsInts[16] + 5;
int bp = configurationDimsAsInts[17];
// Cardinality checks: every decoded parameter must fall within its allowed range
if(widthV < 1 || fetchspeedV < 1 || schedulingV < 0 || ruusizeV < 4 || lsqsizeV < 4 || memportV < 1 || L1D_setsV < 32 || L1D_waysV < 1
|| L1I_setsV < 32 || L1I_waysV < 1 || L2_setsV < 256 || L2_blksizeV < 16 || L2_waysV < 1 || tlb_sets < 4 || L1D_latency < 1
|| L1I_latency < 1 || L2_latency < 5 || bp < 0){
return 0;
}
if (widthV > 8 || fetchspeedV > 2 || schedulingV > 1 || ruusizeV > 128 || lsqsizeV > 32 || memportV > 2 || L1D_setsV > 8192
|| L1D_waysV > 4 || L1I_setsV > 8192 || L1I_waysV > 4 || L2_setsV > 131072 || L2_blksizeV > 128 || L2_waysV > 16 || tlb_sets > 64
|| L1D_latency > 7 || L1I_latency > 7 || L2_latency > 13 || bp > 5){
return 0;
}
// Structural checks: L2 must be at least as large as both L1 caches combined,
// and the L2 block size must be at least twice the L1 block size
else if(L2_sizeV < (L1D_sizeV + L1I_sizeV) || (L2_blksizeV < (2*L1_blksizeV ))){
return 0;
}
// L1D: latency must equal a size-dependent base (t) plus an associativity penalty (n)
int t = 0, n = 0;
if(L1D_sizeV == 8*1024){
t = 1;
}
else if(L1D_sizeV == 16*1024){
t = 2;
}
else if(L1D_sizeV == 32*1024){
t = 3;
}
else if(L1D_sizeV == 64*1024){
t = 4;
}
else{
return 0;
}
if(L1D_waysV == 2){
n = 1;
}
else if(L1D_waysV == 4){
n = 2;
}
if(L1D_latency != t + n){
return 0;
}
t = 0;
n = 0;
// L1I: same latency model as the L1D cache
if(L1I_sizeV == 8*1024){
t = 1;
}
else if(L1I_sizeV == 16*1024){
t = 2;
}
else if(L1I_sizeV == 32*1024){
t = 3;
}
else if(L1I_sizeV == 64*1024){
t = 4;
}
else{
return 0;
}
if(L1I_waysV == 2){
n = 1;
}
else if(L1I_waysV == 4){
n = 2;
}
if(L1I_latency != t + n){
return 0;
}
t = 0;
n = 0;
// L2: latency model with a larger size-dependent base
if(L2_sizeV == 128*1024){
t = 7;
}
else if(L2_sizeV == 256*1024){
t = 8;
}
else if(L2_sizeV == 512*1024){
t = 9;
}
else if(L2_sizeV == 1024*1024){
t = 10;
}
else if(L2_sizeV == 2048*1024){
t = 11;
}
else{
return 0;
}
if(L2_waysV == 1){
n = -2;
}
else if(L2_waysV == 2){
n = -1;
}
else if(L2_waysV == 4){
n = 0;
}
else if(L2_waysV == 8){
n = 1;
}
else if(L2_waysV == 16){
n = 2;
}
if(L2_latency != t + n){
return 0;
}
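// Realism checks: reject fetch-bandwidth bottlenecks (width more than twice the
// fetch speed) and TLB configurations with more than 512 total entries (tlb_sets * 4)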
if(widthV > 2 * fetchspeedV || tlb_sets * 4 > 512){
return 0;
}
return returnValue;
}
/*
* Given the current best known configuration for a particular optimization,
* the current configuration, and using globally visible map of all previously
* investigated configurations, suggest a new, previously unexplored design
* point. You will only be allowed to investigate 1000 design points in a
* particular run, so choose wisely. Using the optimizeForX variables,
* propose your next configuration given the optimization strategy.
*/
std::string YourProposalFunction(
std::string currentconfiguration,
std::string bestED2Pconfiguration,
std::string bestEDPconfiguration,
std::string bestEDAPconfiguration,
std::string bestED2APconfiguration,
int optimizeforED2P,
int optimizeforEDP,
int optimizeforEDAP,
int optimizeforED2AP
){
/*
* Heuristic proposal (Iterations 2-4): rotate through the 18 dimensions,
* bias selected dimensions according to the active optimization target
* (Iteration 4), and retreat to a random valid configuration if the
* candidate fails validation (Iteration 3).
*/
int configuration[18];
extractConfiguration(currentconfiguration, configuration);
static int DIM_index = 0; // to remember value across different calls
int dimV = DIM_index;
DIM_index = (DIM_index + 1) %18; //rotates through each of the 18 dimensions
int validChoices = GLOB_dimensioncardinality[dimV];
//iteration 4
if(optimizeforED2P){
configuration[dimV] = (configuration[dimV] + 1) %validChoices;
configuration[0] = (rand() %2 ) + 2; // width being very large
configuration[1] = 1; //fetch speed
configuration[2] = 1; //scheduling OoO
}
else if(optimizeforEDP){
configuration[dimV] = (configuration[dimV] +(validChoices/2) - 1) %validChoices;
}
else if(optimizeforEDAP){
configuration[dimV] = (configuration[dimV] + 1) %validChoices;
configuration[0] = 0;
configuration[1] = 1; //fetch speed
configuration[2] = 0; //scheduling in order
//configuration[dimV] = (configuration[dimV] +(validChoices/2) - 1) %validChoices;
//iteration 5
//configuration[dimV] = rand()% (validChoices/ 2 + 1);
}
else if(optimizeforED2AP){
for(int dimV = 0; dimV < 18; dimV++){
configuration[dimV] = rand() % GLOB_dimensioncardinality[dimV];
}
}
else{
//not possible
}
//iteration 2
//generating neighbor
//configuration[dimV] = (configuration[dimV] + 1) %validChoices; // rotates through each of the index in specified dimension
string tempConfig = compactConfiguration(configuration);
// added in the third iteration of my proposal function
if(validateConfiguration(tempConfig) == 0){
int retreatedConfiguration[18];
string newConfig = "";
while(true){
for(int i = 0; i < 18; i++){
retreatedConfiguration[i] = rand() % GLOB_dimensioncardinality[i];
}
newConfig = compactConfiguration(retreatedConfiguration);
if(validateConfiguration(newConfig)){ // if new random config passes
break;
}
}
tempConfig = newConfig;
}
return tempConfig;
}