Dynamic Voltage and Frequency Scaling based on Workload Decomposition

> Kihwan Choi Ramakrishna Soma Massoud Pedram

Dept. of EE University of Southern California

## Outline

- Background
- Workload Decomposition
- PXA255's Performance Monitoring Unit
- Fine-grained DVFS Policy for BitsyX
- Experimental Results
- Conclusions

#### **Background on DVFS**

- DVFS is a method by which a variable amount of energy is allocated to perform a task
- Power consumption of a digital CMOS circuit is:

$$P = \alpha \cdot C_{\text{eff}} \cdot V^2 \cdot f$$

where $\alpha$ is the switching factor, $C_{\text{eff}}$ the effective capacitance, $V$ the operating voltage, and $f$ the operating frequency.

- Energy required to run a task during $T$ is:

$$E = P \cdot T \propto V^2 \qquad (\text{assuming } f \propto V,\ T \propto f^{-1})$$

- Lowering $V$ (while simultaneously and proportionately cutting $f$) causes a quadratic reduction in $E$
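As a quick numeric check of the quadratic relation, the sketch below compares the energy at the two ends of the PXA255 core-voltage range quoted in these slides (0.85 V and 1.3 V). The helper name `relative_energy` is hypothetical, and the result holds only under the idealized assumption stated above ($f \propto V$); a real processor's voltage and frequency steps need not scale proportionally.

```python
def relative_energy(v_new: float, v_ref: float) -> float:
    """Energy at v_new relative to v_ref under the idealized model E ∝ V^2."""
    if v_new <= 0 or v_ref <= 0:
        raise ValueError("voltages must be positive")
    return (v_new / v_ref) ** 2


# Core-voltage extremes quoted for the PXA255 in these slides.
ratio = relative_energy(0.85, 1.3)
print(f"E(0.85 V) / E(1.3 V) = {ratio:.3f}")
```

Under this model, running the same task at 0.85 V instead of 1.3 V would need a little under half the energy, at the cost of a longer execution time.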









## Components of the Program Execution Time

- The amount of CPU and memory workloads for an application program must be determined
- Execution time of a program is the sum of the on-chip latency (CPU work) and the off-chip latency (memory work):
  - $T = T^{ON} + T^{OFF}$
- $T^{ON}$ varies with the CPU frequency
  - Cache hit
  - Stall due to data dependency
  - TLB hit
- $T^{OFF}$ is invariant with the CPU frequency
  - Access to external memory such as SDRAM and frame buffer memory, which is in turn due to a cache miss
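The decomposition explains why CPU-bound and memory-bound programs react differently to a lower CPU clock. The following is a minimal sketch, assuming the on-chip time scales exactly as $1/f^{CPU}$ and the off-chip time stays fixed; the function name `predict_time` and the example timings are hypothetical, not measurements from the slides.

```python
def predict_time(t_on: float, t_off: float, f_old: float, f_new: float) -> float:
    """Predict execution time at f_new from the on/off-chip split measured at f_old."""
    if f_old <= 0 or f_new <= 0:
        raise ValueError("frequencies must be positive")
    # On-chip time scales inversely with CPU frequency; off-chip time is unchanged.
    return t_on * (f_old / f_new) + t_off


# Hypothetical 1-second runs at 400 MHz, re-estimated at 200 MHz.
cpu_bound = predict_time(t_on=0.9, t_off=0.1, f_old=400, f_new=200)
mem_bound = predict_time(t_on=0.3, t_off=0.7, f_old=400, f_new=200)
print(f"CPU-bound: {cpu_bound:.2f} s, memory-bound: {mem_bound:.2f} s")
```

Halving the clock nearly doubles the CPU-bound run but slows the memory-bound run far less, which is the slack that DVFS exploits.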



## Performance Monitoring Unit (PMU)

- The PMU on the PXA255 processor chip can report up to 15 different dynamic events during execution of a program
  - Cache hit/miss counts, TLB hit/miss counts, no. of stall cycles, total no. of instructions being executed, branch misprediction counts
- Only two events can be monitored and reported at any given time
- For DVFS, we use the PMU to generate statistics for
  - Total no. of instructions being executed (INSTR)
  - No. of stall cycles due to on-chip or off-chip data dependencies (STALL)
  - No. of data cache misses (DMISS)
- We also record the no. of clock cycles from the beginning of the program execution (CCNT)
- From these parameters, we can calculate the average on-chip CPI
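The slides do not spell out how the average on-chip CPI is derived from these counters, so the sketch below only computes the raw per-instruction ratios that such an estimate could be built from. The `PmuSample` container, the labels `SPI` and `MPI`, and the counter values are all hypothetical. It also assumes the four counts have already been gathered for the same interval, even though the PMU reports only two events at a time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PmuSample:
    """Counter values for one sampling interval (hypothetical container)."""
    ccnt: int   # clock cycles (CCNT)
    instr: int  # instructions executed (INSTR)
    stall: int  # stall cycles from data dependencies (STALL)
    dmiss: int  # data cache misses (DMISS)

    def per_instruction(self) -> dict:
        """Return cycles, stall cycles, and D-cache misses per instruction."""
        if self.instr <= 0:
            raise ValueError("INSTR must be positive")
        return {
            "CPI": self.ccnt / self.instr,
            "SPI": self.stall / self.instr,
            "MPI": self.dmiss / self.instr,
        }


# Made-up counter values, for illustration only.
sample = PmuSample(ccnt=2_000_000, instr=1_000_000, stall=600_000, dmiss=20_000)
print(sample.per_instruction())
```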

## Frequency Settings in BitsyX

- PXA255 can operate from 100MHz to 400MHz, with a core supply voltage of 0.85V to 1.3V
- Nine frequency combinations ($f^{CPU}$, $f^{INT}$, $f^{EXT}$)
- Internal bus connects the core and other functional blocks inside the CPU
- External bus is connected to SDRAM (64MB)

| Freq. Set | CPU (MHz) | Internal bus (MHz) | External bus (MHz) | Note                        |
|-----------|-----------|--------------------|--------------------|-----------------------------|
| $F_1$     | 100       | 50                 | 100                |                             |
| $F_2$     | 200       | 50                 | 100                |                             |
| $F_3$     | 300       | 50                 | 100                |                             |
| $F_4$     | 200       | 100                | 100                |                             |
| $F_5$     | 300       | 100                | 100                |                             |
| $F_6$     | 400       | 100                | 100                |                             |
| $F_7$     | 400       | 200                | 100                | Highest performance setting |
| $F_8$     | 133       | 66                 | 133                |                             |
| $F_9$     | 265       | 133                | 133                |                             |

## **Execution Time and Frequency Settings**

- Execution time variation over different frequency combinations for "djpeg", "qsort", and "gzip"
  - "djpeg" is CPU-bound (strongly dependent on $f^{CPU}$)
  - "gzip" is memory-bound ($f^{INT}$ and $f^{EXT}$ dependent)



| Freq. Set | CPU (MHz) | Internal bus (MHz) | External bus (MHz) |
|-----------|-----------|--------------------|--------------------|
| $F_1$     | 100       | 50                 | 100                |
| $F_2$     | 200       | 50                 | 100                |
| $F_3$     | 300       | 50                 | 100                |
| $F_4$     | 200       | 100                | 100                |
| $F_5$     | 300       | 100                | 100                |
| $F_6$     | 400       | 100                | 100                |
| $F_7$     | 400       | 200                | 100                |
| $F_8$     | 133       | 66                 | 133                |
| $F_9$     | 265       | 133                | 133                |


#### Modeling the Execution Time in BitsyX

- $T^{OFF}$ is strongly dependent on $f^{EXT}$. However, $f^{INT}$ also affects $T^{OFF}$
- Example: when a D-cache miss occurs, two operations are performed:
  - Data fetch from the external memory ($f^{EXT}$)
  - Data transfer to the CPU core, where the cache line and destination register are updated ($f^{INT}$)
- Due to lack of exact timing information, we have opted to model $T^{OFF}$ as:

$$T^{OFF} = \frac{\alpha \cdot W^{OFF}}{f^{INT}} + \frac{(1-\alpha) \cdot W^{OFF}}{f^{EXT}}$$

- An $\alpha$ value of ~0.35 was obtained for the tested applications
- The error in predicting the execution time was less than 3% for all nine frequency settings
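To make the model concrete, the sketch below evaluates it for two of the nine settings. It assumes $W^{OFF}$ is expressed in bus cycles and frequencies in Hz, which the slides do not state; the function name `t_off` and the workload value are hypothetical.

```python
ALPHA = 0.35  # value reported in the slides for the tested applications


def t_off(w_off: float, f_int_hz: float, f_ext_hz: float, alpha: float = ALPHA) -> float:
    """Off-chip time: a share alpha of the work runs at f_int, the rest at f_ext."""
    if f_int_hz <= 0 or f_ext_hz <= 0:
        raise ValueError("frequencies must be positive")
    return alpha * w_off / f_int_hz + (1 - alpha) * w_off / f_ext_hz


w = 1_000_000  # hypothetical off-chip workload, in cycles
t_f7 = t_off(w, f_int_hz=200e6, f_ext_hz=100e6)  # F7: internal 200 MHz, external 100 MHz
t_f1 = t_off(w, f_int_hz=50e6, f_ext_hz=100e6)   # F1: internal 50 MHz, external 100 MHz
print(f"F7: {t_f7 * 1e3:.2f} ms, F1: {t_f1 * 1e3:.2f} ms")
```

Although both settings share the same external bus frequency, the slower internal bus at F1 lengthens the predicted off-chip time, which is the effect the $\alpha$ term captures.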





## **Determining the Optimal Frequency Setting**

- After calculating $CPI_{on}^{avg}$ for the current quantum $i$, the on-chip and off-chip execution times are calculated as follows:

$$T_{i}^{ON} = \frac{N_{i} \cdot CPI_{on,i}^{avg}}{f_{i}^{CPU}} \qquad T_{i}^{OFF} = T_{i} - T_{i}^{ON}$$

- Next, we choose a frequency setting for quantum $i+1$, $F^{opt}_{i+1}$, that satisfies the following inequality:

$$T_{F_{i+1}^{opt}}^{i+1} \leq (1 + PF_{loss}) \cdot T_{F_{max}}^{i}$$

where $T_{F_{i+1}^{opt}}^{i+1}$ is the expected execution time of quantum $i+1$ at $F^{opt}_{i+1}$, $T_{F_{max}}^{i}$ is the execution time of quantum $i$ at $F_{max}$, and $PF_{loss}$ is the tolerated performance loss factor.
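The selection step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it takes $N_i$ to be the instruction count of the quantum, treats $F_7$ as $F_{max}$, reuses the $\alpha = 0.35$ off-chip model, assumes quantum $i+1$ behaves like quantum $i$, and uses the lowest CPU frequency as a stand-in for lowest energy because the slides give no per-setting energy data. All function names and example inputs are hypothetical.

```python
ALPHA = 0.35  # internal-bus share of off-chip work, as reported in the slides

# (name, f_cpu, f_int, f_ext) in MHz, copied from the frequency-setting table.
SETTINGS = [
    ("F1", 100, 50, 100), ("F2", 200, 50, 100), ("F3", 300, 50, 100),
    ("F4", 200, 100, 100), ("F5", 300, 100, 100), ("F6", 400, 100, 100),
    ("F7", 400, 200, 100), ("F8", 133, 66, 133), ("F9", 265, 133, 133),
]
F_MAX = SETTINGS[6]  # F7, assumed here to play the role of F_max


def off_factor(f_int, f_ext, alpha=ALPHA):
    """Off-chip time per unit of off-chip work; units cancel inside predict()."""
    return alpha / f_int + (1 - alpha) / f_ext


def predict(t_on, t_off, cur, new):
    """Predicted quantum time at setting `new`, given times measured at `cur`."""
    _, fc0, fi0, fe0 = cur
    _, fc1, fi1, fe1 = new
    w_off = t_off / off_factor(fi0, fe0)  # recover the off-chip workload
    return t_on * fc0 / fc1 + w_off * off_factor(fi1, fe1)


def choose_setting(n_instr, cpi_on, t_total, cur, pf_loss):
    """Pick a setting whose predicted time stays within the performance-loss budget."""
    t_on = n_instr * cpi_on / (cur[1] * 1e6)  # seconds; f_cpu is stored in MHz
    t_off = max(0.0, t_total - t_on)
    budget = (1 + pf_loss) * predict(t_on, t_off, cur, F_MAX)
    feasible = [s for s in SETTINGS if predict(t_on, t_off, cur, s) <= budget]
    # Assumption: a lower CPU frequency is used as a proxy for lower energy.
    return min(feasible, key=lambda s: (s[1], s[2], s[3]))


# Hypothetical 50 ms quantum measured at F7 with a 30% tolerated slowdown.
best = choose_setting(n_instr=10_000_000, cpi_on=1.2, t_total=0.05,
                      cur=F_MAX, pf_loss=0.3)
print(best[0])
```

Because $F_{max}$ itself always meets the budget when $PF_{loss} \geq 0$, the feasible list is never empty. A real policy would rank feasible settings by measured energy rather than by CPU frequency alone.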

## The BitsyX Platform

- The ADS BitsyX board has a PXA255 microprocessor, which combines a 32-bit RISC processor core, a 32KB instruction cache, a 32KB write-back data cache, a 2KB mini-cache, a write buffer, and a memory management unit (MMU) in a single chip













#### Conclusions

- A fine-grained DVFS technique based on online decomposition of the application workload into on-chip and off-chip components was presented
- Based on actual current measurements on the BitsyX platform:
  - For memory-bound programs, an average PXA255 energy saving of 70% was achieved with a 30% performance degradation
  - For CPU-bound programs, an average PXA255 energy saving of 40% was achieved at the cost of a 30% performance penalty
- Future work will consider the impact of DVFS on the total system energy consumption