A Mathematical Solution to Power Optimal Pipeline Design by Utilizing Soft Edge Flip Flops

M. Ghasemazar, B. Amelifard, M. Pedram

University of Southern California
Department of Electrical Engineering

August 11, 2008

ISLPED 2008
Outline

- Soft-Edge Flip Flops
- Power Optimal Pipeline Design
- Problem Formulation
- SEFF Modeling
- Experimental Results
- Conclusion
Soft Edge Flip Flop

- Key idea: Allow the data to pass through a flip flop during a transparency window, instead of on a triggering clock edge.

- Key advantage: Enables slack passing between adjacent pipeline stages that are separated by (master-slave) flip-flops.

- Circuit implementation: Delay the clock of the master latch to create a window during which both the master and slave latches are ON.
SEFF Implementation

Conventional (Hard Edge) Master-Slave FF

Soft Edge Master-Slave FF
SEFF Characteristics

- Setup and hold times, and clock-to-q delay of a soft-edge flip-flop are all functions of the transparency window width, $w$
- Simulations show a linear dependency on $w$

\[
\begin{cases}
t_{s,i}(w_i) = a_1 w_i + a_0 \\
t_{h,i}(w_i) = b_1 w_i + b_0 \\
t_{cq,i}(w_i) = c_1 w_i + c_0
\end{cases}
\]

\[y = 0.921x - 30.45\]

\[y = -0.651x + 33.34\]
SEFF Characteristics – cont’d

• Power consumption of a SEFF is monotonically increasing with its window size \( w \). This is due to:
  – Higher switching activities in the internal nodes in the transparency window
  – Higher dynamic and leakage power consumption in the additional delay generation circuitry

• Experimental evaluation of total power consumption:

\[
P_{FF,i} = d_2 w_i^2 + d_1 w_i + d_0
\]
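The linear timing fits and the quadratic power fit can be evaluated together in a few lines. The sketch below is only an illustration: the class name `SeffModel`, its field layout, and every coefficient value are assumptions made for the example, not values from the simulations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeffModel:
    """Fitted model of one soft-edge flip-flop as a function of window width w."""
    a: tuple  # (a1, a0): setup-time fit      t_s(w)  = a1*w + a0
    b: tuple  # (b1, b0): hold-time fit       t_h(w)  = b1*w + b0
    c: tuple  # (c1, c0): clock-to-q fit      t_cq(w) = c1*w + c0
    d: tuple  # (d2, d1, d0): power fit       P_FF(w) = d2*w^2 + d1*w + d0

    def setup(self, w):
        return self.a[0] * w + self.a[1]

    def hold(self, w):
        return self.b[0] * w + self.b[1]

    def clk_to_q(self, w):
        return self.c[0] * w + self.c[1]

    def power(self, w):
        d2, d1, d0 = self.d
        return d2 * w * w + d1 * w + d0


# Hypothetical coefficients (ps for timing, arbitrary units for power).
example = SeffModel(a=(-1.0, 30.0), b=(1.0, 30.0), c=(0.0, 30.0), d=(0.5, 2.0, 20.0))
```

With these placeholder coefficients, a wider window lowers the setup time, raises the hold time, and raises the power, which is the qualitative behavior described above.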
Pipeline Basics

- Timing constraints for a linear pipeline

\[ d_i + t_{s,i} + t_{cq,i-1} \leq T_{clk} \quad 1 \leq i \leq N \]  
\[ \delta_i + t_{cq,i-1} \geq t_{h,i} \quad 1 \leq i \leq N \]

- Substitute FFs with SEFFs
  - The first and last FFs remain hard-edge
    - This avoids imposing constraints on the sender/receiver of data
  - Intermediate-stage FFs may be replaced by SEFFs

\[ d_i \leq T_{clk} - t_{s,i}(w_i) - t_{cq,i-1}(w_{i-1}) \quad 1 \leq i \leq N \]
\[ \delta_i \geq t_{h,i}(w_i) - t_{cq,i-1}(w_{i-1}) \quad 1 \leq i \leq N \]
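The two constraint families can be checked stage by stage. The following is a minimal sketch, assuming flip-flops are numbered 0..N and stage i sits between FF i-1 and FF i; the function name `check_pipeline` and all numeric values in the example are hypothetical.

```python
def check_pipeline(d_max, d_min, t_s, t_h, t_cq, t_clk):
    """Return (setup_ok, hold_ok), one boolean per stage of an N-stage pipeline.

    d_max[i-1], d_min[i-1]: max/min combinational delay of stage i (i = 1..N)
    t_s[k], t_h[k], t_cq[k]: setup, hold, clock-to-q of flip-flop k (k = 0..N)
    """
    n = len(d_max)
    setup_ok, hold_ok = [], []
    for i in range(1, n + 1):
        # Setup: d_i + t_s,i + t_cq,i-1 <= T_clk
        setup_ok.append(d_max[i - 1] + t_s[i] + t_cq[i - 1] <= t_clk)
        # Hold: delta_i + t_cq,i-1 >= t_h,i
        hold_ok.append(d_min[i - 1] + t_cq[i - 1] >= t_h[i])
    return setup_ok, hold_ok


# Hypothetical 2-stage pipeline, all times in ps.
hard = check_pipeline([520, 450], [100, 100],
                      t_s=[30, 30, 30], t_h=[30, 30, 30], t_cq=[30, 30, 30],
                      t_clk=560)
# Same pipeline with FF1 modeled as a SEFF (made-up shifted timing values).
soft = check_pipeline([520, 450], [100, 100],
                      t_s=[30, -20, 30], t_h=[30, 80, 30], t_cq=[30, 80, 30],
                      t_clk=560)
```

In this made-up example the hard-edge pipeline violates the setup constraint of stage 1, while the soft-edge version meets all four constraints because stage 1 takes time from stage 2.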
Power Optimal Pipeline

- Main idea: Pass the available slack of some stages to more timing-critical stages, giving them more freedom for power optimization through voltage scaling.
- For example, let $T_{clk}=T_{clk,\text{min}}=560\text{ps}$ and $t_s=t_h=t_{cq}=30\text{ps}$
  - If FF1 is replaced with a SEFF with a window size of 50ps
    - the first stage borrows 50ps from the second stage
    - the circuit can be powered with a lower supply voltage level
  - Ideally, 10% $V_{dd}$ reduction -> 19% power saving
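As a quick check of the numbers in this example: the symbol $d_{\max}$ (the hard-edge delay budget per stage) is introduced here for illustration, and the power estimate assumes purely dynamic power that scales with the square of the supply voltage at a fixed clock frequency.

\[
\begin{aligned}
d_{\max} &= T_{clk} - t_s - t_{cq} = 560 - 30 - 30 = 500\text{ ps} \\
\frac{P(0.9\,V_{dd})}{P(V_{dd})} &= 0.9^2 = 0.81 \quad\Rightarrow\quad 1 - 0.81 = 19\%\ \text{saving}
\end{aligned}
\]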
PSLP Problem Statement

- **Power-optimal Soft Linear Pipeline Design**
  - Goal: Minimize the total power consumption of an N-stage linear pipeline circuit
  - Variables:
    - Supply voltage level (1 variable)
    - Transparency window sizes of the individual soft-edge FF sets (N-1 variables)
    - Delay elements to avoid hold time violations (N variables)
  - Constraints:
    - Setup/hold times
    - Window size limits
    - Single supply voltage

\[
\begin{aligned}
\text{Min.} \quad & P_{\text{total}} = \sum_{i=1}^{N} P_{\text{comb},i}(v) + \sum_{i=1}^{N-1} P_{\text{FF},i}(w_i,v) + \sum_{i=1}^{N} P_{\text{DE},i}(z_i,v) \\
\text{s.t.} \quad & d_i(v) \leq T_{\text{clk}} - t_{s,i}(w_i,v) - t_{cq,i-1}(w_{i-1},v), \quad 1 \leq i \leq N && \text{(I)} \\
& \delta_i(v) + z_i \geq t_{h,i}(w_i,v) - t_{cq,i-1}(w_{i-1},v), \quad 1 \leq i \leq N && \text{(II)} \\
& w_{\text{min}} \leq w_i \leq w_{\text{max}}, \quad 1 \leq i \leq N - 1 && \text{(III)} \\
& v \in \{V_0, V_1, \ldots, V_{m-1}\} && \text{(IV)}
\end{aligned}
\]
SEFF Modeling

- Setup time, hold time, clock-to-q delay, and power dissipation are functions of both voltage and transparency window size
  - Voltage-dependent coefficients are determined from SPICE simulations

\[
\begin{aligned}
  t_{s,i}(w_i, v) &= a_1(v)w_i + a_0(v) \\
  t_{h,i}(w_i, v) &= b_1(v)w_i + b_0(v) \\
  t_{cq,i}(w_i, v) &= c_1(v)w_i + c_0(v) \\
  P_{FF,i}(w_i, v) &= d_2(v)w_i^2 + d_1(v)w_i + d_0(v)
\end{aligned}
\]
Combinational Circuit Modeling

- Total power consumption of stage $i$ at voltage level $v$:

$$P_{comb,i}(v) = \left(\frac{v}{V_0}\right)^2 P_{dyn,i} + \left(\frac{v}{V_0}\right)^3 P_{leak,i}$$

- Maximum and minimum combinational logic delays (calculated from the alpha-power law):

$$d_i(v) = \left(\frac{V_0 - V_t}{v - V_t}\right)^\alpha d_i(V_0)$$

$$\delta_i(v) = \left(\frac{V_0 - V_t}{v - V_t}\right)^\alpha \delta_i(V_0)$$

- Power dissipation overhead of a delay element:

$$P_{DE}(z, v) = k(v) \cdot z$$
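These three scaling models translate directly into code. The sketch below mirrors the formulas above; the function names and the example values for $V_t$ and $\alpha$ are assumptions chosen for illustration, not parameters reported in the slides.

```python
def comb_power(v, v0, p_dyn, p_leak):
    """Stage power at voltage v, given dynamic and leakage power at nominal v0."""
    r = v / v0
    return r ** 2 * p_dyn + r ** 3 * p_leak


def scaled_delay(v, v0, v_t, alpha, delay_at_v0):
    """Alpha-power-law delay at voltage v, given the delay at nominal v0."""
    return ((v0 - v_t) / (v - v_t)) ** alpha * delay_at_v0


def delay_element_power(z, k_v):
    """Power overhead of a delay element of delay z; k_v is k(v) at the chosen v."""
    return k_v * z


# Hypothetical parameters: V0 = 1.2 V, Vt = 0.3 V, alpha = 1.3.
nominal = scaled_delay(1.2, 1.2, 0.3, 1.3, 320.0)   # unchanged at v = V0
slower = scaled_delay(1.0, 1.2, 0.3, 1.3, 320.0)    # longer delay at lower v
```

Lowering the supply voltage reduces power through `comb_power` but lengthens every delay through `scaled_delay`, which is the trade-off the optimization balances.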
Solving the PSLP

• To solve PSLP
  – Enumerate all possible values for $v$
  – PSLP with fixed voltage (PSLP-FV)
    • $P_{comb,i}$ terms become constants and drop out of the cost function
    • Voltage constraint (IV) disappears
    • All other timing and power parameters become only dependent on $w_i$ and $z_i$ variables
  – For each fixed $v$, a quadratic program is set up and solved
    • We must minimize a quadratic cost function subject to linear inequality constraints
    • PSLP-FV can be solved optimally in polynomial time
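The enumeration over voltage levels can be sketched as an outer loop around a fixed-voltage subproblem. In the toy version below the inner quadratic program is replaced by an exhaustive search over integer window sizes for a single SEFF in a 2-stage pipeline; hold constraints and delay elements are omitted, and every name and number (delays, $V_t$, $\alpha$, power coefficients) is a hypothetical placeholder.

```python
def delay_scale(v, v0=1.2, v_t=0.3, alpha=1.3):
    """Alpha-power-law delay multiplier relative to the nominal voltage v0."""
    return ((v0 - v_t) / (v - v_t)) ** alpha


def solve_fixed_voltage(v, d_max=(460.0, 360.0), t_clk=560.0, t_ff=30.0,
                        w_range=range(0, 61), v0=1.2):
    """Toy PSLP-FV: cheapest feasible window (ps) for the one SEFF, or None.

    Stand-in for the quadratic program: brute-force search over w_range.
    Returns (total_power, w) for the best feasible window.
    """
    s = delay_scale(v, v0)
    r = v / v0
    p_comb = r ** 2 * 4.0 + r ** 3 * 1.0       # hypothetical P_dyn, P_leak
    best = None
    for w in w_range:
        t_s_soft = t_ff - w                     # hypothetical linear fits
        t_cq_soft = t_ff + w
        stage1_ok = s * d_max[0] + t_s_soft + t_ff <= t_clk
        stage2_ok = s * d_max[1] + t_ff + t_cq_soft <= t_clk
        if stage1_ok and stage2_ok:
            # hypothetical d2, d1, d0 of the SEFF power fit
            power = p_comb + 0.0002 * w * w + 0.001 * w + 0.2
            if best is None or power < best[0]:
                best = (power, w)
    return best


def solve_pslp(voltages, inner=solve_fixed_voltage):
    """Enumerate voltages; keep the lowest-power feasible (power, v, w)."""
    best = None
    for v in voltages:
        result = inner(v)
        if result is None:      # no window meets timing at this voltage
            continue
        power, w = result
        if best is None or power < best[0]:
            best = (power, v, w)
    return best


best = solve_pslp([1.2, 1.1, 1.0])
```

In this made-up instance the hard-edge pipeline (w = 0) only meets timing at 1.2 V, whereas a nonzero window makes 1.1 V feasible at lower total power; 1.0 V has no feasible window in the allowed range. A real implementation would hand each fixed-voltage subproblem to a QP solver instead of the search loop.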
Experimental Setup

• HSPICE simulations were used to extract the parameters needed for the problem formulation
  – 65nm Predictive Technology Model (PTM)
  – Nominal supply voltage 1.2V
  – Die temperature 100°C

• The SIS optimization package was used to synthesize a set of linear pipelines as test-bench circuits

• The MOSEK toolbox was used to solve the mathematical optimization problem

• All results were collected on a 2.4GHz Pentium 4 PC with 2GB memory
Benchmark Specifications

<table>
<thead>
<tr>
<th>Testbench (# of stages)</th>
<th>(max, min) stage delays at nominal voltage (ps)</th>
<th>Clock freq. (GHz)</th>
</tr>
</thead>
<tbody>
<tr>
<td>TB1 (4)</td>
<td>(320,140), (332,150), (308,150), (320,170)</td>
<td>2.0</td>
</tr>
<tr>
<td>TB2 (5)</td>
<td>(320,140), (332,150), (308,150), (280,145), (320,170)</td>
<td>2.0</td>
</tr>
<tr>
<td>TB3 (3)</td>
<td>(325,150), (310,155), (219,160)</td>
<td>2.0</td>
</tr>
<tr>
<td>TB4 (5)</td>
<td>(275,40), (235,40), (245,60), (275,50), (275,70)</td>
<td>2.5</td>
</tr>
<tr>
<td>TB5 (4)</td>
<td>(310,100), (245,40), (245,50), (245,60)</td>
<td>2.5</td>
</tr>
</tbody>
</table>
Experimental Results

Using slack passing to minimize power without degrading performance

<table>
<thead>
<tr>
<th>TB</th>
<th>Power Red. (%)</th>
<th>Optimum Vdd (V)</th>
<th>Optimum Window size (ps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>TB1</td>
<td>32.1</td>
<td>1.0</td>
<td>40, 49, 22</td>
</tr>
<tr>
<td>TB2</td>
<td>33.8</td>
<td>1.0</td>
<td>40, 49, 46, 21</td>
</tr>
<tr>
<td>TB3</td>
<td>48.1</td>
<td>0.95</td>
<td>43, 52</td>
</tr>
<tr>
<td>TB4</td>
<td>16.3</td>
<td>1.10</td>
<td>36, 35, 35, 20</td>
</tr>
<tr>
<td>TB5</td>
<td>25.4</td>
<td>1.05</td>
<td>60, 41, 36</td>
</tr>
</tbody>
</table>

- Area overhead: Negligible compared to size of the rest of the pipeline circuit
- Runtime for all benchmarks: Less than one second

Utilizing slack passing to improve performance

<table>
<thead>
<tr>
<th>Testbench</th>
<th>Performance Improvement</th>
</tr>
</thead>
<tbody>
<tr>
<td>TB1</td>
<td>14%</td>
</tr>
<tr>
<td>TB2</td>
<td>15%</td>
</tr>
<tr>
<td>TB3</td>
<td>20%</td>
</tr>
<tr>
<td>TB4</td>
<td>5%</td>
</tr>
<tr>
<td>TB5</td>
<td>10%</td>
</tr>
</tbody>
</table>
A Case Study: 34-bit Adder

• Problem: How to partition a 34-bit adder into 4 stages of pipeline to achieve maximum performance?

<table>
<thead>
<tr>
<th>Configuration</th>
<th>Vdd (V)</th>
<th>Min Clock Period (ps)</th>
<th>Power Consumption (mW)</th>
</tr>
</thead>
<tbody>
<tr>
<td>10–8–8–8</td>
<td>1.2</td>
<td>450</td>
<td>6.42</td>
</tr>
<tr>
<td>8–10–8–8</td>
<td>1.2</td>
<td>472</td>
<td>6.50</td>
</tr>
<tr>
<td>8–8–10–8</td>
<td>1.2</td>
<td>472</td>
<td>6.51</td>
</tr>
<tr>
<td>8–8–8–10</td>
<td>1.2</td>
<td>486</td>
<td>6.55</td>
</tr>
<tr>
<td>9–9–8–8</td>
<td>1.2</td>
<td>455</td>
<td>6.42</td>
</tr>
<tr>
<td>9–8–9–8</td>
<td>1.2</td>
<td>433</td>
<td>6.51</td>
</tr>
</tbody>
</table>
A Case Study: 34-bit Adder

• Problem: How to partition a 34-bit adder into 4 stages of pipeline to achieve minimum power at target performance level?

Minimum Power @ 2.0GHz

<table>
<thead>
<tr>
<th>Configuration</th>
<th>Vdd (V)</th>
<th>Power Consumption (mW)</th>
</tr>
</thead>
<tbody>
<tr>
<td>10–8–8–8</td>
<td>1.05</td>
<td>4.9</td>
</tr>
<tr>
<td>8–10–8–8</td>
<td>1.15</td>
<td>5.1</td>
</tr>
<tr>
<td>9–9–8–8</td>
<td>1.05</td>
<td>4.9</td>
</tr>
<tr>
<td>9–8–8–9</td>
<td>1.10</td>
<td>4.9</td>
</tr>
</tbody>
</table>
Conclusion

• We presented a new technique to minimize the total power consumption of a linear pipeline circuit by utilizing soft-edge flip-flops and choosing the optimal supply voltage level for the pipeline.

• We formulated the problem as a mathematical program and solved it efficiently.

• Our experimental results demonstrate that this technique is quite effective in reducing the power consumption of a pipeline circuit under a performance constraint.

• Future work will focus on the problem of minimizing the energy cost of throughput in a linear pipeline circuit with dynamic error detection and correction capability.