Energy-Conscious HW/SW-Partitioning of Embedded Systems:

A Case Study on an MPEG-2 Encoder

Jörg Henkel, C&C Research Laboratories, NEC USA, Princeton, NJ 08540

Yanbing Li, Dept. of EE, Princeton University, Princeton, NJ 08544

Abstract

Energy dissipation is a central concern in the design of embedded systems, especially mobile ones. Applications such as digital video cameras and cellular phones draw their current from batteries that supply only a limited amount of energy. In this paper we show that energy-conscious HW/SW-partitioning can lead to drastic reductions in the energy dissipation of a whole embedded system. The subject of investigation is an MPEG-2 encoder. We introduce our framework for estimating and optimizing system energy as well as all conducted design steps. The results show energy savings of up to 59% while the performance remains approximately the same or even becomes slightly higher. As a main result, energy-conscious HW/SW-partitioning is a promising method to be deployed in addition to classical energy and/or power reduction methods.

1 Introduction

Reducing energy dissipation is a prominent task during the design phase of embedded systems. One of the manifold reasons is that the market share of one class, the mobile embedded systems, is steadily increasing. A characteristic peculiarity of such a system is that it draws its current from batteries that supply only a limited amount of energy. Hence, the design goal is to minimize the energy dissipation in order to enable a longer operation period. Practical examples of mobile embedded systems are consumer products like digital video cameras and cellular phones. Reducing their energy dissipation increases their "mobility". These small embedded systems often come as systems-on-a-chip where components like processors, caches, application-specific hardware and even the main memory [1] reside on the same piece of silicon. The corresponding design process is often referred to as core-based system design. In this context, a core can be an already optimized piece of hardware (a hard core or firm core) like a processor core, or a more abstract description (e.g. an algorithm in C or VHDL) that can still be parameterized and optimized and even leaves the form of implementation (e.g. software or hardware) open. The latter is called a soft core. As a result, the designer has powerful design aids to tailor an embedded system that exactly matches the required constraints. Energy-conscious system design requires a sophisticated adaptation and tailoring of all system resources to prevent energy dissipation caused by under-utilized system parts. As one major step, tailoring includes the selection of an appropriate HW/SW-partitioning as well as the fixing of parameters like cache sizes, cache associativities, cache line sizes, main memory size etc. Core-based system design makes these design goals possible.

In this case study we investigate the power dissipation of an MPEG-2 encoder for different HW/SW-partitions and different system configurations like data cache size, instruction cache size, cache line size, cache associativity and main memory size. To this end, we deploy our research framework in order to estimate and optimize energy dissipation for most of the system parts. The remaining system parts are treated by in-house production tools.

This paper is structured as follows: the next section summarizes the research activities for estimating and minimizing energy and/or power dissipation of various system parts as well as recent approaches in co-synthesis for low energy/power. Section 3 presents the estimation models of our framework. The MPEG-2 application, the performed design steps and the partitioning considerations are outlined in Section 4, whereas the obtained results are presented and discussed in Section 5. Finally, Section 6 gives a conclusion.

2 Related Research

Estimating and optimizing energy/power dissipation has been addressed from a software point of view¹ as well as from a hardware point of view (application-specific hardware, memory hierarchy).

Tiwari and Malik [2] investigated the energy that is dissipated during the execution of programs within a processor. They found that the energy dissipation is sensitive to the executed instruction. From these investigations they derived specific optimization techniques to minimize the software energy. They did not take into consideration the energy dissipation of the memory hierarchy and its impact on the whole system's energy dissipation. Ong and Ynn [3] have shown that the energy dissipation may vary drastically depending on the algorithms running on a processor. A power and performance simulation tool for a RISC design has been developed by Sato et al. [4]. Their tool has been used by designers to conduct architectural-level optimizations.

¹ This means the power that is dissipated during the execution of a software program on a programmable processor. Since the power dissipation varies according to the executed instruction, the term software energy is justified. We will use this term in the following.

Further work deals with energy/power dissipation from a hardware point of view. Investigations of the major sources of power dissipation within a processor have been undertaken by Burd and Peters [5]. Gonzales and Horowitz [6] explored the power dissipation depending on the processor architecture (pipelined, un-pipelined, super-scalar).

Other system parts like caches have been analyzed by Kamble and Ghose [7]. A model for main memory power dissipation from a behavioral-level point of view has been derived by Itoh et al. [20]. A method for estimating the power dissipation of a synthesized hardware (ASIC) from a behavioral level of abstraction has been proposed by Raghunathan and Dey [8]. Optimizing energy dissipation by means of high-level transformations has been addressed by Potkonjak et al. [9], for example.

In conclusion, none of the described approaches offers a comprehensive solution, i.e. an approach that takes into consideration the interdependency of all relevant system parts in order to estimate/minimize the energy/power dissipation of the complete system. This gap has been recognized by recent research activities in low-power co-synthesis [10, 11], even though these approaches do not reflect all system resources (no caches). Co-synthesis driven by other constraints (performance and, in part, hardware effort) has been explored by [12, 13, 14, 15] among others.

The case study presented in this paper quantifies the interdependency of various system parts in terms of energy dissipation. Our framework estimates and optimizes system energy dissipation and is central to the conducted case study.

3 Estimation Models

A brief summary of the energy estimation models within our framework is presented in this section. More details can be found in [16].

3.1 Data and Instruction Cache

The models for data and instruction caches are generic: cache size, associativity and line size are the parameters. Following [7], we model the switching capacitances of the major sources of energy dissipation: word-line capacitances, bit-line capacitances, decoder capacitances and capacitances of the output driver. The topology of the tag array and the decoder depends on the cache size and the cache parameters, whereas the topology of the data array can be derived directly from the cache size. Data and tag array are composed of 6-transistor SRAM cells. Our model captures the dynamic energy dissipation, i.e. the energy dissipated during accesses to the caches. This is the final formula in its simplified version:

$$E_{cache} = N_{acc} \cdot \left( \alpha_1 C_{bit,rd} + \alpha_2 C_{word} + \alpha_3 C_{bit,wr} + C_{dec} + C_{out} \right) \qquad (1)$$

$C_{bit,rd}$, $C_{word}$, $C_{bit,wr}$, $C_{dec}$ and $C_{out}$ are the effective capacitances for a bit-line read access, a word-line access, a bit-line write access, the decoder and the output driver, respectively.² $N_{acc}$ gives the number of cache accesses and is obtained by a cache simulator (see Section 3.3). Note that the capacitances are themselves complex expressions composed of basic capacitances³. The basic capacitances have been obtained through the tool Cacti [17], an analytical suite of models for a 0.8 µm CMOS technology.

² $\alpha_1$, $\alpha_2$ and $\alpha_3$ are expressions that contain in parts statistical access numbers.
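To illustrate how the array topologies follow from the cache parameters, the following minimal sketch derives the number of sets and the address-field widths from cache size, line size and associativity. It uses the standard set-associative address split and is not the capacitance model of [7] or [16]; the function names, the 32-bit address width and the interpretation of the line size in bytes are assumptions made for this example.

```python
def log2_exact(n: int) -> int:
    """Return log2(n) for a positive power of two, else raise ValueError."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a positive power of two")
    return n.bit_length() - 1


def cache_geometry(size_bytes: int, line_bytes: int, assoc: int,
                   addr_bits: int = 32) -> dict:
    """Derive array dimensions of a set-associative cache (illustrative only).

    addr_bits = 32 is an assumption; valid and status bits are ignored.
    """
    sets = size_bytes // (line_bytes * assoc)
    offset_bits = log2_exact(line_bytes)   # selects a byte within a line
    index_bits = log2_exact(sets)          # selects a set (decoder width)
    tag_bits = addr_bits - index_bits - offset_bits
    return {
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
        "data_array_bits": size_bytes * 8,          # depends on size only
        "tag_array_bits": sets * assoc * tag_bits,  # depends on all parameters
    }


if __name__ == "__main__":
    # Example parameters: 2 KB cache, line size 4, associativity 2.
    print(cache_geometry(size_bytes=2048, line_bytes=4, assoc=2))
```

In this sketch the data array size depends on the cache size alone, whereas doubling the associativity at constant size halves the number of sets and adds one tag bit per line, so the tag array and the decoder change with every parameter.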

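3.2 Software Energy

The model is based on the techniques described in [2] but enhanced by the impact that the number of cache read misses ($N_{miss,rd}$), cache write misses ($N_{miss,wr}$) and instruction fetch misses ($N_{miss,fetch}$) have on the energy dissipation of the processor. It is modeled as follows:

$$E_{sw} = V_{DD} \cdot T_{cyc} \cdot \Big( \underbrace{\sum_{i=1}^{N} n_i \cdot I_i}_{100\%\ \text{cache hit rate}} + I_{pen} \cdot \big( \underbrace{N_{miss,rd} \cdot n_{rd}}_{\text{data read miss penalty}} + \underbrace{N_{miss,wr} \cdot n_{wr}}_{\text{data write miss penalty}} + \underbrace{N_{miss,fetch} \cdot n_{fetch}}_{\text{instruction fetch miss penalty}} \big) \Big) \qquad (2)$$

Here, $N$ is the number of instructions of the program, $n_i$ the number of cycles instruction $i$ resides in the processor pipeline and $I_i$ the current that is drawn during the execution of instruction $i$. $V_{DD}$ is the supply voltage. Furthermore, $T_{cyc}$ is the cycle time, and $n_{rd}$, $n_{wr}$ and $n_{fetch}$ are the numbers of cycles that are imposed through a cache read/write/fetch penalty, respectively. The current drawn during penalty cycles is given by $I_{pen}$. $T_{100\%}$ is the execution time of the program with an assumption of a 100% cache hit rate.⁴ In the current state this model is implemented for a 33 MHz SPARCLite processor. The total program execution time is modeled by:

$$T_{sw} = T_{100\%} + T_{cyc} \cdot \left( N_{miss,rd} \cdot n_{rd} + N_{miss,wr} \cdot n_{wr} + N_{miss,fetch} \cdot n_{fetch} \right) \qquad (3)$$

3.3 Design Flow of our Framework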

The input is a C program (Fig. 1) of the application⁵. The upper branch comprises a behavior simulator [23] that is attached to our software energy and performance models (Eq. 2 and 3). The lower branch contains the trace tool QPT and the Dinero III [19] cache simulator, which feeds the software, cache and main memory energy models with the numbers of total cache accesses, data cache hits, data cache misses and instruction fetch misses. The outputs are the software energy dissipation and the performance numbers. The feedback loop contains an optimization algorithm⁶ that adapts cache sizes, associativities and cache line sizes in order to reduce the total energy dissipation.
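How the miss counts from the cache simulator enter the software energy and performance models (Eq. 2 and 3) can be sketched as follows. This is a minimal illustration and not the framework's implementation: it assumes that the energy is the supply voltage times the cycle time times the summed products of cycles and current, with miss penalties added as extra cycles at a penalty current. All class, function and parameter names, including the explicit supply voltage `v_dd`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MissCounts:
    """Miss counts as a cache simulator would report them."""
    data_read: int
    data_write: int
    instr_fetch: int


@dataclass
class PenaltyCycles:
    """Stall cycles imposed by one miss of each kind."""
    data_read: int
    data_write: int
    instr_fetch: int


def total_penalty_cycles(misses: MissCounts, penalty: PenaltyCycles) -> int:
    """Sum of all stall cycles caused by read, write and fetch misses."""
    return (misses.data_read * penalty.data_read
            + misses.data_write * penalty.data_write
            + misses.instr_fetch * penalty.instr_fetch)


def execution_time(t_100: float, t_cyc: float,
                   misses: MissCounts, penalty: PenaltyCycles) -> float:
    """Run time with a 100% hit rate, corrected by the miss penalties."""
    return t_100 + t_cyc * total_penalty_cycles(misses, penalty)


def software_energy(v_dd: float, t_cyc: float, instructions, i_pen: float,
                    misses: MissCounts, penalty: PenaltyCycles) -> float:
    """Energy in joules; `instructions` holds (cycles, current_in_A) pairs."""
    hit_term = sum(cycles * current for cycles, current in instructions)
    miss_term = i_pen * total_penalty_cycles(misses, penalty)
    return v_dd * t_cyc * (hit_term + miss_term)
```

Keeping the penalty cycles in one helper makes the shared structure of the two models visible: the same miss-induced stall cycles lengthen the run time and, weighted by the penalty current, add to the energy.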

³ Gate and drain capacitances, line capacitances etc.

⁴ This is because the deployed behavior simulator primarily does not reflect the impact of caches. It assumes a 100% cache hit rate. So, the equation reflects the corrected model.

⁵ I.e. only those parts that are implemented as software.

⁶ The implementation of the optimization loop is described in [16].

Figure 1: Design flow of our framework for estimating and optimizing system energy dissipation.

4 Approach and Design Methodology

The subject of investigation for this case study is an MPEG-2 encoder which is intended to run on the architecture shown in Fig. 2. It features a CPU, instruction and data cache, a main memory and an application-specific hardware (ASIC). In the following sub-section the approach for minimizing/reducing system energy dissipation is described.

4.1 The Approach

It should be noted that reducing the performance is not an option for decreasing power dissipation, since an MPEG-2 application is a real-time application that has to meet performance constraints. Instead, the principal idea to reduce energy dissipation comprises the following two steps:

A. Energy-conscious HW/SW-partitioning

A general purpose processor⁷ is not designed to handle special applications like an MPEG-2 encoder efficiently. As one example, the number of available FUs (functional units) is limited. As a consequence, the processor executes more slowly than a hardware with multiple FUs. During this time, other FUs might be unutilized (e.g. the multiplier in an application that features no multiplications) but still contribute to the total energy dissipation without increasing the performance.⁸ A hardware, on the other hand, can be designed in such a way that the utilization of each FU is much higher than in the processor, i.e. "wasting of energy" can be minimized by removing low-utilized or non-utilized FUs. Even if the power dissipation of the hardware is higher, the dissipated energy can still be smaller, since the instruction-level parallelism leads to much shorter execution times and hence to a smaller energy dissipation. This energy saving effect is the larger, the smaller the selected application part is, because a smaller part can be better adapted, i.e. a high utilization of each FU is much more likely.

For our case study this means that we have to detect small system parts that can be synthesized efficiently (with high utilization) in hardware. All other system parts are still carried out by the processor.

⁷ In the following we will use the term "processor" when we mean "general purpose processor" and "hardware" when we mean an "application-specific hardware", i.e. an "ASIC".

⁸ This is correct in cases where no gated clocks are used.

Figure 2: The target architecture.

B. Optimizing the remaining system parameters

Besides the processor and the hardware, the memory hierarchy is an additional major source of energy dissipation. In case of a misconfiguration it can even dominate. Depending on the chosen size of data cache and instruction cache and on their parameters (associativity, line size etc.), the contribution of each system part to the total energy dissipation can vary drastically. Hence, it is hard to predict which parameters and sizes lead to a minimum energy dissipation of the whole system.

Figure 2a: Energy dissipation (left) and execution time (right) of the MPEG-2 encoder for various data and instruction cache sizes. Both plots are titled "MPEG" and span icache size [2**val] and dcache size [2**val]; the vertical axes show energy [Joule] (left) and exec. time [cycles] (right).

As an illustration, Fig. 2a shows the energy dissipation in Joule (left) and the execution time in cycles (right) of the MPEG-2 encoder for different data and instruction cache sizes⁹. This is part of the results of a pre-investigation (i.e. no hardware considered). Surprisingly, neither the largest nor the smallest cache sizes lead to a minimum energy dissipation. Additionally, not even the largest cache sizes lead to the highest performance (i.e. the smallest number of cycles). Eq. 1 explains what happens: large cache sizes increase the switching capacitances, such that each cache access dissipates more energy. In such a case the caches dominate the total energy dissipation. Small cache sizes, on the other hand, lead to more data and instruction cache misses. According to Eq. 2, the software energy then increases. Apparently, this is a complex optimization problem. It is addressed by our framework.
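The trade-off described above can be made concrete with a toy model. The sketch below is purely illustrative: the linear growth of access energy with cache size, the inverse dependence of the miss rate on cache size, and all constants are invented for this example and are not taken from the framework or from Fig. 2a. The exhaustive search stands in for the optimization algorithm of [16], which is not reproduced here.

```python
from itertools import product

# Hypothetical constants, chosen only to make the trade-off visible.
E_MISS_NJ = 256.0             # energy cost of one cache miss [nJ]
E_ACCESS_NJ_PER_BYTE = 0.001  # access energy grows linearly with cache size


def energy_per_access_nj(size_bytes: int, miss_coeff: float) -> float:
    """Average energy per access: larger caches cost more per access,
    smaller caches miss more often (miss rate ~ miss_coeff / size)."""
    access = E_ACCESS_NJ_PER_BYTE * size_bytes
    miss_rate = min(1.0, miss_coeff / size_bytes)
    return access + miss_rate * E_MISS_NJ


def best_cache_exponents(exponents, i_miss_coeff=16.0, d_miss_coeff=64.0):
    """Exhaustively search (icache, dcache) size exponents for minimum energy.

    Both caches are assumed to see the same number of accesses.
    """
    def total(pair):
        i_exp, d_exp = pair
        return (energy_per_access_nj(2 ** i_exp, i_miss_coeff)
                + energy_per_access_nj(2 ** d_exp, d_miss_coeff))
    return min(product(exponents, exponents), key=total)


if __name__ == "__main__":
    print(best_cache_exponents(range(7, 15)))
```

With these invented constants the minimum lies strictly inside the searched range for both caches, mirroring the observation that neither the smallest nor the largest cache sizes minimize the energy.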

4.2 The MPEG-2 Encoder

Figure 3 shows a part of the data flow within the MPEG-2 encoder. Compression for both spatial and temporal (motion) reduction of redundancy is based on a DCT (discrete cosine transformation). Quantization (Q) is the process that determines what information can be discarded without a significant loss. Following the flow, Huffman coding is used to code the entropy¹⁰. Dequantization ($Q^{-1}$)¹¹ and inverse DCT (IDCT) are deployed as a part of generating the future frame. For more details see, for example, [21]. The basis for our case study is a complete MPEG-2 software encoder (200 KB of C source code). The code has been explored by techniques of profiling, behavior simulation, scheduling for different resource constraints and energy investigation of the software implementation (see last section). Finally, three functions have been selected for a hardware implementation: fdct() as a part of the DCT, idctcol() as a part of the IDCT and dist() as a part of the quantization process (Q).

⁹ Given in the figures is the value of the exponent of the cache sizes: a value of 12, for example, means a cache size of $2^{12} = 4096$ bytes. The results have been obtained by our framework.

Figure 3: The interesting parts of the MPEG-2 encoder.

4.3 Design Methodology

For each of the selected co-design configurations, the following steps have been performed:

I. The candidate hardware part is synthesized and analyzed. This incorporates the following sub-steps (see also Fig. 4):

a) The respective functionality is encoded in BDL and synthesized by means of the high-level synthesis tool Cyber¹².

b) The VHDL-RTL code output of Cyber is simulated with VSIM [22]. It delivers the number of clock cycles for execution and allows the RTL code to be evaluated.

c) The RTL and logic synthesis tool Varchsyn¹³ is fed with the VHDL code. Technology mapping is done by means of NEC's CMOS6 library. The output of Varchsyn is a netlist in PWC format as well as information about the total number of cells and the maximum possible clock frequency.

d) Gate-level simulation is performed by CSIM¹⁴. One operation mode of CSIM delivers the energy dissipation based on the switching activities. For that step, stimuli vectors have to be provided.

II. Our framework optimizes the data cache size, the instruction cache size and the corresponding associativities and line sizes. Furthermore, it provides detailed information about the energy dissipation of the software, the main memory and the caches. The performance of the software part is provided as well.

III. The results of the previous steps are energy and performance data of the whole HW/SW-system as well as system configuration parameters that ensure low energy dissipation.

¹⁰ Entropy is a measure of randomness and disorder.

¹¹ The inverse of quantization.

¹² Cyber is NEC's in-house high-level synthesis tool. BDL is the HDL used in Cyber.

¹³ Varchsyn is NEC's in-house RTL and logic synthesis tool.

¹⁴ NEC's in-house gate-level simulator.

Figure 4: Hardware synthesis steps

Energy [mJ]

case   i-cache   d-cache   mem     sw      hw      total
0      44.79     17.98     2.305   74.32   n/a     140.92
1      40.54     14.84     2.464   49.33   14.09   121.26
2      35.36     13.14     2.944   27.16   0.709   79.31
3      39.43     15.93     2.520   49.51   0.156   107.55
4      14.59     4.127     1.372   16.99   22.27   59.35

Table 1: Contribution of all system resources to total energy dissipation.

5 Results

We distinguish five different cases in the following discussion:

case 0: the all-software solution

case 1: [...] is synthesized in hardware, the rest is software

case 2: [...] in hardware, the rest is software

case 3: [...] in hardware, the rest is software

case 4: [...] and [...] in hardware, the rest is software

Our primary goal is to reduce the total energy dissipation rather than the power dissipation (see Section 1). In practice this means that even an (ASIC) hardware with an average power dissipation close to, for example, the processor's power dissipation might be an appropriate candidate for reducing system energy, provided that its throughput is high due to a high utilization of internal resources (FUs). Furthermore, we assume that the clock of the whole processor and of the hardware is gated. So, whenever the processor is executing and the hardware is idle (they operate mutually exclusively), no energy is dissipated by the hardware. In the complementary case, the processor behaves accordingly. For practical reasons (keeping the run times of VHDL simulation and behavioral simulation reasonable), all obtained numbers refer to a sequence of six frames running through the MPEG-2 encoder. Each frame has a size of 64 × 64 pixels. This configuration is appropriate to ensure that all parts of the encoder work almost as in the steady state. The current energy model of our framework is valid for a SPARCLite processor running at 33 MHz. The hardware is working at 15 MHz (delivered by logic synthesis through use of NEC's CMOS6 library). The memory parameters have been optimized for low energy dissipation by use of our framework: instruction cache size and data cache size are 2 KB each, the associativity is 2 and the line size is 4 for both caches. The main memory size is 128 KB. The main results are shown in Tab. 1: each of the selected co-designs of cases 1 to 4 yields an energy reduction (up to 59%, see also Fig. 5) compared to the reference case 0. The average power dissipation is shown in Tab. 2, last column. According to Tab. 1, a larger reduction of the energy dissipation through an even larger hardware is hardly possible, since a great amount of energy is dissipated within the memory hierarchy (e.g. 47.7% in case 1).

        Software                        Hardware (ASIC)
case    time [s]   #cycles (33 MHz)     time [ms]   #cycles (15 MHz)   #cells   av. power [mW]
0       0.155      5,167,958            n/a         n/a                n/a      n/a
1       0.110      3,682,978            87.27       1,309,068          43,364   161.48
2       0.051      1,697,951            23.82       357,408            8,600    29.79
3       0.111      3,693,138            3.31        49,632             18,412   47.07
4       0.033      1,093,911            110.09      n/a                51,298   202.28

Table 2: Execution time of SW and HW. Last two columns: hardware effort and average power dissipation of synthesized hardware.

Of course, energy dissipation could simply be reduced by reducing the performance. But our aim was to keep the performance approximately the same while reducing the energy. These results are shown in Tab. 2, columns 2 and 4 (time) and columns 3 and 5 (cycles). The results in terms of energy reduction and performance change (compared to case 0) are shown in Fig. 5. Only in case 1 has a performance decrease occurred. In all other cases, the performance is even moderately higher.
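The performance comparison can be retraced from the execution times in Tab. 2 with a short sketch. It assumes that the total run time of a case is the sum of its software and hardware times, which follows from the statement that processor and hardware operate mutually exclusively; whether Fig. 5 was computed in exactly this way is an assumption. The function names are hypothetical.

```python
# Execution times copied from Tab. 2 (sequence of six frames).
SW_TIME_S = {0: 0.155, 1: 0.110, 2: 0.051, 3: 0.111, 4: 0.033}
HW_TIME_MS = {0: 0.0, 1: 87.27, 2: 23.82, 3: 3.31, 4: 110.09}


def total_time_s(case: int) -> float:
    """Assumed total run time: software and hardware never run concurrently."""
    return SW_TIME_S[case] + HW_TIME_MS[case] / 1000.0


def slower_than_software(case: int) -> bool:
    """True if the partitioned system takes longer than the all-software case 0."""
    return total_time_s(case) > total_time_s(0)


if __name__ == "__main__":
    for case in sorted(SW_TIME_S):
        print(case, f"{total_time_s(case):.3f} s", slower_than_software(case))
```

Under this assumption only case 1 exceeds the run time of the all-software solution, in line with the statement above.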

The decrease in energy dissipation is due to the high operator parallelism within the hardware as well as to the high degree of utilization of each FU. To achieve these results, the system parts to be implemented in hardware have to be selected thoroughly. If the presuppositions described in Section 4.1 are not fulfilled, a hardware will even increase the energy dissipation.
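That energy rather than power is the relevant quantity can be checked against the published numbers. The sketch below multiplies the average hardware power by the hardware execution time from Tab. 2 and compares the result with the hw column of Tab. 1; it also recomputes the share of the memory hierarchy for case 1. The function names are hypothetical, and the share is computed from the sum of the listed components.

```python
# (average power [mW], hardware time [ms]) per case, copied from Tab. 2.
HW_POWER_TIME = {1: (161.48, 87.27), 2: (29.79, 23.82),
                 3: (47.07, 3.31), 4: (202.28, 110.09)}
# Hardware energy [mJ] per case, copied from Tab. 1.
HW_ENERGY_MJ = {1: 14.09, 2: 0.709, 3: 0.156, 4: 22.27}
# Energy contributions [mJ] of case 1, copied from Tab. 1.
CASE1_MJ = {"i-cache": 40.54, "d-cache": 14.84, "mem": 2.464,
            "sw": 49.33, "hw": 14.09}


def hw_energy_mj(case: int) -> float:
    """Energy = average power * time; mW * ms gives microjoules, /1000 gives mJ."""
    power_mw, time_ms = HW_POWER_TIME[case]
    return power_mw * time_ms / 1000.0


def memory_share_percent(parts: dict) -> float:
    """Share of caches and main memory in the summed energy contributions."""
    memory = parts["i-cache"] + parts["d-cache"] + parts["mem"]
    return 100.0 * memory / sum(parts.values())


if __name__ == "__main__":
    for case in sorted(HW_POWER_TIME):
        print(case, f"{hw_energy_mj(case):.3f} mJ vs. {HW_ENERGY_MJ[case]} mJ")
    print(f"memory share, case 1: {memory_share_percent(CASE1_MJ):.1f}%")
```

The products agree with the hw column of Tab. 1 to within rounding. Case 4 is instructive: its hardware has the highest average power in Tab. 2, yet the system as a whole has the lowest total energy in Tab. 1.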

Another restriction is the hardware cost of an ASIC. In our case the largest ASIC comprised about 51k cells, the smallest less than 9k cells (Tab. 2). Furthermore, it should be noted that the partitioning was done from a designer's point of view, at the functional level. Investigation of the schedule revealed that in parts of the synthesized functions the utilization rate of HW resources was much smaller than the average utilization within the same function. Apparently, an even better reduction of energy dissipation would have been achieved through a finer-grain partitioning.

6 Conclusion

We have presented a case study for energy-conscious HW/SW-partitioning. Based on the idea that an application-specific hardware can be much more energy efficient due to higher resource utilization, energy savings of up to 59% have been achieved compared to a pure software solution. Furthermore, it has been shown that estimating the energy dissipation of a whole system is not a trivial task, since the sources of energy dissipation are manifold and depend on each other. This requires tools that can capture these interdependencies. We therefore presented our framework for estimating and optimizing system energy dissipation.

7 Acknowledgment

We would like to thank Bo Tao from Princeton University who provided us with his version of the MPEG-2 encoder source code.

Figure 5: Energy reduction and performance change in %.

References

[1] Y. Nunomura, T. Shimizu, O. Tomisawa, M32/D - Integrating DRAM and Microprocessor, IEEE Micro Magazine, Vol. 17, No. 6, pp. 40–48, 1997.

[2] V. Tiwari, S. Malik, A. Wolfe, Instruction Level Power Analysis and Optimization of Software, Kluwer Academic Publishers, Journal of VLSI Signal Processing, pp. 1–18, 1996.

[3] P.-W. Ong, R.-H. Ynn, Power-Conscious Software Design - a framework for modeling software on hardware, IEEE Proc. of Symposium on Low Power Electronics, pp. 36–37, 1994.

[4] T. Sato, M. Nagamatsu, H. Tago, Power and Performance Simulator: ESP and its Application for 100 MIPS/W Class RISC Design, IEEE Proc. of Symposium on Low Power Electronics, pp. 46–47, 1994.

[5] T. Burd, B. Peters, A Power Analysis of a Microprocessor: A Study of an Implementation of the MIPS R3000 Architecture, Technical Report, University of California at Berkeley, May 1994.

[6] R. Gonzales, M. Horowitz, Energy Dissipation in General Purpose Processors, IEEE Proc. of Symposium on Low Power Electronics, pp. 12–13, 1995.

[7] M. B. Kamble, K. Ghose, Analytical Energy Dissipation Models For Low Power Caches, IEEE Proc. of Symposium on Low Power Electronics and Design, pp. 143–148, 1997.

[8] A. Raghunathan, S. Dey, N. Jha, Register-Transfer Level Estimation Techniques for Switching Activity and Power Consumption, IEEE Proc. of Int. Conf. on CAD (ICCAD 96), pp. 158–165, 1996.

[9] I. Hong, D. Kirovski, M. Potkonjak, Potential-Driven Statistical Ordering of Transformations, IEEE Proc. of 34th Design Automation Conference (DAC 97), pp. 347–352, 1997.

[10] D. Kirovski, M. Potkonjak, System-Level Synthesis of Low-Power Hard Real-Time Systems, IEEE Proc. of 34th Design Automation Conference (DAC 97), pp. 697–702, 1997.

[11] B. P. Dave, G. Lakshminarayana, N. K. Jha, COSYN: Hardware-Software Co-Synthesis of Embedded Systems, IEEE Proc. of 34th Design Automation Conference (DAC 97), pp. 703–708, 1997.

[12] R. K. Gupta, G. D. Micheli, System-level Synthesis using Re-programmable Components, IEEE/ACM Proc. of EDAC'92, pp. 2–7, 1992.

[13] F. Vahid, D. D. Gajski, J. Gong, A Binary-Constraint Search Algorithm for Minimizing Hardware during Hardware/Software Partitioning, IEEE/ACM Proc. of EuroDAC'94, pp. 214–219, 1994.

[14] P. V. Knudsen, J. Madsen, PACE: A Dynamic Programming Algorithm for Hardware/Software Partitioning, IEEE Proc. of Codes/CASHE'96, pp. 85–92, 1996.

[15] J. Henkel, R. Ernst, A Hardware/Software Partitioner using a dynamically determined Granularity, IEEE Proc. of 34th Design Automation Conference (DAC 97), pp. 691–696, 1997.

[16] J. Henkel, Y. Li, An Approach for Estimating and Minimizing Energy Dissipation of Embedded HW/SW Systems, NEC USA Inc., Technical Report #97-C071-4-5110-2, Oct. 1997.

[17] S. J. E. Wilton, N. P. Jouppi, An Enhanced Access and Cycle Time Model for On-Chip Caches, DEC, WRL Research Report 93/5, July 1994.

[18] D. L. Weaver, T. Germond, The SPARC Architecture Manual, Version 9, PTR Prentice Hall, 1994.

[19] M. D. Hill, J. R. Larus, A. R. Lebeck et al., WARTS: Wisconsin Architectural Research Tool Set, Computer Science Department, University of Wisconsin.

[20] K. Itoh, K. Sasaki, Y. Nakagome, Trends in Low-Power RAM Circuit Technologies, Proceedings of the IEEE, Vol. 83, No. 4, April 1995.

[21] P. K. Andleigh, K. Thakrar, Multimedia Systems Design, Prentice Hall, 1996.

[22] V-System/Workstation User's Manual, VHDL Simulation for Workstations, Model Technology, 1995.

[23] R. Ernst, W. Ye, Embedded program timing analysis based on path clustering and architectural classification, IEEE Proc. of Int. Conf. on CAD (ICCAD 97), pp. 598–604, 1997.