

DYNAMIC MULTICORE RESOURCE MANAGEMENT: A MACHINE LEARNING APPROACH

A MACHINE LEARNING APPROACH TO MULTICORE RESOURCE MANAGEMENT PRODUCES SELF-OPTIMIZING ON-CHIP HARDWARE AGENTS CAPABLE OF LEARNING, PLANNING, AND CONTINUOUSLY ADAPTING TO CHANGING WORKLOAD DEMANDS. THIS RESULTS IN MORE EFFICIENT AND FLEXIBLE MANAGEMENT OF CRITICAL HARDWARE RESOURCES AT RUNTIME.

Over the last five years, multicore architectures have emerged as the primary mechanism to reap the benefits of Moore's Law in the billion-transistor era. If CMOS scaling continues to follow Moore's Law, multicore systems could deliver as many as twice the number of cores (and cache space) with every technology generation. Unfortunately, not all multicore resources scale as easily. For example, whereas these systems' off-chip bandwidth requirements will likely grow proportionally to the number of cores, the number of signaling pins in conventional packaging technologies grows by only 10 percent with each technology generation. Also, because of limited projected supply voltage scaling, the average switching power per transistor will likely decrease by only 40 percent per technology node.

Based on these trends and other factors, we predict that effective resource management in future multicore architectures will be most important to delivering cost-effective, high-performance products. This includes problems such as multiresource allocation (for example, cache, power budget, and memory bandwidth allocation), memory-scheduling optimization, cross-core load balancing, and instruction steering in clustered organizations. Alas, how these architectural mechanisms operate is traditionally fully specified by human experts at design time, and as a result they tend to be somewhat rigid and unsophisticated.

Machine learning is the study of computer programs and algorithms that learn about their environment and improve automatically with experience. Researchers and practitioners have successfully applied machine learning techniques and tools in a variety of contexts to tackle resource allocation, scheduling, load balancing, and design space exploration, among other problems (see the "Related work in machine learning" sidebar). In this article, we advocate dynamic multicore resource management based on machine learning techniques. Our position is that the expert's time is best invested in determining which objective function to optimize, which automatic learning technique best suits the problem, and which variables should drive the resulting self-optimizing mechanism at runtime. This approach thus contrasts with today's predominant approach of directly specifying at design time how the hardware should accomplish the desired goal. We illustrate this approach's potential by describing two recent cases that have shown improvement significantly beyond the state of the art.

José F. Martínez
Cornell University

Engin İpek
University of Rochester

Published by the IEEE Computer Society. 0272-1732/09/$26.00 © 2009 IEEE


Related work in machine learning

Researchers and practitioners have successfully applied machine learning techniques and tools to important real-life problems, including not only resource allocation, but also scheduling, load balancing, and design space exploration.

Tesauro et al. explore a decompositional reinforcement learning approach to making autonomic resource-allocation decisions in data centers.1 Vengerov proposes a reinforcement learning framework to perform adaptive job scheduling in soft real-time systems, in which resource-allocation decisions must be made concurrently with scheduling decisions.2 Fedorova et al. use reinforcement learning to produce operating system scheduling policies that balance optimal performance, core assignment balance, and response time fairness in a heterogeneous multicore system.3

Vengerov et al. apply fuzzy rule bases and reinforcement learning to dynamic load balancing in distributed Web caching.4 Platt et al. use approximate Bayesian inference to diagnose problems with Web servers.5 Their proposed framework lets them analyze billions of network logs within a few hours. Ganapathi et al. leverage supervised learning and principal component analysis for query optimization in a large data warehouse.6 Whiteson and Stone use Q-learning in network routing and job scheduling.7

Researchers have also used machine learning techniques in code generation and optimization. Calder et al. use neural networks and decision trees to predict conditional branch directions statically, using static program features.8 McGovern and Moss apply reinforcement learning and Monte Carlo roll-outs to code scheduling.9 They find that the learned scheduling policies outperform commercial compilers when generating code for the Alpha 21064 microprocessor. Coons et al. use reinforcement learning and genetic algorithms to learn instruction placement heuristics for distributed microarchitectures.10 In this case, the derived policies match the performance of highly hand-tuned, specialized heuristics.

Few researchers have applied machine learning to microarchitecture (at runtime or otherwise). Jiménez and Lin propose a perceptron-based dynamic branch predictor, which significantly outperforms prior table-based prediction schemes.11 Several years earlier, Emer and Gloy used genetic programming for automatic design-time derivation of hardware predictors.12 İpek et al. use artificial neural networks to cull high-performing configurations from very large design spaces for both single-core and multicore microarchitectures.13

References

1. G. Tesauro et al., "Online Resource Allocation Using Decompositional Reinforcement Learning," Proc. Nat'l Conf. Artificial Intelligence (AAAI 05), AAAI Press, 2005, pp. 287-299.
2. D. Vengerov, "Adaptive Utility-based Scheduling in Resource-Constrained Systems," Proc. Australian Joint Conf. Artificial Intelligence (AI 05), LNCS 3809, Springer, 2005, pp. 477-488.
3. A. Fedorova, D. Vengerov, and D. Doucette, "Operating System Scheduling on Heterogeneous Core Systems," Proc. Operating System Support for Heterogeneous Multicore Architectures (OSHMA), 2007; http://research.sun.com/people/vengerov/OSHMA_2007.pdf.
4. D. Vengerov, H.R. Berenji, and A. Vengerov, "Adaptive Coordination among Fuzzy Reinforcement Learning Agents Performing Distributed Dynamic Load Balancing," Proc. IEEE Int'l Conf. Fuzzy Systems (FUZZ-IEEE 02), IEEE Press, 2002, pp. 179-184.
5. J. Platt, E. Kiciman, and D. Maltz, "Fast Variational Inference for Large-Scale Internet Diagnosis," Proc. Advances in Neural Information Processing Systems Conf. (NIPS 07), 2007; http://books.nips.cc/nips20.html.
6. A. Ganapathi et al., "Predicting Multiple Performance Metrics for Queries: Better Decisions Enabled by Machine Learning," Proc. Int'l Conf. Data Eng. (ICDE 09), IEEE CS Press, 2009, pp. 592-603.
7. S. Whiteson and P. Stone, "Adaptive Job Routing and Scheduling," Eng. Applications of Artificial Intelligence, vol. 17, no. 7, 2004, pp. 855-869.
8. B. Calder et al., "Evidence-based Static Branch Prediction Using Machine Learning," ACM Trans. Programming Languages and Systems (TOPLAS), vol. 19, no. 1, 1997, pp. 188-222.
9. A. McGovern and E. Moss, "Scheduling Straight Line Code Using Reinforcement Learning and Rollouts," Proc. Advances in Neural Information Processing Systems Conf. (NIPS 99), MIT Press, 1999, pp. 903-909.
10. K.E. Coons et al., "Feature Selection and Policy Optimization for Distributed Instruction Placement Using Reinforcement Learning," Proc. Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 08), ACM Press, 2008, pp. 32-42.
11. D.A. Jiménez and C. Lin, "Dynamic Branch Prediction with Perceptrons," Proc. Int'l Symp. High-Performance Computer Architecture (HPCA 01), IEEE CS Press, 2001, p. 197.
12. J. Emer and N. Gloy, "A Language for Describing Predictors and Its Application to Automatic Synthesis," Proc. Int'l Symp. Computer Architecture (ISCA 97), ACM Press, 1997, pp. 304-314.
13. E. İpek et al., "Efficiently Exploring Architectural Design Spaces via Predictive Modeling," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 06), ACM Press, 2006, pp. 195-206.




Case 1: DRAM scheduling

Providing workloads with adequate off-chip memory bandwidth is often critical to delivering high performance. Unfortunately, DRAM scheduling is a complex problem, requiring a delicate balance between circumventing many access-scheduling constraints (for example, more than 50 timing constraints in Micron's DDR2-800 products), prioritizing requests properly, and adapting to a dynamically changing memory reference stream. When confronted with this challenge, existing memory controllers tend to sustain only a small fraction of the peak bandwidth.1 The end result is either a significant performance hit, or an overprovisioned (and therefore expensive) memory system.2-4

Figure 1 shows a representative example of potential performance loss due to access-scheduling constraints with an existing DRAM controller. Figure 1a shows the sustained DRAM bandwidth of an example parallel application (the Scalable Parallel Classifier [Scalparc]5) executing on a detailed simulation model of a quadcore-based architecture with both a realistic, contemporary controller design (using the first ready, first come, first served [FR-FCFS] scheduling policy1), and an optimistic (and unrealizable) design that can sustain 100 percent of the controller's peak bandwidth, provided enough demand. With an optimistic scheduler, Scalparc uses more than 85 percent of the memory bandwidth effectively. With the realistic scheduler, however, the application's data bus utilization falls to 46 percent. Figure 1b shows this scheduling inefficiency's impact on average L2 load miss penalty. Whereas the optimistic scheduler attains an average load miss penalty of 482 cycles, the realistic scheduler experiences an 801-cycle average penalty. The end result is a 55 percent performance loss compared to the optimistic scheduler (Figure 1c).
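To make the baseline concrete, the sketch below shows, in simplified form, how an FR-FCFS-style scheduler might prioritize pending requests: requests that hit a currently open row ("first ready") are served before others, with ties broken by arrival order ("first come, first served"). This is purely illustrative and not the article's code; the Request record and the open_rows map are assumptions of the sketch, and a real controller must also respect the many timing constraints mentioned above.

```python
# Illustrative sketch of FR-FCFS-style request prioritization (not from the article).
from dataclasses import dataclass

@dataclass
class Request:
    bank: int
    row: int
    arrival_time: int  # smaller means older

def frfcfs_pick(requests, open_rows):
    """Pick the next request: row hits first, then oldest first.

    requests:  list of pending Request objects
    open_rows: dict mapping bank -> currently open row in that bank
    """
    def priority(req):
        row_hit = open_rows.get(req.bank) == req.row
        # Row hits ("first ready") sort ahead; ties broken by age (FCFS).
        return (not row_hit, req.arrival_time)

    return min(requests, key=priority) if requests else None
```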

In current memory controller designs, a human expert typically chooses a few attributes that will likely be relevant to optimizing a particular performance target (for example, a request's average waiting time), based on prior experience. The expert then devises a fixed, rigid scheduling policy incorporating these attributes, and evaluates it in a simulation model.

The resulting controller usually lacks two important functionalities:

- It cannot anticipate its scheduling decisions' long-term consequences (that is, it cannot do long-term planning).
- It cannot generalize and use the experience obtained through past scheduling decisions to act successfully in new system states (that is, it cannot learn).

DRAM scheduling based on reinforcement learning

Case 1 uses reinforcement learning (sometimes called learning from interaction).6 Reinforcement learning studies how autonomous agents in stochastic environments can learn to maximize the cumulative sum of a numerical reward signal received over time through interaction with their environment.7,8

[Figure 1. Performance comparison of a first ready, first come, first served (FR-FCFS) versus an optimistic memory controller that can sustain 100 percent of the peak bandwidth (provided enough demand) for the Scalparc application: (a) data bus utilization, (b) L2 load miss penalty, and (c) speedup.]



Figure 2a depicts the agent-environment interface that defines a reinforcement learning agent's operation. The agent interacts with its environment over a discrete set of time steps. At each step, the agent senses its environment's current state and executes an action. This results in a change in the environment's state (which the agent can sense in the next time step) and produces an immediate reward. The agent's goal is to maximize its long-term cumulative reward by learning an optimal policy that maps states to actions.

Figure 2b shows how a self-optimizing DRAM controller fits within this framework. The agent represents the DRAM command scheduler, while the environment comprises the rest of the system: cores, caches, buses, DRAM banks, and the scheduling queue are all part of the agent's environment. Each time step corresponds to a DRAM clock cycle, during which the agent can observe the system state and execute an action. (Relevant attributes for describing the environment's state are determined via an automated, offline feature selection process at design time.) The actions available to the agent cover the legal DRAM commands that can be issued in the subsequent DRAM cycle (precharge, activate, read, or write). Finally, to motivate the agent to maximize data bus utilization, the controller rewards the agent with +1 each time it issues a command that uses the data bus (that is, a read or write command) and with zero all other times.
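The sketch below restates that mapping in code form: a generic agent-environment loop over DRAM cycles with the +1/0 data-bus reward just described. The env and agent objects and their methods are hypothetical stand-ins for illustration, not an actual controller interface.

```python
# Hedged sketch of the agent-environment loop described above.
# `env` and `agent` are hypothetical: env.state() returns the current state
# attributes, env.step(cmd) applies a legal DRAM command, and the agent
# chooses commands and learns from (state, action, reward, next_state).
DATA_BUS_COMMANDS = {"read", "write"}

def reward(command: str) -> int:
    # +1 whenever the scheduled command uses the data bus, 0 otherwise.
    return 1 if command in DATA_BUS_COMMANDS else 0

def run(env, agent, cycles: int):
    state = env.state()
    for _ in range(cycles):
        command = agent.act(state)          # pick a legal command this cycle
        env.step(command)
        r = reward(command)
        next_state = env.state()
        agent.learn(state, command, r, next_state)
        state = next_state
```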

Reinforcement learning agents face three major challenges. The first challenge is temporal credit assignment. The agent must learn how to assign credit and blame to past actions for each observed immediate reward. In some cases, a seemingly desirable action that yields a high immediate reward drives the system toward undesirable, stagnant states that offer no rewards. At other times, executing an action with no immediate reward is critical to reaching desirable future states. For example, a write command that fills an otherwise unused DRAM data bus cycle might result in several future cycles in which DRAM bandwidth is underused, whereas a precharge command that does not result in any immediate data bus utilization might facilitate better bandwidth utilization in future cycles. Hence, acting optimally requires planning: the agent must anticipate its actions' future consequences and act to maximize its long-term cumulative payoffs. To do this, a reinforcement learning-based memory controller records a Q-value indicating the expected long-term cumulative reward value of scheduling a command in a given system state. The scheduler iteratively approximates Q-values based on trial-and-error interaction with the system.
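As one concrete illustration of iteratively approximating Q-values, the sketch below uses a plain dictionary-based, SARSA-style update. The learning rate, discount factor, and flat table representation are assumptions made for illustration only; the actual proposal stores its estimates in the hardware structure described below rather than in a flat table.

```python
# Illustrative, table-based Q-value update (SARSA-style); constants and the
# flat dictionary are assumptions of this sketch, not the controller's design.
ALPHA = 0.1   # learning rate
GAMMA = 0.95  # discount factor weighting long-term rewards

q_table = {}  # (state, action) -> estimated long-term cumulative reward

def q_update(state, action, r, next_state, next_action):
    key, next_key = (state, action), (next_state, next_action)
    q_sa = q_table.get(key, 0.0)
    q_next = q_table.get(next_key, 0.0)
    # Nudge the estimate toward the immediate reward plus the discounted
    # value of the command chosen in the following DRAM cycle.
    q_table[key] = q_sa + ALPHA * (r + GAMMA * q_next - q_sa)
```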

Second, the agent must balance exploration versus exploitation. The agent must explore its environment sufficiently (and collect training data) before it can learn a high-performance control policy, but it must also exploit the best policy it has found at any point in time. Too little exploration of the environment can cause the agent to commit to suboptimal policies early on, whereas excessive exploration can result in long periods during which the agent executes suboptimal actions. Furthermore, the agent must continue exploring its environment and improving its policy (life-long learning) to accommodate changes in its environment (due, for example, to phase changes or context switches).

[Figure 2. A reinforcement learning framework. (a) An intelligent agent based on reinforcement learning principles reacts to changes in its environment by executing actions that change the environment's state: it observes state s(t) and reward r(t) and issues action a(t + 1). (b) A DRAM scheduler as a reinforcement learning agent observes its environment (state attributes and data bus utilization) in time steps corresponding to a DRAM clock cycle and issues the scheduled command accordingly.]



To balance exploration and exploitation, a reinforcement learning-based memory controller can implement a simple yet effective exploration mechanism known as ε-greedy action selection. Each DRAM cycle, the scheduler randomizes its scheduling decision by picking a random (but legal) command with a small probability ε.
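A minimal sketch of ε-greedy command selection follows, assuming a q_of helper that returns the Q-value estimate for a legal command; the exploration rate shown is illustrative, not the value used in the article.

```python
import random

EPSILON = 0.05  # illustrative exploration probability

def epsilon_greedy(legal_commands, q_of):
    """Mostly pick the highest-Q legal command; occasionally pick at random."""
    if random.random() < EPSILON:
        return random.choice(legal_commands)   # explore
    return max(legal_commands, key=q_of)       # exploit the current policy
```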

The third challenge is generalization. Because the state space's size is exponential in the number of attributes considered, an overwhelming number of possible states can represent the agent's environment. In such cases, the agent is highly unlikely to experience the same state more than once over its lifetime. Consequently, the only way to learn a mapping from states to actions is to generalize and apply the experience gathered over previously encountered (but different) system states to act successfully in new states. To do this, a reinforcement learning-based memory controller records its Q-value estimates in a practical hardware data structure called a cerebellar model arithmetic computer (CMAC). A CMAC model facilitates generalization by effectively quantizing the state space, while also allowing for fast lookups in a realistic hardware setting.
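The sketch below shows, in software form, the kind of coarse tile coding a CMAC performs: several offset tilings quantize the state attributes, and the Q estimate is the sum of one small weight per tiling, so similar states share tiles and therefore generalize. The table sizes, hashing, and quantization step are assumptions for illustration; the real structure is a hardware table rather than Python lists.

```python
# Software sketch of CMAC-style function approximation (illustrative sizes).
NUM_TILINGS = 4
TILES_PER_TILING = 256

weights = [[0.0] * TILES_PER_TILING for _ in range(NUM_TILINGS)]

def tile_index(features, tiling):
    # Quantize the state attributes; offsetting each tiling makes nearby
    # states share some, but not all, tiles.
    quantized = tuple(int(f * 8) + tiling for f in features)
    return hash(quantized) % TILES_PER_TILING

def cmac_q(features):
    # The Q estimate is the sum of one weight from each tiling.
    return sum(weights[t][tile_index(features, t)] for t in range(NUM_TILINGS))

def cmac_update(features, target, alpha=0.1):
    # Spread the correction evenly over the active tiles.
    error = target - cmac_q(features)
    for t in range(NUM_TILINGS):
        weights[t][tile_index(features, t)] += (alpha / NUM_TILINGS) * error
```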

Figure 3 gives an overview of the proposed reinforcement learning-based memory controller. In each DRAM cycle, the scheduler examines valid transaction queue entries, each requiring a precharge, activate, read, or write command to be scheduled next.

The scheduler aims to maximize DRAM utilization by choosing the legal command with the highest expected long-term performance benefit under the optimal policy. To do this, the scheduler first derives a state-action pair for each candidate command under the current system state, and uses this information to calculate the corresponding Q-values. The scheduler implements its control policy by scheduling the command with the highest Q-value each DRAM cycle.

Results

İpek et al. compare the proposed reinforcement learning-based memory controller to various other schedulers,6 including the following:

- Rixner et al.'s FR-FCFS scheduling policy,1,9
- a conventional in-order memory controller,1 and
- an optimistic (that is, ideally efficient) scheduler that can sustain 100 percent of the peak DRAM throughput given enough demand.

The experimental setup uses nine memory-intensive parallel workloads, representing a mix of scalable scientific and data mining applications.

Figure 4 shows the results. On a four-core chip with a single-channel DDR2-800 memory subsystem (6.4 GBytes per second peak bandwidth), the reinforcement learning-based memory controller improves the performance of a set of parallel applications by 19 percent on average (up to 33 percent), and DRAM bandwidth utilization by 22 percent on average, over an FR-FCFS scheduler.

Additional experiments show that a traditional memory controller design cannot match the reinforcement learning-based memory controller's performance, even if the best attributes (as determined by the offline feature-selection procedure) are available. In addition, the proposed online reinforcement learning-based scheduling policy's ability to continuously adapt to changes in workload demands significantly outperforms a fixed policy designed offline using reinforcement learning through simulations, and then installed into the chip.

[Figure 3. High-level overview of a reinforcement learning-based scheduler. The transaction queue holds entries with valid, bank, row, column, data, and request-state fields; the scheduler exchanges state, reward, action, address, and command signals with the DRAM. In each DRAM cycle, the scheduler examines valid transaction queue entries, each requiring a precharge, activate, read, or write command to be scheduled next.]



Case 2: Multiresource allocation

Although resources that are relatively independent of one another can be managed in isolation, many shared resources on multicore systems interact. Unrestricted sharing of microarchitectural resources can lead to destructive interference. Although several proposals that address the management of a single microarchitectural resource exist in the literature,10,11,16 proposals to manage multiple interacting resources on multicore chips at runtime are much scarcer.

Consider, for example, the case of a multicore system in which the on-chip L2 cache space, off-chip bandwidth, and the chip's power budget are shared among applications, and each resource's use is regulated via an independent QoS knob. As the amount of one resource allocated to an application changes, the application's demands on the other resources also change. For example, increasing an application's allocated cache space can cause its working set to fit in the cache, and can dramatically reduce its off-chip bandwidth demand (which could in turn be allocated to other applications with higher demand). Similarly, increasing an application's power budget could cause it to run at a higher frequency and to demand more bandwidth. Hence, the optimal allocation of one resource type depends in part on the allocated amounts of other resources, requiring a coordinated resource-management scheme for optimal performance.

Figure 5 shows an example of performance loss due to uncoordinated resource management in a multicore system incorporating three QoS knobs for regulating the system's shared cache, off-chip bandwidth, and power budget. A four-application, desktop-style multiprogrammed workload is executed on a quadcore chip with an associated DDR2-800 memory subsystem. The figure compares configurations that allocate one or more of the resources in an uncoordinated fashion to a static, fair-share allocation of the resources, as well as an unmanaged sharing scenario in which all applications can access all resources at all times. In this workload, unmanaged resource sharing delivers considerable slowdowns even when compared to a rigid, static resource distribution among the cores (Fair-Share). Moreover, managing a subset of the resources in an uncoordinated fashion is inferior to static partitioning, indicating that resource interactions render individual adaptive management policies largely ineffective.

Multiresource allocation based on artificial neural networks

Case 2 addresses these limitations by proposing a resource-allocation framework that manages multiple shared multicore resources, in a coordinated fashion, to enforce higher-level performance objectives.12 At runtime, and without prior knowledge of the workload's characteristics (for example, application profiles), a resource manager monitors each application's execution, and learns a predictive model of its response to allocation decisions. These predictive models then make it possible to anticipate the system-level performance impact of allocation decisions. The resource manager uses this knowledge to determine what resource distribution to enforce at each point in time.

[Figure 4. Performance comparison. Speedup over FR-FCFS of the in-order, FR-FCFS, RL, and optimistic schedulers for Art, CG, Equake, FFT, MG, Ocean, Radix, Scalparc, Swim, and their geometric mean. The proposed reinforcement learning-based mechanism comes closest to the optimistic scheduler.]



The predictive models are based on artificial neural networks. ANNs are machine learning models that automatically learn to approximate a target function (performance in our case) based on a set of inputs. In a fully connected feed-forward ANN, an input unit passes the data presented to it to all hidden units via a set of weighted edges. Hidden units operate on this data to generate the inputs to the output unit, which in turn calculates ANN predictions. Hidden and output units form their results by taking a weighted sum of their inputs based on edge weights, and passing this sum through a nonlinear activation function. Increasing the number of hidden units in an ANN leads to better representational power and the ability to model more complex functions, but increases the amount of training data and time required to arrive at accurate models. ANNs represent one of the most powerful machine learning models for nonlinear regression; their representational power is high enough to model multidimensional functions involving complex relationships among variables.
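A minimal, self-contained sketch of the feed-forward computation just described follows: each hidden unit takes a weighted sum of the inputs and passes it through a nonlinear activation, and the output unit combines the hidden activations into a single performance estimate. Layer sizes, random initial weights, and the sigmoid activation are assumptions for illustration, not the article's configuration.

```python
# Minimal feed-forward network with one hidden layer (illustrative only).
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyANN:
    def __init__(self, n_inputs, n_hidden):
        self.w_hidden = [[random.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                         for _ in range(n_hidden)]
        self.w_output = [random.uniform(-0.1, 0.1) for _ in range(n_hidden)]

    def predict(self, inputs):
        # Hidden units: weighted sum of inputs through a nonlinear activation.
        hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)))
                  for row in self.w_hidden]
        # Output unit: weighted sum of hidden activations (performance estimate).
        return sum(w * h for w, h in zip(self.w_output, hidden))
```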

Figure 6 gives an overview of the resource-allocation framework, which comprises dynamically trained, per-application hardware performance models and a global resource manager.

[Figure 6. High-level view of the global resource manager's interaction with each application's performance model. The global resource manager sends program history and candidate resource allocations to per-application performance models (App 0 through App 3) and receives predicted performance in return. The mechanism that we describe uses an ensemble of four artificial neural networks (ANNs) per performance model.]

[Figure 5. Weighted speedup of a quad-app workload under different resource-management schemes (Unmanaged; Cache; Power; BW; Cache+Power; Cache+BW; Power+BW; Cache+Power+BW; Fair-Share). Despite its simplicity, Fair-Share outperforms all other configurations.]



The global resource manager redistributes shared system resources among applications at fixed decision-making intervals, allowing it to respond to dynamic changes in workload behavior.

Each application's performance is modeled as a function of its allocated resources and recent behavior, and an ensemble of ANNs is used to learn an approximation of this function. The global resource manager presents input values summarizing past program behavior and indicating allocated resource amounts at each network's input units, and obtains performance predictions from an output unit. The final performance prediction is obtained by averaging the predictions of all ANNs in the ensemble, a technique that often increases accuracy over a single network and lets us assign confidence levels to ANN predictions.
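A sketch of how an ensemble prediction might be formed, assuming per-network objects with a predict method (such as the TinyANN sketch above): the estimates are averaged, and their spread serves as a rough confidence signal. The choice of standard deviation as the confidence proxy is an assumption of this sketch.

```python
import statistics

def ensemble_predict(networks, inputs):
    """Average per-network estimates; use their spread as a confidence proxy."""
    predictions = [net.predict(inputs) for net in networks]
    mean = statistics.mean(predictions)
    spread = statistics.pstdev(predictions)  # smaller spread -> higher confidence
    return mean, spread
```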

At the end of every interval, the global resource manager searches the space of possible resource allocations by repeatedly querying each application's performance model. The manager then aggregates these predictions into a system-level performance prediction (for example, by calculating the weighted speedup). This process is repeated for a fixed number of query-response iterations on different candidate resource distributions, after which the global resource manager installs the configuration estimated to yield the highest aggregate performance. At no point in this iterative process does the global resource manager try candidate resource distributions live. The iterative querying of the predictive models lets the manager try many candidates in a short amount of time. It also helps avoid the negative impact on performance that live trials of subprime candidates would have.

Ideally, we would like to conduct this search exhaustively on all possible combinations of resource allocations. In practice, however, even a quadcore allocation scenario with three resources can result in more than one billion system configurations, rendering exhaustive search intractable. Instead, a heuristic search algorithm can navigate a relatively small portion of the search space, and install the allocation with the highest estimated performance found during this process. Specifically, the proposed solution uses a modified version of stochastic hill climbing that factors in the confidence level of the ANNs and focuses its trials on the most promising subregions of the global allocation space.
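The following sketch shows one way such a stochastic hill-climbing search could be organized over candidate allocations, querying only the predictive models rather than the live system. The neighbor and predict_system_perf helpers, the iteration count, and the fixed sideways-move probability are assumptions for illustration; in particular, this simplified version omits the ANN confidence weighting described above.

```python
# Hedged sketch of stochastic hill climbing over candidate resource allocations.
import random

def hill_climb(initial_alloc, predict_system_perf, neighbor, iterations=100):
    best, best_perf = initial_alloc, predict_system_perf(initial_alloc)
    current, current_perf = best, best_perf
    for _ in range(iterations):
        candidate = neighbor(current)            # perturb one resource share
        perf = predict_system_perf(candidate)    # query the ANN models, not hardware
        # Stochastic acceptance: usually move uphill, occasionally sideways
        # to escape shallow local maxima.
        if perf > current_perf or random.random() < 0.1:
            current, current_perf = candidate, perf
        if perf > best_perf:
            best, best_perf = candidate, perf
    return best  # allocation installed for the next interval
```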

As a result of context switches and application phase behavior, workloads can exert drastically different demands on each resource at different points in time. To anticipate and respond to dynamically changing workload demands, the global resource manager keeps the predictive models trained by periodically (in 5 percent of the decision intervals) making a random allocation decision, and using the predicted and observed performance to retrain the models.

Results

Bitirgen et al. evaluate the proposed mechanism's efficacy and versatility using four optimization targets:12

- weighted speedups,
- sum of IPCs,
- harmonic mean of normalized IPCs, and
- weighted sum of IPCs.

Here we only show results for weighted speedups; however, additional experiments show that the trends carry over to the other optimization targets. The evaluation uses a simulation model of a quadcore chip with a DDR2-800 memory subsystem, running nine quadcore multiprogrammed workloads, incorporating applications with diverse characteristics from the SPEC 200013 and NASA Advanced Supercomputing suites.14

The evaluation compares the proposed ANN-based mechanism against 10 resource-sharing schemes:

- one design that leaves all resources unmanaged;
- one design that gives each application its fair share of all resources;
- seven schemes that manage either one, two, or all three resources in an uncoordinated fashion; and
- a port of a previously proposed, hill-climbing-based coordinated SMT resource-allocation framework15 to this multicore context.

Figure 7 shows the weighted speedup of each quad-workload, normalized to the Fair-Share configuration.



In general, most configurations with uncoordinated management of resources yield performance inferior to plain Fair-Share. Surprisingly, in some instances, the uncoordinated combination of two or three resource-management policies degrades performance with respect to applying only one of those policies. This suggests that not only are the effects of multiple, uncoordinated resource-management policies not additive, they can actually incur destructive interference. Indeed, Coordinated-HC generally outperforms its uncoordinated counterpart, Cache+Power+BW; however, it is usually still inferior to Fair-Share.

In contrast, by estimating the shape of the allocation space in advance, and by navigating the space via queries that are essentially overhead-free, Coordinated-ANN copes with both the exponential size and dynamic nature of the global resource-allocation problem, significantly outperforming both Coordinated-HC and Fair-Share. Specifically, Coordinated-ANN delivers 14 percent average speedup over Fair-Share, and is consistently the best resource-management scheme across all nine workloads.

Although it would be naïve to presume that either of our techniques can provide an optimum solution to the (daunting) problem of multicore resource management, we believe the two cases presented demonstrate that machine learning-based approaches have the potential to effect a significant leap in the performance and flexibility of future microarchitectures.

[Figure 7. Performance comparison using weighted speedup. Results are normalized to Fair-Share for nine quad-app workloads (ERWG, GRFC, LDAP, MRGT, RCLS, SWTC, TGCD, URFG, VRFE) and their geometric mean, across the Unmanaged, Cache, Power, BW, Cache+Power, Cache+BW, Power+BW, Cache+Power+BW, Fair-Share, Coordinated-HC, and Coordinated-ANN schemes. The proposed Coordinated-ANN scheme consistently outperforms the other configurations.]

References

1. S. Rixner et al., "Memory Access Scheduling," Proc. Int'l Symp. Computer Architecture (ISCA 00), IEEE CS Press, 2000, pp. 128-138.
2. T. Dunigan et al., "Early Evaluation of the Cray X1," Proc. Supercomputing (SC 03), IEEE CS Press, 2003, p. 18.
3. P. Kongetira et al., "Niagara: A 32-Way Multithreaded SPARC Processor," IEEE Micro, vol. 25, no. 2, 2005, pp. 21-29.
4. J.D. McCalpin, "Sustainable Memory Bandwidth in Current High Performance Computers," http://www.cs.virginia.edu/~mccalpin/papers/bandwidth/bandwidth.html.
5. M. Joshi et al., "ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets," Proc. Int'l Parallel Processing Symp. (IPPS 98), IEEE CS Press, 1998, p. 573.
6. E. İpek et al., "Self-Optimizing Memory Controllers: A Reinforcement Learning Approach," Proc. Int'l Symp. Computer Architecture (ISCA 08), IEEE CS Press, 2008, pp. 39-50.
7. D. Bertsekas, Neuro-Dynamic Programming, Athena Scientific, 1996.
8. R. Sutton and A. Barto, Reinforcement Learning, MIT Press, 1998.


9. S. Rixner, "Memory Controller Optimizations for Web Servers," Proc. Int'l Symp. Microarchitecture (MICRO 04), IEEE CS Press, 2004, pp. 355-366.
10. C. Isci et al., "An Analysis of Efficient Multicore Global Power Management Policies: Maximizing Performance for a Given Power Budget," Proc. Int'l Symp. Microarchitecture (MICRO 06), IEEE CS Press, 2006, pp. 347-358.
11. M. Qureshi and Y. Patt, "Utility-based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," Proc. Int'l Symp. Microarchitecture (MICRO 06), IEEE CS Press, 2006, pp. 423-432.
12. R. Bitirgen, E. İpek, and J.F. Martínez, "Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach," Proc. Int'l Symp. Microarchitecture (MICRO 08), IEEE CS Press, 2008, pp. 318-329.
13. J.L. Henning, "SPEC CPU2000: Measuring CPU Performance in the New Millennium," Computer, vol. 33, no. 7, 2000, pp. 28-35.
14. D.H. Bailey et al., NAS Parallel Benchmarks, tech. report RNR-94-007, NASA Ames Research Center, 1994.
15. S. Choi and D. Yeung, "Learning-based SMT Processor Resource Distribution via Hill-Climbing," Proc. Int'l Symp. Computer Architecture (ISCA 06), IEEE CS Press, 2006, pp. 239-251.
16. K.J. Nesbit et al., "Fair Queueing Memory Systems," Proc. Int'l Symp. Microarchitecture (MICRO 06), IEEE CS Press, 2006, pp. 208-222.

José F. Martínez is an associate professor of electrical and computer engineering at Cornell University. His research interests include reconfigurable and self-optimizing architectures. Martínez has a PhD in computer science from the University of Illinois, Urbana-Champaign. He is a member of the ACM and the IEEE.

Engin İpek is an assistant professor of computer science and electrical and computer engineering at the University of Rochester. His research interests include self-optimizing architectures and low-power memory subsystems. İpek has a PhD in electrical and computer engineering from Cornell University. He is a member of the ACM and the IEEE.

Direct questions and comments to José F. Martínez at Cornell Univ., 336 Rhodes Hall, Ithaca, NY, 14850; [email protected].