
Page 1

Loop Parallelization Techniques

André Oliveira, [email protected]

Nov/2011

Page 2

Agenda

• Motivation

• DOALL and Spec-DOALL

• DOACROSS and Spec-DOACROSS

• DSWP and Spec-DSWP

• PS-DSWP and Spec-PS-DSWP

• Conclusions

• References

Page 3

Motivation

• Moore's law and current architectures

Page 4

• The transistor trend

• Clock frequency

• Energy consumption

• Heat dissipation

• Design complexity

Motivation

Page 5

Motivation

• Processes

• Heavy single-threaded applications

• Rewrite them?

• Automatic parallelization

• What should be parallelized?

• Communication and synchronization costs, etc.

Page 6

DOALL

for (i = 0; i < 900000; i++) A[i] = 100;

• All iterations are independent

• They can be processed in parallel (a sketch follows)
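A minimal sketch of how the DOALL loop above can be executed in parallel. The use of OpenMP is my choice for illustration (compile with -fopenmp); the slide does not prescribe a mechanism:

#include <stdio.h>

#define N 900000
static int A[N];

int main(void) {
    /* DOALL: no loop-carried dependences, so the iteration space can be
       split among threads with no synchronization inside the loop. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        A[i] = 100;
    printf("%d %d\n", A[0], A[N - 1]);
    return 0;
}

Without -fopenmp the pragma is ignored and the program runs sequentially, which matches the "when in doubt, correct code" philosophy of the next slides.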

Page 7

DOALL

int *p = &y[7];
for (i = 0; i < N; i++)
    x[i] += f(p);

• Assuming f introduces no loop-carried dependences

• Simplicity (synchronization, communication, etc.) and performance

• Issues: how to map iterations onto cores, cache performance, false sharing

Page 8

Spec-DOALL

int *p = &y[7];
for (i = 0; i < N; i++)
    x[i] += f(p);

• Static compilation

• When in doubt... generate correct code!

• Profiling

• Speculation with HW/SW support (a sketch follows)

• Scientific and graphics applications
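A minimal sketch of the speculation idea for the loop above. The up-front address-range check and the sequential fallback are my simplifications; a real Spec-DOALL system uses profiling to decide what to speculate and HW/SW support to detect and recover from misspeculation during execution:

#include <stdint.h>
#include <stdio.h>

#define N 100000
static int x[N], y[N];

static int f(const int *p) { return *p + 1; }   /* illustrative callee */

int main(void) {
    int *p = &y[7];
    /* Speculation: p does not point into x, so f(p) never observes a
       value written by the loop and the iterations are independent. */
    uintptr_t pa = (uintptr_t)p;
    if (pa < (uintptr_t)&x[0] || pa >= (uintptr_t)&x[N]) {
        #pragma omp parallel for        /* speculation holds: run as DOALL */
        for (int i = 0; i < N; i++)
            x[i] += f(p);
    } else {
        for (int i = 0; i < N; i++)     /* would misspeculate: stay sequential */
            x[i] += f(p);
    }
    printf("%d\n", x[N - 1]);
    return 0;
}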

Page 9

DOACROSS

• General-purpose applications

• Irregular loops

• Irregular control flow

• DOALL is inapplicable

• Spec-DOALL is infeasible (misspeculation)

• Schedules each iteration onto a thread

Page 10

[Figure: dependence graph with nodes A, B, C, D]

A: while (condition) {
    statement B;
    statement C;
    statement D;
}

DOACROSS - example
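A minimal DOACROSS sketch for a loop of this shape, under assumptions of mine: C11 atomics provide the post/wait flags, two threads take iterations round-robin, and the body mixes a serial recurrence (the A part) with independent work (the B/C/D part):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 1024
#define T 2

static int a[N + 1];
static atomic_int done[N + 1];      /* post/wait flag per iteration */

static int heavy(int v) {           /* independent per-iteration work */
    for (int k = 0; k < 1000; k++)
        v = (v * 31 + 7) % 1000003;
    return v;
}

static void *worker(void *arg) {
    long id = (long)arg;
    for (int i = (int)id + 1; i <= N; i += T) {  /* round-robin iterations */
        int t = heavy(i);                        /* no cross-iteration deps */
        while (!atomic_load(&done[i - 1]))       /* wait(i-1): loop-carried dep */
            ;
        a[i] = a[i - 1] + t;                     /* the serial recurrence */
        atomic_store(&done[i], 1);               /* post(i) */
    }
    return NULL;
}

int main(void) {
    atomic_store(&done[0], 1);      /* iteration 0's result is ready */
    pthread_t th[T];
    for (long t = 0; t < T; t++)
        pthread_create(&th[t], NULL, worker, (void *)t);
    for (long t = 0; t < T; t++)
        pthread_join(th[t], NULL);
    printf("a[N] = %d\n", a[N]);
    return 0;
}

The wait sits on the critical path: an iteration's recurrence cannot start before the previous iteration's post arrives, which is exactly the communication cost discussed on the following slides.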

Page 11

[Figure: iteration schedules of DOACROSS vs. DSWP for the A-B-C-D loop on two cores over cycles 0-13. DOACROSS assigns whole iterations alternately to the cores (Core 1: A1 B1 C1 D1, A3 B3 C3 D3, ...; Core 2: A2 B2 C2 D2, A4 B4 C4 D4, ...). DSWP pins stages to cores (Core 1: A and C of every iteration; Core 2: B and D).]

DOACROSS - example

Page 12

DOACROSS

• Inter-thread communication sits on the critical path

• Performance depends on the efficiency of the communication mechanism

• Shared memory and pressure on the memory system

• Speculation and Spec-DOACROSS

Page 13

DSWP

• General-purpose applications

• Communication and the critical path

Page 14

[Figure: the DOACROSS vs. DSWP schedules from page 11, repeated; here illustrating the DSWP pipeline (Core 1 runs stages A and C, Core 2 runs stages B and D).]

DSWP

Page 15

DSWP - Algorithm

• Build the dependence graph

• Find the SCCs (if there is only one, stop)

• Coalesce the SCCs

• TPP (if no worthwhile partition exists, stop)

• Code splitting

• Flows

Page 16

[Figure: example loop]

DSWP - Example

Page 17

[Figure: example loop]

DSWP - Example

Page 18

[Figure: example loop]

DSWP - Example

Page 19

[Figure: example loop partitioned into threads T1 and T2]

DSWP - Example

Page 20

DSWP - TPP

• P1, P2, ..., Pn

• Sets of SCCs

• 1 <= n <= t (t = number of threads)

• Each SCC belongs to exactly one partition

• For each dependence arc (u, v) with u in Pi and v in Pj, i <= j

• Choose a valid partition with the smallest execution time

• Machine-dependent and NP-complete

Page 21

DSWP - TPP

• Heuristic that maximizes load balance

• Cost measured in cycles

• Latency of each instruction

• Profile

Page 22

DSWP - TPP

• Estimates the load balance (a sketch follows this list)

• Considers the set of SCCs whose predecessors have all been assigned to a partition

• Takes the largest available SCC

• On a tie, takes the one that most reduces the number of dependences leaving the current partition

• When a partition is completed, accounts for the communication cost and checks whether it pays off

• May terminate the algorithm
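A sketch of the flavor of this greedy pass, not the paper's exact algorithm: the cost model is a single cycle count per SCC, the tie-breaking on outgoing dependences and the final profitability check are omitted, and the SCC DAG is assumed acyclic (which it is by construction):

#include <stdio.h>

#define MAX_SCC 16

static int cycles[MAX_SCC];        /* estimated cycles per SCC (from profile) */
static int dep[MAX_SCC][MAX_SCC];  /* dep[u][v] = 1 if SCC u must precede v */
static int part_of[MAX_SCC];       /* output: partition id per SCC */

static int ready(int v, int n) {   /* all predecessors already assigned? */
    for (int u = 0; u < n; u++)
        if (dep[u][v] && part_of[u] < 0) return 0;
    return 1;
}

/* Greedy load balancing: repeatedly move the largest ready SCC into the
   current partition; open the next partition once the current one holds
   its share of the total estimated cycles. Arcs always point from lower-
   to higher-numbered partitions, so the result is a valid pipeline. */
static void tpp(int n, int nthreads) {
    long total = 0;
    for (int v = 0; v < n; v++) { part_of[v] = -1; total += cycles[v]; }
    long target = total / nthreads, load = 0;
    int p = 0;
    for (int assigned = 0; assigned < n; assigned++) {
        int best = -1;
        for (int v = 0; v < n; v++)
            if (part_of[v] < 0 && ready(v, n) &&
                (best < 0 || cycles[v] > cycles[best]))
                best = v;
        part_of[best] = p;
        load += cycles[best];
        if (load >= target && p < nthreads - 1) { p++; load = 0; }
    }
}

int main(void) {
    /* Illustrative 4-SCC chain 0 -> 1 -> 2 -> 3 with skewed costs. */
    int n = 4, costs[] = { 50, 10, 30, 10 };
    for (int v = 0; v < n; v++) cycles[v] = costs[v];
    dep[0][1] = dep[1][2] = dep[2][3] = 1;
    tpp(n, 2);
    for (int v = 0; v < n; v++)
        printf("SCC %d -> partition %d\n", v, part_of[v]);
    return 0;
}

On this input the pass produces the balanced split {0} and {1, 2, 3}, 50 estimated cycles on each side.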

Page 23

[Figure: DAG of SCCs with candidate two-thread partitionings and their resulting speedups: +45%, +48%, +43%, -2%.]

DSWP - TPP

Page 24

DSWP - Code splitting

• BBs that contain an instruction of Pi, or an instruction on which something in Pi depends

• produce and consume instructions

• Create the BBs

• Place the instructions assigned to Pi into the corresponding BBs

• Branch targets and the nearest relevant post-dominator

Page 25

DSWP - Flows

• Data dependence

• Control dependence

• Transmits a branch's direction to its duplicated branch

• Memory/sync dependence

• Only a token is sent, to preserve ordering (a sketch follows this list)
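A minimal two-stage DSWP sketch for a pointer-chasing loop, under assumptions of mine: POSIX threads, a spin-based single-producer/single-consumer ring buffer standing in for the synchronization array, GCC's __sync_synchronize as the fence, and a sentinel value playing the role of the control-flow token that signals loop exit:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define QSIZE 32                        /* stand-in for the synchronization array */

typedef struct node { int val; struct node *next; } node;

static long q[QSIZE];
static volatile long qhead = 0, qtail = 0;

static void produce(long v) {           /* data-dependence flow */
    while (qtail - qhead == QSIZE) ;    /* queue full: spin */
    q[qtail % QSIZE] = v;
    __sync_synchronize();               /* publish data before advancing tail */
    qtail++;
}

static long consume(void) {
    while (qtail == qhead) ;            /* queue empty: spin */
    long v = q[qhead % QSIZE];
    __sync_synchronize();
    qhead++;
    return v;
}

/* Thread 1: the stage holding the pointer-chasing SCC. */
static void *stage1(void *arg) {
    for (node *p = arg; p; p = p->next)
        produce(p->val);
    produce(-1);                        /* control-flow token: loop exit */
    return NULL;
}

/* Thread 2: the rest of the loop body, off the critical path. */
static void *stage2(void *arg) {
    long sum = 0, v;
    while ((v = consume()) != -1)
        sum += v * v;                   /* stand-in for the per-node work */
    *(long *)arg = sum;
    return NULL;
}

int main(void) {
    node *list = NULL;
    for (int i = 1; i <= 1000; i++) {   /* illustrative input list */
        node *n = malloc(sizeof *n);
        n->val = i; n->next = list; list = n;
    }
    long sum = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, list);
    pthread_create(&t2, NULL, stage2, &sum);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("sum = %ld\n", sum);
    return 0;
}

Communication is unidirectional (stage 1 to stage 2 only), which is why DSWP tolerates inter-core latency so much better than DOACROSS.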

Page 26

[Figure: example loop]

DSWP - Flows

Page 27

[Figure: example loop with flows carried over queues L1 and L2]

DSWP - Flows

Page 28

DSWP - Results

• IMPACT compiler

• Accurate dual-core Itanium 2 simulator

Page 29

DSWP - Results

Page 30

[Figure 6: Performance summary, full-width Itanium 2 baseline. (a) Speedup of DSWP over single-threaded, 1.0-1.6, for 129.compress, 179.art, 181.mcf, 183.equake, 188.ammp, 256.bzip2, adpcmdec, epicdec, jpegenc, wc, and their GeoMean, with one bar for Fully Automatic and one for Best Manually Directed partitioning. (b) Baseline and DSWP IPC, 0-6, per benchmark and average, split into BASE, DSWP Core 1, and DSWP Core 2.]

…ing sections of the program while keeping the caches and branch predictors warm.

The first comparison made was between the dual-threaded, DSWP version of the selected loops against their single-threaded, base versions. In this experiment, the latency to produce/consume a value to/from the synchronization array was set to 1 cycle in our simulator (minimum 2 cycles core-to-core). In order to evaluate both the effectiveness of our partitioning heuristic (from Section 2.2.2) and the potential of better heuristics, Figure 6(a) presents two speedup bars per benchmark loop. The first bar is the fully automated DSWP, using the heuristic. The second one corresponds to the best performing partitioning found by iteratively specifying the desired partitioning to the compiler and measuring its resulting performance. Figure 6(a) shows that in many cases the heuristic found the best partitioning we were able to find in our iterative search. The geometric mean across these benchmark loops is 14.4% and 19.4%, for the automatically created and manually specified partitions respectively. In terms of whole-program speedup, these geometric means translate into 6.6% and 9.2% respectively. The average baseline IPC is 1.65 and the IPC averages for the producer and consumer cores are 0.88 and 1.24 respectively as shown in Figure 6(b). Notice that these IPC numbers do not include the produce and consume instructions inserted by DSWP.

For simplicity, the simulator used did not model the cost of coherence protocol. To gauge the effect of this coherence cost on our results, we analyzed, for all benchmarks, the memory traces of both cores for false-sharings. Notice that true-sharings cannot occur inside the loop because of the memory dependence arcs in the dependence graph. If a load and a store may access the same memory address, there will be arcs in both directions between these instructions, because of the RAW and WAR dependences (one loop-carried, and the other intra-iteration). This makes both instructions part of the same SCC, assigning them to the same core. Future work may relax this condition.

In order to analyze false-sharings, we replayed the memory accesses from the traces in an invalidation-based coherence model offline. Out of the nine applications, only three (181.mcf, 256.bzip2, and jpegenc) exhibited false-sharing. In 181.mcf and jpegenc, the false-sharing was always caused by writes in the consumer core to locations already present in the producer core's L1D cache. While in 181.mcf, the miss-rate of the producer core's L1D went up by 0.01% to 98.62%, in jpegenc, there was no change in the miss-rate of both cores. The reason why the miss-rate is unaffected is because although there is false-sharing, the producer core always runs ahead and accesses any locations it needs before those locations are invalidated by writes to memory from the consumer's core. 256.bzip2 has a slightly more interesting behavior. We find that a particular store from the producer core causes a lot of false-sharing and hence invalidations in the consumer's L1 data cache. Since the consumer trails the producer, any extra latency arising out of such events could adversely affect the consumer thread and ought to have been modeled. However, on examining the code in detail, we found that all these coherence conflicts are caused by false-sharing due to a write to a global variable (bsLive) in the producer core. We promoted this global variable to a register and used the modified version of 256.bzip2 for all experiments. Thus, even without coherence modeling, our results are valid and not overstated.

Figure 7 illustrates the importance of balancing work across threads when partitioning loops. The figure shows the DAGSCC for the target loop in 181.mcf. Each SCC is labeled with the number of instructions it contains. Each left-to-right line crossing the DAGSCC illustrates one possible way of partitioning it into two threads. For each possible partitioning, the figure also illustrates the resulting speedup, the corresponding synchronization array occupancy for a sample period and the cumulative cycle distribution at each possible occupancy level. An occupancy of negative one means the corresponding queue is empty and…

DSWP - Results

Page 31

[Figure 9: Performance compatibility and sensitivity analyses, full-width Itanium 2 baseline. (a) Varying issue-widths: speedups, 0.50-1.50, of Half-width Base, Half-width DSWP, and Full-width DSWP per benchmark and GeoMean. (b) Varying communication latencies: speedups, 1.0-1.6, of DSWP with 1-, 10-, and 100-cycle latencies per benchmark and GeoMean.]

DSWP over half-width ST is greater than the speedup of full-width DSWP over full-width ST. This occurs because DSWP sometimes trades ILP for TLP. Thus, the simpler and less powerful a core is, the more pronounced the benefits of DSWP are.

4.4 Sensitivity Analyses: Communication Latency and Queue Size

In order to quantify the importance of communication latency for DSWP, we ran our experiments again with the full-width CMP model, modified to have communication latencies of 10 and 100 cycles between the two cores. We modeled this pipelined delay in the produce instructions, while consume instructions continued to take one cycle (representing queue locality at the receiving side). The results are presented in Figure 9(b) and they show that DSWP is not very sensitive to the communication latency. In fact, this was expected due to the design of the DSWP transformation, as discussed in Section 2.

We also evaluated the impact of queue size on performance by varying the queue size to 8 and 128. We found that DSWP executions are fairly insensitive to queue size, with the mean slowdown with size 8 being 2% and the average speedup with size 128 being 1% compared to 32-element queues. The highest slowdown was 6% and the peak speedup was 7% respectively.

5 Detailed Analysis

As already mentioned, DSWP is not applicable in cases where the loop dependence graph has a single SCC. Additionally, even with multiple SCCs, DSWP does not have good partitioning opportunities if there exist large SCCs that create a thread imbalance. In our experiments, we found that DSWP is generally applicable in spite of this requirement. Nevertheless, a few cases were encountered where this restriction was limiting. This section describes these cases and highlights opportunities to mitigate or eliminate the problem.

(1) for (i = 0; i < x_size*y_size; i++) {
(2)     dtemp = result[i] / scale_factor;
(3)     if (dtemp < 0) result[i] = 0;
(4)     else if (dtemp > 255) result[i] = 255;
(5)     else result[i] = (int) (dtemp + 0.5);
    }

Figure 10. Important loop for epicdec.

5.1 Case Study: epicdec loop

Despite epicdec's simplicity, it illustrates the potential of having a more accurate memory analysis for DSWP. The source code of the loop in question is shown in Figure 10. In the IMPACT compiler, this loop is unrolled once. During this process, we notice that both memory loads of the result array (line (2)) become memory dependent on all the stores. Because of this, the resulting dependence graph has only 4 SCCs, with all the loads and stores as part of a single SCC. Despite this, DSWP achieves a speedup. Closer inspection revealed that these were false memory dependences, conservatively inserted by earlier optimizations. A solution proposed in [10] addresses this problem by performing accurate memory analysis at the assembly level.

Better scheduling and partitioning resulting from removal of these false memory dependences gave the DSWP and the base codes a 98% and an 87% speedup over the original base code. We then measured the average instructions per cycle (IPC) for both new versions, and verified that the base had an average of 2.33, while the averages for the DSWP threads were 1.26 and 1.37 respectively. This showed us that DSWP was trading ILP for TLP, and that additional ILP optimizations could potentially improve the results. We then applied more aggressive unrolling (8x), and recompiled both versions once again. The new DSWP'ed version resulted in a 45% speedup relative to the new base version by better utilizing the dual-core resources.

5.2 Case Study: adpcmdec loop

Lack of predicate-aware dependence analysis for DSWP caused spurious dependences to be inserted in our dependence…

DSWP - Results

Page 32

[Figure: speedup charts for the full-width and half-width machine configurations.]

• DSWP is better regardless of issue width

• Half-width DSWP beats the full-width single-threaded baseline

• The gain of half-width DSWP over the half-width base exceeds that of full-width DSWP over the full-width base

• DSWP trades ILP for TLP (the difference is larger on simpler cores)

DSWP - Varying width

Page 33

Spec-DSWP

[Figure: dependence graph forming a single SCC; speculated dependences are crossed out (X), splitting the SCC.]

• Only a single SCC

• 164.gzip: 38% speedup with 3 threads

• 181.mcf: 2.9x speedup with 4 threads

• Misspeculation detected

Page 34

PS-DSWP

• Scalability and DOALL

• More cores

• Stage replication

• Pipeline shapes: s - p; s - p - s; p - s (s = sequential stage, p = parallel stage; a sketch follows)
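A minimal "s - p" sketch, under assumptions of mine: POSIX threads, one mutex-guarded queue in place of the per-stage queues, and a queue capacity larger than the item count so the full-check can be omitted. The sequential stage traverses the list (the part with the loop-carried dependence), while the replicated parallel stage processes nodes in any order:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define WORKERS 3
#define QCAP 4096                    /* exceeds the item count in this sketch */

typedef struct node { int val; struct node *next; } node;

static node *q[QCAP];
static int qhead = 0, qtail = 0, qdone = 0;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

/* Sequential stage s: only this thread follows the list's pointers. */
static void *seq_stage(void *arg) {
    for (node *p = arg; p; p = p->next) {
        pthread_mutex_lock(&mu);
        q[qtail++ % QCAP] = p;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&mu);
    }
    pthread_mutex_lock(&mu);
    qdone = 1;
    pthread_cond_broadcast(&nonempty);
    pthread_mutex_unlock(&mu);
    return NULL;
}

/* Replicated parallel stage p: the per-node work has no loop-carried
   dependences, so any replica may take any node. */
static void *par_stage(void *arg) {
    long local = 0;
    for (;;) {
        pthread_mutex_lock(&mu);
        while (qhead == qtail && !qdone)
            pthread_cond_wait(&nonempty, &mu);
        if (qhead == qtail) { pthread_mutex_unlock(&mu); break; }
        node *p = q[qhead++ % QCAP];
        pthread_mutex_unlock(&mu);
        local += (long)p->val * p->val;        /* independent per-node work */
    }
    __sync_fetch_and_add((long *)arg, local);  /* tiny sequential reduction */
    return NULL;
}

int main(void) {
    node *list = NULL;
    for (int i = 1; i <= 1000; i++) {          /* illustrative input list */
        node *n = malloc(sizeof *n);
        n->val = i; n->next = list; list = n;
    }
    long total = 0;
    pthread_t s, w[WORKERS];
    pthread_create(&s, NULL, seq_stage, list);
    for (int i = 0; i < WORKERS; i++)
        pthread_create(&w[i], NULL, par_stage, &total);
    pthread_join(s, NULL);
    for (int i = 0; i < WORKERS; i++)
        pthread_join(w[i], NULL);
    printf("total = %ld\n", total);
    return 0;
}

Adding replicas of the parallel stage is how PS-DSWP scales beyond DSWP's fixed pipeline depth.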

Page 35

PS-DSWP

[Figure: PS-DSWP schedule on five cores over cycles 0-9: Core 1 runs the sequential stage A (A1-A9); Cores 2-4 run the replicated parallel stage B round-robin (B1/B4/B7, B2/B5, B3/B6); Core 5 runs the sequential stage C (C1-C6).]

Page 36

PS-DSWP - Results

Benchmark   Description                             Loop                       Hotness  Pipeline stages
ks          Kernighan-Lin graph partitioning        FindMaxGpAndSwap (outer)   98%      s → p
otter       theorem prover for first-order logic   find_lightest_geo_child    15%      s → p → s
300.twolf   placement and routing                   new_dbox_a (outer)         30%      s → p
456.hmmer   hidden markov model                     P7_Viterbi (inner)         85%      p → s
458.sjeng   chess program                           std_eval                   26%      s → p

Table 2: Benchmark Details

[Figure 7: Speedups over single-threaded execution, 1.0-3.0, of DSWP vs. PS-DSWP as the number of threads varies from 3 to 6, for (a) ks, (b) otter, (c) 300.twolf, (d) 458.sjeng, and (e) 456.hmmer; (f) effects of chunking on 456.hmmer with 4 threads (CS1, CS1-Perf, CS32, CS32-Perf).]

6. RELATED WORK

Several automatic techniques have been proposed to extract thread-level parallelism. The techniques closest to PS-DSWP are DSWP [13, 16] and loop distribution [11]. As discussed earlier, PS-DSWP is an extension of DSWP to also exploit iteration-level parallelism. PS-DSWP is similar to loop distribution in the sense that both techniques isolate the parts of a loop with loop-carried dependences from the parts without these dependences. The main difference between PS-DSWP and loop distribution combined with DOALL is that PS-DSWP allows the execution of the sequential part of the loop to overlap with the DOALL part, using pipeline parallelism. Moreover, loop distribution, as proposed in the literature, does not permit arbitrary control flow within the region. Finally, the use of the synchronization array in PS-DSWP allows it to operate on uncounted loops.

DOPIPE [7, 14] is another technique that exploits pipeline parallelism. Unlike DSWP, DOPIPE does not handle loops with control flow inside. Davies [7] also proposes executing multiple copies of a DOPIPE stage if that stage contains an inner DOALL loop. The PS-DSWP technique is more general because it can create parallel stages without requiring the operations to be inside an inner DOALL loop.

DOACROSS [5] is another non-speculative technique that can be used to extract iteration-level parallelism on loops with loop-carried dependences. DOACROSS does not permit control flow within the loop. Moreover, DOACROSS does not result in a pipeline of stages. The inter-thread dependences in DOACROSS form a cycle and would be the bottleneck as the latency of inter-core communication increases. Similar to DSWP, inter-core communication latency has negligible impact on the performance of PS-DSWP.

While PS-DSWP is a non-speculative technique, several speculative techniques have been proposed to extract iteration-level parallelism. Thread-level speculation (TLS) [10, 12, 19, 20] techniques speculatively execute consecutive iterations of a loop concurrently, assuming that the dependences between those iterations are infrequent. Inter-iteration dependences that manifest frequently are synchronized. Other speculative parallelization techniques such as the LRPD test [17], the R-LRPD test [6] and master/slave speculative parallelization [24] have also been proposed to speculatively extract iteration-level parallelism.

Speculative DSWP [23] adds speculation support to DSWP. The addition of speculation can increase the number of pipeline stages…

• Perfect cache, chunk size 32, and chunk size 32 with a perfect cache

• Chunking and loss of parallelism

Page 37

Spec-PS-DSWP

• Current architectures

• Speculation on memory access locations

• Data speculation

• Versioning

• Control speculation

• Profile

Page 38

Spec-PS-DSWP

[Figure 9: Scaling of application performance on machine M2: parallel stage threads are executed on up to 23 cores, after leaving cores free for execution of sequential stages. Speedup, 0-22, vs. number of parallel stage threads, 1-23, for 130.li, 164.gzip, 197.parser, 256.bzip2, 456.hmmer, crc32, blackscholes, gimp-oilify-range-check, and gimp-nova-range-check.]

…are several global data structures that are modified inside an iteration, but they are reset at the end of the iteration to the same values that they had at the beginning of that iteration. The value speculation described in Section 4.2 is used to break the dependences arising out of the stores to these data structures. Also, some branches that are taken under error conditions are speculated to not be taken. The loop speedup is affected by the variability in sentence length and is mainly limited by the number of sentences to parse.

256.bzip2 performs data compression using the Burrows-Wheeler transform. Again, only the compression part is parallelized. The parallelization is similar to that of 164.gzip. In both cases, a variable-sized block array (produced by applying Run-Length Encoding to the input data) is the data structure used to store the blocks on which compression happens. Since the size of this array is unknown statically, very complicated whole-program analyses would be required for static privatization of the data structure. Memory versioning allows the dynamic privatization of these data structures, thus enabling parallelization.

456.hmmer is a computational biology application that searches for patterns in DNA sequences using profile Hidden Markov Models (HMMs). The loop in main_loop_serial is parallelized. Scores are calculated in parallel on sequences which are randomly selected. The Commutative annotation is used to break the dependence inside the random number generator. Following this parallel stage is a sequential stage in which a histogram is computed and the maximum score selected by using max-reduction. The speedup is limited by this sequential phase.

crc32 computes the 32-bit CRC of files specified on the command line. Very little speculation is needed: error conditions that occur during the computation are speculated to not occur. The speedup is limited by the number of files and the variability in the file sizes.

blackscholes is an Intel RMS benchmark. It calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation. As in crc32, the only speculation needed is that an error will not occur in the pricing. The speedup is limited only by the number of options to price.

5.3.3 An in-depth look at GIMP

gimp-oilify and gimp-nova are artistic transformation filters that are part of GIMP, the GNU Image Manipulation Program. The oilify filter makes an image look like an oil painting. The nova filter inserts a big star and associated light effects in the image. Rows are processed in parallel.

Figure 10 shows the speedup of gimp using Spec-DOALL on machine M1. The try-commit unit based checking algorithm scales up to 5 worker threads. Beyond that, the try-commit unit becomes the bottleneck. There is not enough computation happening in the worker threads in parallel with the replay of loads and stores in the try-commit unit. Micro-architectural profiling reveals that the try-commit unit spends a significant fraction of its execution time stalling due to the number of load-store instructions in the processor pipeline reaching the limit the processor can handle. This overhead is so much on machine M2 that there is no speedup to be gained. Parallelizing the execution of the…

[Figure 10: Overhead of the checking algorithm on M1: speedup, 0-8, vs. number of parallel stage threads, 1-8, for gimp-oilify-mt-nocheck, gimp-oilify-trycommit-check, and gimp-oilify-range-check.]

Page 39

Spec-PS-DSWP

When misspeculation (inter-transaction conflict) is detected, the commit process is notified of this condition. It then takes the following actions:

1. The relevant pages are marked as COW.

2. For each worker, the page table entries of the commit process are copied into the worker's page tables.

3. The TLB of the core on which the worker is executing is flushed so that stale virtual-to-physical address mappings are not reused.

Loads in subsequently executed MTXs will transparently read committed, non-speculative values. Figure 5(b) illustrates this memory rollback operation. Observe that rollback does not involve any costly application-data copying operations. The speculative data is simply discarded. Comparing with Figure 5(a), the page tables have been remapped and the private pages have been discarded.

4. Loop Parallelization With SMTX

The SMTX system developed in the preceding section is implemented as a library. Table 1 presents the interface in two sections. The first section consists of one-time setup and finalization primitives. The second section consists of primitives that are invoked during the execution of SMTXs. The library may be used in conjunction with many parallelization schemes; its use with one scheme is described below.

4.1 Parallelization Scheme

Recent work has shown that there is significant loop-level parallelism in the outermost loops of applications [7, 29, 30, 33]. In most cases, the loop body is decomposed into three phases; dynamic instances of each phase are called tasks. Each phase exhibits a different dependence pattern (ignoring the speculated dependences). The first phase depends only on prior tasks of the same phase. The second phase depends only on the corresponding task of the first phase. The third phase depends on the corresponding task of the second phase and prior tasks of itself. Such a partitioning of the loop body is shown in Figure 3(a); statements B, C, and D constitute the three phases.

Bridges et al. and Thies et al. present a parallelization scheme which incorporates pipelining into the basic loop partitioning discussed above [7, 29]. By inserting decoupling buffers in between the stages of the pipeline, the different stages could be executing simultaneously on different iterations. This technique is called Decoupled Software Pipelining (DSWP) [21]. This base technique is extended with speculation to break dependence recurrences to create a long pipeline in Spec-DSWP [32]. Speculation is further used to break inter-iteration dependences to expose data-level parallelism, allowing the replication of pipeline stages with no loop-carried dependences. This is called Speculative Parallel-Stage DSWP (Spec-PS-DSWP) [8]. Figure 6 shows this execution model. As described in Section 2, this pipelined parallelization has benefits over TLS techniques. Thus, Spec-PS-DSWP is used as the general parallelization strategy. Note that if the pipeline consists of just one stage and it is a speculatively parallel (DOALL) stage, the execution model degenerates to Speculative DOALL [18, 33].

Loop parallelization is done as follows: the speculative part of the loop is wrapped in an MTX, with each iteration split into multiple subTXs (shown in Figure 6 as similarly shaded ovals). These subTXs are executed by the stages of the Spec-PS-DSWP pipeline. To orchestrate speculative execution, Spec-PS-DSWP relies on a commit stage (Figure 6, Commit Stage). The functionality needed in the commit stage is already encapsulated by the commit unit's functionality in the SMTX system described in the previous section.

[Figure 6: Pipelined execution with stage replication on five cores: Stage 1 runs B.1-B.6, the replicated Stage 2 runs C.1/C.3/C.5 and C.2/C.4 on two cores, Stage 3 runs D.1-D.3, and a commit stage runs commit.1 and commit.2. A dashed horizontal line indicates the time at which all subTXs in an MTX have finished and the MTX is ready to commit.]

 1  status CONFLICT_DETECTOR(version) {
 2    for (each worker that entered this version) {
 3      do {
 4        status = ver_checkForConflict(worker);
 5      } while (status == CONTINUE);
 6      if (status == MISSPEC)
 7        return CONFLICT;
 8    }
 9    return NO_CONFLICT;
10  }
11
12  status ver_checkForConflict(worker) {
13    token = sq_consume(q[worker]);
14    value = sq_consume(q[worker]);
15    if (token) {
16      addr = mask1(token);
17      isRead = mask2(token);
18      if (isRead) {
19        if (*addr != value) return MISSPEC;
20      } else {
21        *addr = value;
22      }
23      return CONTINUE;
24    }
25    return BREAK;
26  }

Figure 7: A value-based conflict detector is used to detect memory conflicts between transactions.

4.2 Conflict Detection

In the context of loops, there are flow, anti, or output dependences between the loop iterations. The use of memory versioning ensures that MTXs write to different physical versions of the same logical memory address. This removes all false (anti and output) dependences without code transformations.

Loads and stores that may alias forward the ⟨addr, value⟩ tuple by calling ver_read and ver_writeTo (Table 1). The conflict detector does a sequential replay of this memory trace in its private memory to detect whether flow-dependence violations have occurred (Figure 7: lines 2-8). For example, let a store operation in MTX1 store value v1 in location l, and a load operation in MTX2…
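A self-contained sketch of this value-based check; the trace format and names are mine, not the SMTX API. The detector replays a logged ⟨addr, value, isRead⟩ trace against private shadow memory: a logged read whose value differs from what the replayed stores produce means a speculative load saw a stale value:

#include <stdio.h>
#include <string.h>

#define MEM 64

typedef struct { int addr; int value; int isRead; } entry;  /* one logged access */

static int shadow[MEM];   /* detector's private replay memory */

/* Replay the trace in commit order; flag misspeculation when a logged
   read does not match the value the committed stores would produce. */
static int replay(const entry *trace, int n) {
    for (int i = 0; i < n; i++) {
        if (trace[i].isRead) {
            if (shadow[trace[i].addr] != trace[i].value)
                return 1;                       /* flow-dependence violation */
        } else {
            shadow[trace[i].addr] = trace[i].value;
        }
    }
    return 0;
}

int main(void) {
    /* MTX1 stores 7 to location 3, but MTX2's speculative load of
       location 3 observed 0: value-based replay catches the conflict. */
    entry trace[] = {
        { 3, 7, 0 },    /* MTX1: store */
        { 3, 0, 1 },    /* MTX2: load that saw a stale value */
    };
    memset(shadow, 0, sizeof shadow);
    printf("%s\n", replay(trace, 2) ? "MISSPEC" : "OK");
    return 0;
}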

Page 40

Spec-PS-DSWP

[Figure 4: Performance scalability of Spec-DSWP and TLS using DSMTX; speedup, 0x-120x, vs. number of cores, 8-128. Panels: (a) 052.alvinn and (k) swaptions compare Spec-DOALL vs. TLS; (b) 130.li, (i) blackscholes, and (j) crc32 compare DSWP+[Spec-DOALL,S] vs. TLS; (c) 164.gzip, (d) 179.art, (e) 197.parser, and (f) 256.bzip2 compare Spec-DSWP+[S,DOALL,S] vs. TLS; (g) 456.hmmer and (h) 464.h264ref compare Spec-DSWP+[DOALL,S] vs. TLS; (l) Geomean compares Spec-DSWP, TLS, and DSMTX Best.]

Page 41

Spec-PS-DSWP

[Figure 4 repeated from the previous page.]

Page 42

Conclusions

• DOALL

• DSWP and PS-DSWP for a few cores

• Many cores:

• Rewriting

• Programmer assistance

• Speculation

Page 43

References

• Ottoni, G.; Rangan, R.; Stoler, A.; August, D. I. Automatic Thread Extraction with Decoupled Software Pipelining. Proc. of the 38th MICRO, 2005.

• Raman, E.; Ottoni, G.; Raman, A.; Bridges, M. J.; August, D. I. Parallel-Stage Decoupled Software Pipelining. Proc. of CGO, 2008.

• Raman, A.; Kim, H.; Mason, T. R.; Jablin, T. B.; August, D. I. Speculative Parallelization Using Software Multi-threaded Transactions. Proc. of ASPLOS, 2010.

• Kim, H.; Raman, A.; Liu, F.; Lee, J. W.; August, D. I. Scalable Speculative Parallelization on Commodity Clusters. Proc. of the 43rd MICRO, 2010.