Programa de Verão LNCC 2015
Jornada Ciência de Dados
Fabio Porto ([email protected]) LNCC – CCC - DEXL Lab http://dexl.lncc.br
Management of Large Data Volumes – Part III
Data Partitioning
Intuition
Queries and workflow processes over partitioned data exhibit a processing time determined by the size of the largest partition.
[Figure: a query Q split into sub-queries Q', Q'', Q''' running over partitions of different sizes]
Image of a region of the sky
[Figure: sky-survey image (LIneA – HQOOP)]
Uniform distribution
Non-uniform distribution
Data Partitioning
- Definition: given a dataset D = {e1, e2, …, en}, define a distribution function f(ei) -> v, where v is an integer identifying a data-allocation site (e.g., a machine in the network)
- Considerations: the function should be sensitive to the workload W = {w1, w2, …, wn}
Partitioning vs. Workload
[Figure: queries whose data accesses are local to a partition vs. non-local]
Goal
- Minimize the volume of data accessed by applications in a parallel execution context:
  - maximizes access locality
  - ideally, each application would access its data locally (i.e., with no data transfer over the network)
Distribution functions
- Round-robin
- Histogram
- By key
- Workload-based
Round-Robin
- Given D = {e1, e2, …, en} and a set of nodes {n1, n2, …, nk}, where nj is a data-allocation node:
  - f(ei) -> nj, with j = i if i ≤ k, and j = i mod k otherwise
- Each element is assigned to a node according to its reading order
- Suitable when the application places no restriction on which subset of the data is processed at each node (see the sketch below)
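A minimal Python sketch of this distribution function (illustrative names; indices are 0-based, so the mod covers both cases of the formula above):

```python
def round_robin(items, num_nodes):
    """Assign the i-th element (0-based) to node i mod num_nodes."""
    placement = {j: [] for j in range(num_nodes)}
    for i, e in enumerate(items):
        placement[i % num_nodes].append(e)
    return placement

# Example: 7 elements over 3 nodes -> node sizes 3, 2, 2 (balanced by count)
print(round_robin(list("abcdefg"), 3))
```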
By Value Range
- The distribution function maps a value of the object to a known interval of the domain
- Let A = {1, 2, 3, …, k}:
  - P1 = [1, …, 1000]
  - P2 = [1001, …, 5000]
  - …
  - Pn = […, k]
- If ei(v) = 2000 then f(ei) = P2 (see the sketch below)
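A small sketch of the lookup matching the slide's example, using Python's bisect over the partitions' upper bounds (boundary values are the ones from the slide):

```python
import bisect

# Upper bounds of P1, P2, ... : P1 = [1..1000], P2 = [1001..5000]
UPPER_BOUNDS = [1000, 5000]

def range_partition(v):
    """Return the 1-based index of the partition Pi whose interval contains v."""
    return bisect.bisect_left(UPPER_BOUNDS, v) + 1

assert range_partition(2000) == 2   # f(ei) = P2, as in the slide
```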
By Value Range
- Useful when applications process datasets by intervals:
  - start date / end date
  - product sales value ranges
  - name ranges
- It is a special case of histograms
Value ranges (examples)
[Figures: a two-dimensional range-partitioning example and further range examples]
Non-uniform distribution
Distribution by non-uniform intervals
Equal-width histogram
Histograms: width vs. height
Using the equal-height histogram (a sketch of both variants follows below)
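To make the width vs. height contrast concrete, a small illustrative sketch (not from the slides) computing the bucket boundaries of both histogram kinds:

```python
import numpy as np

def equal_width_boundaries(values, k):
    """k buckets of identical width; skewed data piles up in a few buckets."""
    lo, hi = min(values), max(values)
    return [lo + i * (hi - lo) / k for i in range(1, k)]

def equal_height_boundaries(values, k):
    """k buckets holding ~len(values)/k items each: boundaries at quantiles,
    which is what makes equal-height histograms robust to skew."""
    return list(np.quantile(values, [i / k for i in range(1, k)]))
```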
Workload-aware partitioning
[Figure: relation R(a1,…,a9) partitioned into blocks B1, B2, …, Bm]
Problem statement
- Given:
  - a single-relation database R(a1,…,an), n ≈ 1000
  - an initial workload: a set of k queries W0 = {q1,…,qk}
  - m empty fixed-size blocks
- Assumptions:
  - accessing a block ≈ accessing all its tuples
  - new tuples and queries arrive periodically
  - no attribute is privileged
- Goal: minimize the total number of block accesses during query execution by:
  - optimally partitioning R's data into blocks
  - optimal query execution
  - adapting to the arrival of new data and queries
Overview of the solution
- Data partitioning: graph-based algorithm (see the sketch below)
  - Nodes: each data item (e.g., a tuple) is a node in the graph
  - Edges: an edge connects two data items if they are accessed by a common query
  - Edge weight: the number of queries that access both data items
  - Goal: partition the graph into m equal-size sub-graphs with minimum edge cut, using a min-cut algorithm
- Block explanation
  - Blocks are described in terms of queries: each block is assigned an explaining query Bi = vi(R)
- Query processing
  - Incoming queries are compared against the explaining queries
  - Matching blocks are selected (we haven't worked on that yet)
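As an illustration of the graph construction just described (names are ours, not the authors'), a minimal Python sketch that builds the weighted co-access graph from a workload:

```python
from collections import defaultdict
from itertools import combinations

def coaccess_graph(workload):
    """workload: iterable of queries, each given as the set of tuple ids it
    accesses. Returns {(u, v): w} where w is the number of queries that
    access both tuples u and v (the edge weight of the slides)."""
    weight = defaultdict(int)
    for accessed in workload:
        for u, v in combinations(sorted(accessed), 2):
            weight[(u, v)] += 1
    return dict(weight)
```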
Partitioning strategy [Schism: VLDB 2010]
[Figure sequence over a table with rows 1–7, one step per slide:]
- For each vertical fragment, we create a graph node for each row.
- We increment the arc weight whenever two rows are accessed together by a query (q1, q2, …), replaying the whole workload W = {q1,…,qn}.
- We execute a min-cut algorithm over the weighted graph (a sketch follows below).
- Each resulting partition is assigned a block (B1, B2), recorded in the catalog.
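A minimal sketch of the min-cut step over a toy workload we made up; NetworkX's Kernighan–Lin bisection stands in for the hypergraph partitioner (PaToH) actually used in the experiments:

```python
from itertools import combinations
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

# Toy workload over rows 1..7 (illustrative, not the slides' exact queries)
workload = [{1, 2}, {1, 2, 4}, {3, 5}, {5, 6, 7}, {3, 7}, {2, 4}]

G = nx.Graph()
for accessed in workload:
    for u, v in combinations(sorted(accessed), 2):
        w = G[u][v]["weight"] + 1 if G.has_edge(u, v) else 1
        G.add_edge(u, v, weight=w)

# Balanced 2-way split minimizing the weighted edge cut: rows that are
# co-accessed by many queries tend to land in the same block
B1, B2 = kernighan_lin_bisection(G, weight="weight")
print(sorted(B1), sorted(B2))   # e.g. [1, 2, 4] and [3, 5, 6, 7]
```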
Partitioned data with queries
[Figure: block B1 holds rows {1, 2, 4} and block B2 holds rows {3, 5, 6, 7}; the blocks are annotated with the query sets {q1,q2,…, q14} and {q3,q5,…, q13} that touch them]
- Each block is associated with the queries that access some records of the block
- For a given query q, the number of accessed blocks is minimized
Adaptive Strategy
- New tuple arrival [DEXA 2012]: select the best block, i.e., the block to which the new tuple is most correlated
- Challenges: how to select the best block with minimum effort?
  - Initial approach: find it based on the correlation of queries to blocks
    - define the optimal allocation
    - compute the actual allocation efficiency
    - compute block affinity (a sketch follows below)
  - What if the best block is full? Initial approach: split the block
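A rough sketch of the affinity idea only; all names here (`blocks`, `matches`, the workload layout) are hypothetical stand-ins, not the [DEXA 2012] algorithm itself:

```python
def best_block(new_tuple, blocks, workload, matches):
    """Pick the block whose past queries are most correlated with the new
    tuple. `blocks` maps a block id to its set of tuple ids; `workload` is
    a list of (predicate, accessed_tuples) pairs; `matches(pred, t)` says
    whether a query predicate would also select tuple t."""
    def affinity(block_tuples):
        # count queries that would select the new tuple AND already touch
        # this block: placing the tuple here keeps those queries one-block
        return sum(1 for pred, accessed in workload
                   if matches(pred, new_tuple) and accessed & block_tuples)
    return max(blocks, key=lambda b: affinity(blocks[b]))
```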
Allocation based on affinity to blocks
[Figure: elapsed time of incrementing the DB as its size increases — execution time (s), log scale from 0.1 to 1000, vs. DB size from 2M to 20M tuples, comparing staticDynPart (|D′| = 500k) with DynPart (|D′| = 1M)]
Experiment:
- data: Sloan DR8 – 350 million tuples
- workload: 27,000 synthetic queries
- partitioner: PaToH (hypergraph)
By Key
- Partitioning is fine, but how do we avoid a skewed distribution of data across nodes?
- Are there strategies to partition data across hundreds or thousands of nodes with balance guarantees?
- Lessons learned from the key-distribution strategies of P2P systems
P-Grid [Aberer04]
- Completely decentralized
- Uses a randomized algorithm to create the search structure
- Scales gracefully in the total number of nodes and the total number of data items
System model
- A set of peer nodes linked through neighbourhood links
- Each peer stores a subset of the keys
- A virtual binary tree is built over the key-value space
- Each peer keeps a branch of the index tree pointing to the nodes storing the keys the peer manages
- Additionally, at each level of the tree, pointers indicate the nodes keeping the complementary key set
- A search may start at any peer
Key management
- Each peer is responsible for an interval of keys:
  - a key is a bit string k = p1 p2 … pn, pi ∈ {0,1}, with val(k) = Σ_{i=1..n} 2^(-i)·p_i
  - I(k) = [val(k), val(k) + 2^(-n))
- For each prefix of length l of the keys the peer is responsible for, the peer maintains references to other peers that share the same prefix but have a different value at position l+1
- A binary tree orders the key space and points to the peers managing keys whose prefixes appear at the tree nodes
- Managing the tree is independent of managing keys
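In Python, the key-to-interval mapping reads as a direct transcription of the two formulas above:

```python
def val(k):
    """val(k) = sum_{i=1..n} 2^(-i) * p_i for a bit-string key k = p1 p2 ... pn."""
    return sum(int(p) * 2.0 ** -(i + 1) for i, p in enumerate(k))

def interval(k):
    """I(k) = [val(k), val(k) + 2^(-n)): all keys extending the prefix k."""
    return val(k), val(k) + 2.0 ** -len(k)

assert interval("00") == (0.0, 0.25)   # a peer with path 00 covers [0, 1/4)
```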
Example
[Figure: binary trie with prefixes 0/1 at level 1 and 00/01/10/11 at level 2; peers 1–6 sit at the leaves. The peer indexing keys starting with 00… keeps the complement pointers "0 -> peers 3, 4, 1" and "00 -> peer 2".]
Each node in the tree keeps pointers to peers holding the complementary key set: (p1 p2 … pk) -> (p1 p2 … p̄k), i.e., the same prefix with the last bit complemented.
P-Grid Construction
[Figure, cases a) and b):]
a) Initially each peer points to the whole key space (the root); when peers a and b meet, they split it: a takes prefix 0 and b takes prefix 1.
b) When two peers managing the same key set p1 p2 … pk meet, they extend their paths: a takes p1 p2 … pk 0 and b takes p1 p2 … pk 1.
P-Grid Construction (cont.)
[Figure, cases c) and d):]
c) One peer manages a longer path: if a holds p1 p2 … pk and b holds p1 p2 … pk pk+1 … pn, then a extends its path by one bit at position k+1 (the complement of b's bit), ending up with a(p1 p2 … pk pk+1) against b(p1 p2 … pk pk+1 … pn).
d) The keys the peers manage intersect only up to position k (a holds p1 p2 … pk pk+1 … pn, b holds p1 p2 … pk qk+1 … qm): each peer recursively forwards the other to its references at the diverging level, with ref1 = refs(qk+1) \ {b} and meet(ref1, {b}), and ref2 = refs(pk+1) \ {a} and meet(ref2, {a}).
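To make routing over the constructed paths concrete, a minimal sketch of a P-Grid-style search; the class layout and attribute names (`path`, `refs`, `store`) are our assumptions, not the paper's API:

```python
import random

class Peer:
    def __init__(self, path):
        self.path = path    # bit-string prefix this peer is responsible for
        self.refs = {}      # level -> peers covering the complementary bit
        self.store = {}     # locally stored key/value pairs

def search(peer, key):
    """Prefix routing: forward at the first bit where the key leaves this
    peer's path, using the complement references kept at that level."""
    for level, bit in enumerate(peer.path):
        if key[level] != bit:
            return search(random.choice(peer.refs[level]), key)
    return peer.store.get(key)   # key lies within this peer's subtree
```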
Distributed by Range
- Pgpool-II
  - Implemented on top of PostgreSQL 9.1
  - A central node coordinates data/query distribution and replication
  - Requests are distributed across nodes
  - Parallel query processing:
    - data partitioning based on a table column range (e.g., id)
    - for short queries, may reduce the amount of accessed data
  - Load balance: concurrent requests are directed to different DB copies
Parallelism & Load Balance
[Figure: a parallel-query Pgpool-II node in front of replication Pgpool-II nodes, each backed by PostgreSQL instances]
Evaluation
- Strengths:
  - extends PostgreSQL
  - load-balances queries from concurrent workflows
  - scales up to 128 DB nodes
- Weaknesses:
  - no support for spatial functions
  - partitioning based on a single column
  - ingestion cannot use COPY
Qserv – LSST
- Developed by the LSST DM team for astronomy data management
- Horizontal partitioning based on declination zones (nodes), with the data on each node further distributed into chunks based on right ascension (RA)
- Approximately 1000 partitions
- Native support for spatio-temporal functions
- Built on top of MySQL
Evaluation
- Strengths:
  - designed to support astronomy data surveys
  - highly scalable: ~1000 nodes
  - first performance results are very promising
  - aligned with the LSST project
- Weaknesses:
  - the current user culture is based on PostgreSQL
Final Remarks
- Managing large data volumes requires:
  - scalable strategies
  - a design focus that favors data locality
  - handling uncertainty
  - handling heterogeneity
  - data partitioning
  - optimization processes
Final Remarks
- Current analytical processing frameworks:
  - MapReduce (Hadoop, PivotalHD, Cloudera, …)
  - Apache Spark
  - Sagitari (CEFET-RJ) – available soon
  - HaQoop – under development at DEXL
    - http://dexl.lncc.br