Upload
angelique-beat
View
216
Download
2
Embed Size (px)
Citation preview
Implementação do DWImplementação do DW
SAD Tagus 2004/05 H. Galhardas
O problema e as O problema e as soluçõessoluções
Grandes quantidades de dados => Grandes quantidades de dados => Métodos de acessoMétodos de acesso e e processamento de interrogaçõesprocessamento de interrogações eficientes eficientes
Materialização de vistas para informação agregadaMaterialização de vistas para informação agregada: : Identificar, explorar e actualizar de forma eficienteIdentificar, explorar e actualizar de forma eficiente
Índices:Índices: Index intersection, bit map indices, join indicesIndex intersection, bit map indices, join indices
SAD Tagus 2004/05 H. Galhardas
Escolha dos agregadosEscolha dos agregados
ObjectivoObjectivo: Minimizar o tempo de resposta: Minimizar o tempo de resposta
PerguntaPergunta: Quais os agregados a materializar?: Quais os agregados a materializar?
FactoresFactores: : Espaço de armazenamento – custo baixo! Custo de actualização – custo alto!
CompromissoCompromisso::
Tempo de resposta vs custo de actualizaçãoTempo de resposta vs custo de actualização
SAD Tagus 2004/05 H. Galhardas
Cube OperationCube Operation Transform it into a SQL-like language (with a new operator Transform it into a SQL-like language (with a new operator
cube bycube by, introduced by Gray et al.’96), introduced by Gray et al.’96)SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year Need compute the following Group-BysNeed compute the following Group-Bys
(date, product, customer),(date,product),(date, customer), (product, customer),(date), (product), (customer)()
2^n cuboids
(item)(city)
()
(year)
(city, item) (city, year) (item, year)
(city, item, year)
SAD Tagus 2004/05 H. Galhardas
Goal: Goal: efficientefficient computation of aggregations across computation of aggregations across many sets of dimensionsmany sets of dimensions
Take into account Take into account amount of main memoryamount of main memory available and available and computation timecomputation time
Data cube can be viewed as a lattice of cuboids Data cube can be viewed as a lattice of cuboids
Materialization of data cubeMaterialization of data cube Materialize every (cuboid) (full materialization), none (no
materialization), or some (partial materialization)
Efficient Data Cube Efficient Data Cube ComputationComputation
)11(
n
i iLT
SAD Tagus 2004/05 H. Galhardas
Partial materialization of Partial materialization of cuboidscuboids
Three factors must be considered:Three factors must be considered:1. Identify the subset of cuboids to materialize
Take into account queries in the workload, their frequencies and accessing costs; cost of incremental updates, total storage requirements
Popular approaches: use heuristics or materialize cuboids based on other frequently referenced cuboids
2. Exploit the materialized cuboids during query processing
How to use indexes and how to transform OLAP operations onto the selected cuboids
3. Efficiently update the materialized cuboids during load and refresh
Explore parallelism and incremental update
SAD Tagus 2004/05 H. Galhardas
Classification of agg. facts Classification of agg. facts wrt the agg. functions wrt the agg. functions
DistributiveDistributive: if the result derived by applying the function to : if the result derived by applying the function to n n aggregate values is the same as that derived by applying aggregate values is the same as that derived by applying the function on all the data without partitioning.the function on all the data without partitioning.
Ex: count(), sum(), min(), max().
AlgebraicAlgebraic:: if it can be computed by an algebraic function with if it can be computed by an algebraic function with MM arguments (where arguments (where M M is a constant), each of which is is a constant), each of which is obtained by applying a distributive aggregate function.obtained by applying a distributive aggregate function.
Ex: avg(), min_N(), standard_deviation().
HolisticHolistic:: if there is no algebraic function with M arguments if there is no algebraic function with M arguments that characteriezes the computation.that characteriezes the computation.
Ex: median(), mode(), rank().
SAD Tagus 2004/05 H. Galhardas
Arquitecturas de servidor Arquitecturas de servidor OLAPOLAP
Relational OLAP (ROLAP)Relational OLAP (ROLAP) Usa SGBDs relacionais ou relacional extendido para armazenar e gerir os
dados do datawarehouse e usa middleware OLAP para suportar funcinalidades específicas do OLAP.
Inclui optimização suportada pelo SGBDR, implementa lógica de navegação de agregação e serviços/ferramentas adicionais
Maior escalabilidade
Multidimensional OLAP (MOLAP) Multidimensional OLAP (MOLAP) Motor de armazenamento multidimensional baseado em arrays (sparse
matrix techniques) Indexação rápida de dados sumarizados pré-calculados
Hybrid OLAP (HOLAP)Hybrid OLAP (HOLAP) Flexibilidade: baixo nível: relacional, alto nível: array
Specialized SQL serversSpecialized SQL servers Suporte especializado para interrogações SQL sobre esquemas em estrela e
floco de neve
SAD Tagus 2004/05 H. Galhardas
Some efficient cube Some efficient cube computation methodscomputation methods
ROLAP-based cubing algorithms (Agarwal et al’96)
Array-based cubing algorithm (Zhao et al’97)Bottom-up computation method (Bayer &
Ramarkrishnan’99)And others...
SAD Tagus 2004/05 H. Galhardas
ROLAP-Based Cubing ROLAP-Based Cubing AlgorithmAlgorithm
Value-based addressing: dimension values accessed by
key-based addressing strategies
Sorting, hashing, and grouping operations are applied to the
dimension attributes in order to reorder and cluster related
tuples
Grouping is performed on some subaggregates as a “partial
grouping step”
Aggregates may be computed from previously computed
aggregates, rather than from the base fact table
SAD Tagus 2004/05 H. Galhardas
Iceberg CubeIceberg Cube Computing only the cuboid cells whose count Computing only the cuboid cells whose count
or other aggregates satisfying the condition or other aggregates satisfying the condition likelike
HAVING COUNT(*) >= minsup
MotivationMotivation Only a small portion of cube cells may be “above the
water’’ in a sparse cube Only calculate “interesting” cells—data above certain
threshold Avoid explosive growth of the cube
Suppose 100 dimensions, only 1 base cell. How many aggregate cells if count >= 1? What about count >= 2?
SAD Tagus 2004/05 H. Galhardas
Multi-way Array Multi-way Array Aggregation for Cube Aggregation for Cube
ComputationComputation Direct array addressingDirect array addressing: dimension values accessed via the index of their corresponding array : dimension values accessed via the index of their corresponding array locationslocations
Partition arrays into Partition arrays into chunkschunks (a small subcube which fits in memory). (a small subcube which fits in memory). Compressed sparse array addressing: (chunk_id, offset)Compressed sparse array addressing: (chunk_id, offset) Compute aggregates in “multiway” Compute aggregates in “multiway” by visiting cube cellsby visiting cube cells in the order which minimizes the # of times to in the order which minimizes the # of times to
visit each cell, thereby reducing memory access and storage cost.visit each cell, thereby reducing memory access and storage cost.
What is the best traversing order to do multi-way aggregation?
A
B
29 30 31 32
1 2 3 4
5
9
13 14 15 16
6463626148474645
a1a0
c3c2
c1c 0
b3
b2
b1
b0
a2 a3
C
B
4428 56
4024 52
3620
60
SAD Tagus 2004/05 H. Galhardas
CuboidsCuboids
ABC – base cuboid – already computedABC – base cuboid – already computed AB, AC, BC – 2-D cuboidsAB, AC, BC – 2-D cuboids A, B, C – 1-D cuboidA, B, C – 1-D cuboid All – 0-D cuboidAll – 0-D cuboid
Many possible orderings w/ which chunks Many possible orderings w/ which chunks can be read into memorycan be read into memory
SAD Tagus 2004/05 H. Galhardas
Example (1)Example (1)
A
B
29 30 31 32
1 2 3 4
5
9
13 14 15 16
6463626148474645
a1a0
c3c2
c1c 0
b3
b2
b1
b0
a2 a3
C
4428 56
4024 52
3620
60
B
SAD Tagus 2004/05 H. Galhardas
Example (2)Example (2)
A
B
29 30 31 32
1 2 3 4
5
9
13 14 15 16
6463626148474645
a1a0
c3c2
c1c 0
b3
b2
b1
b0
a2 a3
C
4428 56
4024 52
3620
60
B
SAD Tagus 2004/05 H. Galhardas
Multi-Way Array Multi-Way Array Aggregation for Cube Aggregation for Cube Computation (Cont.)Computation (Cont.)
MethodMethod: the planes should be sorted and computed : the planes should be sorted and computed according to their size in ascending order.according to their size in ascending order. See the details of Example 2.12 (pp. 75-78) Idea: keep the smallest plane in the main memory, fetch and
compute only one chunk at a time for the largest plane
Limitation of the methodLimitation of the method: computing well only for a small : computing well only for a small number of dimensionsnumber of dimensions If there are a large number of dimensions, “bottom-up
computation” and iceberg cube computation methods can be explored
SAD Tagus 2004/05 H. Galhardas
Indexing OLAP Data: Indexing OLAP Data: Bitmap IndexBitmap Index
Index on a particular columnIndex on a particular column Each value in the column has a bit vectorEach value in the column has a bit vector: bit-op is fast: bit-op is fast The length of the bit vector: # of records in the base tableThe length of the bit vector: # of records in the base table The The i i-th bit is set if the -th bit is set if the i i-th row of the base table has the value for -th row of the base table has the value for
the indexed columnthe indexed column Not suitable for high cardinality domainsNot suitable for high cardinality domains
RecID Retail Dealer1 1 02 0 13 0 14 1 05 0 1
RecIDAsia Europe America1 1 0 02 0 1 03 1 0 04 0 0 15 0 1 0
Base table Index on Region Index on TypeCust Region TypeC1 Asia RetailC2 Europe DealerC3 Asia DealerC4 America RetailC5 Europe Dealer
SAD Tagus 2004/05 H. Galhardas
Indexing OLAP Data: Join Indexing OLAP Data: Join IndicesIndices
Join index:Join index: JI(R-id, S-id) where R (R-id, JI(R-id, S-id) where R (R-id, …) …) S (S-id, …) S (S-id, …)
Traditional indices map the values to a Traditional indices map the values to a list of record idslist of record ids It materializes relational join in JI file and
speeds up relational join — a rather costly operation
In data warehouses, join index relates In data warehouses, join index relates the values of the the values of the dimensionsdimensions of a star of a star schema to schema to rowsrows in the fact table. in the fact table.Ex: fact table: Sales and two dimensions city
and productA join index on city maintains for each distinct
city a list of R-IDs of the tuples recording the Sales in the city
SAD Tagus 2004/05 H. Galhardas
Example of join indexesExample of join indexes
SAD Tagus 2004/05 H. Galhardas
Efficient Processing of Efficient Processing of OLAP QueriesOLAP Queries
1.1. Determine Determine which operationswhich operations should be performed should be performed
on the available cuboids:on the available cuboids:
transform drill, roll, etc. into corresponding SQL and/or
OLAP operations, e.g, dice = selection + projection
2.2. Determine Determine to which materialized cuboid(s)to which materialized cuboid(s) the the
relevant operations should be applied.relevant operations should be applied.
3.3. Exploring indexing structures and compressed vs. Exploring indexing structures and compressed vs.
dense array structures in MOLAPdense array structures in MOLAP
SAD Tagus 2004/05 H. Galhardas
BibliografiaBibliografia
(Livro) (Livro) Data Mining: Concepts and TechniquesData Mining: Concepts and Techniques, J. , J. Han & M. Kamber, Morgan Kaufmann, 2001 Han & M. Kamber, Morgan Kaufmann, 2001 (Capítulo 2 no livro e Capítulo 3 nos drafts)(Capítulo 2 no livro e Capítulo 3 nos drafts)
(Artigo) Data Cube: A relational aggregation operator (Artigo) Data Cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals, J. generalizing group-by, cross-tab, and sub-totals, J. Gray et al, DMKD 1(1), 1997Gray et al, DMKD 1(1), 1997
(Artigo) On the computation of multidimensional (Artigo) On the computation of multidimensional aggregates, Agarwal et al, VLDB, 1996aggregates, Agarwal et al, VLDB, 1996