Upload
leonardo-pitta
View
215
Download
0
Embed Size (px)
Citation preview
8/20/2019 Estruturas de dados II LISCH EISCH
1/38
9/24/2014 1
Gazi University
Computer Engineering Department
BM307 –
File Organization
8/20/2019 Estruturas de dados II LISCH EISCH
2/38
9/24/2014 2
Index
Sequential File Organization
Binary Search
Interpolation Search
Self-Organizing Sequential Search
Direct File Organization Locating Information
Hashing Functions Collision Resolution
Coalesced Hashing
8/20/2019 Estruturas de dados II LISCH EISCH
3/38
9/24/2014 3
File Organization
Goal Organizing files efficiently in terms of both
space and performance
File Organization File Access
sequential sequential
indexed sequential sequential & direct
direct direct (random)
8/20/2019 Estruturas de dados II LISCH EISCH
4/38
9/24/2014 4
File Access Types
Sequential – accessing multiple records (often an entirefile) and usually according to a predefined order
Direct (random) – locating a single record
Question How can we have an effective organization?
Answer matching the type of organization with the
type of intended access
8/20/2019 Estruturas de dados II LISCH EISCH
5/38
9/24/2014 5
Sequential File Organization
Background
Fields (eg.: Employee name, number) Records contain data about individual entities
Files (eg.: employee list)
Primary Key field(s) which uniquely distinguishes a record
from all others
Secondary Key all the remaining fields
8/20/2019 Estruturas de dados II LISCH EISCH
6/38
9/24/2014 6
Sequential File Organization
File consists of records of the same format
Fixed-length records
Variable-length records
Sequential File Organization (i+1)st element of a file
is stored immediately after the ith element.
8/20/2019 Estruturas de dados II LISCH EISCH
7/38
9/24/2014 7
Sequential File Organization
Sequential access moving from one record in the
file to the next by incrementing the address of the
current record by the record size
Direct access processing a single record directly if
we know subscript
8/20/2019 Estruturas de dados II LISCH EISCH
8/38
9/24/2014 8
Sequential File Organization
Probe access to a distinct location
Sequential Search
In an entire file of N records
N/2 probes are needed in average Need to probe entire file for an unseccessful retrieval
Computational complexity O(N)
Appropriate when N is small
Performance improvement? Sorting
8/20/2019 Estruturas de dados II LISCH EISCH
9/38
9/24/2014 9
Eg. - Sequential Search
100000 records, each record size is 400 bytes,block size is 2400 bytes.
Sequential search time for retrieving 10000 records?
Each probe one block of data (100000*400)/2400 = 16667 blocks
Reading time for one block 0.84ms (IBM 3380)
Time requirement for each record
(16667/2)*0.84 = 7 sec. For 10000 records 7sec * 10000 = 19 hours
Better organization is needed!!
8/20/2019 Estruturas de dados II LISCH EISCH
10/38
9/24/2014 10
Sequential File Organization
Binary Search
Requires sorting
Compares the key of the sought record with the middle record of the file
Half of the file is eliminated in each turn
Computational complexity O(log2n)
Eg. the key of the sought record 17
8/20/2019 Estruturas de dados II LISCH EISCH
11/38
9/24/2014 11
Sequential File OrganizationBinary Search (Algoritma)
8/20/2019 Estruturas de dados II LISCH EISCH
12/38
9/24/2014 12
Sequential File OrganizationInterpolation Search
Approximate relative position
Eg.: Searching a name in a telephone book
Choses the next position for a comparison based upon the
estimated position of the sought keyrelative to the remainder
of the file to be searched
NEXT := LOWER + ––––––––––––––––––––––––––––––––––––––––– (UPPER-LOWER)
Worst case computational complexity O(n) Average case computational complexity O(log2 log2n)
Its performance improves as the distribution of keys becomes
more uniform
key[sought] – key [LOWER]
key[UPPER] – key [LOWER]
8/20/2019 Estruturas de dados II LISCH EISCH
13/38
binary search should be preferred when the data is stored in
primary memory
Why?
interpolation search should be preferred when the data is stored in
auxilary memory
Why?
9/24/2014 13
8/20/2019 Estruturas de dados II LISCH EISCH
14/38
binary search should be preferred when the data is stored in
primary memory
The additional calculations needed for the interpolation search cancel
any savings gained from fewer probes
interpolation search should be preferred when the data is stored in
auxilary memory
An access of auxiliary storage is an order of magnitude greater than
the time required for the additional calculations
9/24/2014 14
8/20/2019 Estruturas de dados II LISCH EISCH
15/38
9/24/2014 15
Sequential File OrganizationSelf-Organizing Sequential Search
Modifies the order of records
Moves the most frequently retrieved records to the beginning
of the file
Most popular algorithms:
Move_to_front
Transpose
Count
8/20/2019 Estruturas de dados II LISCH EISCH
16/38
9/24/2014 16
Sequential File Organization
Move_to_front
The sought record is moved to the front position of the file
Potential of making big mistakes if a record accessed , moved to the
front of the file, and then rarely if ever accessed again!
A linked implementation is preferable even though it takes more storage
Appropriate when space is not limited and locality of access is important
Essentially the same as the LRU (least recently used) paging algorithm
used by operating systems
8/20/2019 Estruturas de dados II LISCH EISCH
17/38
9/24/2014 17
Eg. - Move_to_front
The records are accessed in the order of “fileediting”
a b c d e f g h i j k l m n o p r q s t v w y z
f a b c d e g h i j k l m n o p r q s t v w y z
i f a b c d e g h j k l m n o p r q s t v w y z l i f a b c d e g h j k m n o p r q s t v w y z
e l i f a b c d g h j k m n o p r q s t v w y z
e l i f a b c d g h j k m n o p r q s t v w y z
d e l i f a b c g h j k m n o p r q s t v w y z
i d e l f a b c g h j k m n o p r q s t v w y z t i d e l f a b c g h j k m n o p r q s v w y z
i t d e l f a b c g h j k m n o p r q s v w y z
n i t d e l f a b c g h j k m o p r q s v w y z
g n i t d e l f a b c h j k m o p r q s v w y z
8/20/2019 Estruturas de dados II LISCH EISCH
18/38
9/24/2014 18
Sequential File Organization
Transpose
Interchanges the sought record with its immediate predecessor
More stable than the Move_to_front algorithm
A record needs to be accessed many times before it is moved to the
front of the list
Easily implemented
Does not need additional space
Should be used when space is premium
8/20/2019 Estruturas de dados II LISCH EISCH
19/38
9/24/2014 19
Eg. - Transpose
The records are accessed in the order of “fileediting”
a b c d e f g h i j k l m n o p r q s t v w y z
a b c d f e g h i j k l m n o p r q s t v w y z
a b c d f e g i h j k l m n o p r q s t v w y z
a b c d e f g i h j k l m n o p r q s t v w y z
a b c e d f g i h j k l m n o p r q s t v w y z
a b c d e f g i h j k l m n o p r q s t v w y z
a b c d e f i g h j k l m n o p r q s t v w y z
a b c d e f i g h j k l m n o p r q t s v w y z
a b c d e i f g h j k l m n o p r q t s v w y z a b c d e i f g h j k l n m o p r q t s v w y z
a b c d e i g f h j k l n m o p r q t s v w y z
8/20/2019 Estruturas de dados II LISCH EISCH
20/38
9/24/2014 20
Sequential File Organization
Count
Keeps count of the number of accesses of each record
The file is always ordered in a decreasing order of frequency of
access
Requires extra sorage to keep the count
Use it only when the counts are needed for another purpose
8/20/2019 Estruturas de dados II LISCH EISCH
21/38
9/24/2014 21
Direct File Organization
Ideally, we want to go directly to the address where the record is stored
A key can be unique address one probe
More address space than needed
Eg.1 billion addresses for 300 million people
Key
space
Address
space
1 – 1
999-99-9999 999-99-9999
0 0
correspondence
8/20/2019 Estruturas de dados II LISCH EISCH
22/38
9/24/2014 22
Direct File Organization
Converting information into a unique address
Eg. : Airline reservation system
Flight numbers from 1 to 999
Days are numbered from 1 to 366
Flight number and day of the year could be concatenated to determine the
location
Location = flight number || day of the year, address range 001001-999366
(???367 - ???999 would not exist)
Location = day of the year || flight number , address range 001001-366999
8/20/2019 Estruturas de dados II LISCH EISCH
23/38
9/24/2014 23
Direct File Organization
The key converts to a probable address
If we remove most of the empty spaces in the address space, we have
lost the 1-1 correspondence btw keys & addresses
Hashing functions are used to map the wider range of key values into
the narrower range of address values
Hash (key) probable address
Initial probable address home address
Hashing function should
Evenly distribute the keys among the addresses
Executes efficiently
8/20/2019 Estruturas de dados II LISCH EISCH
24/38
9/24/2014 24
Direct File Organization
A collision occurs when two distinct keys map to the same
address
Hashing is then composed of two aspects;
The function
The collision resolution method
Key
space
Address
space
999-99-9999
1200
0
0
8/20/2019 Estruturas de dados II LISCH EISCH
25/38
Direct File Organization
Hashing Functions
9/24/2014 25
8/20/2019 Estruturas de dados II LISCH EISCH
26/38
9/24/2014 26
Direct File OrganizationHashing Functions
Squaring
Taking square of a key and then substringing or truncating a portion
of the result
Radix conversion
The key is considered to be in a base other than 10 ans is then
converted into a number in base 10
Eg.: Base 11 1234 = 1 * 113 + 2 * 112 + 3 * 111 + 4 * 110= 1331 + 242 + 33 + 4
= 1610
substringing or truncation could then be used
8/20/2019 Estruturas de dados II LISCH EISCH
27/38
9/24/2014 27
Direct File OrganizationHashing Functions
Polynomial hashing
The key is divided by a polynomial
f(information area) cyclic check bytes
Alphabetic keys
Alphabetic or alphanumeric key values can be input to a hashing
function if the values are interpreted as integers
8/20/2019 Estruturas de dados II LISCH EISCH
28/38
9/24/2014 28
Direct File OrganizationCollisions
For a given set of data, one hashing function may distribute the keysmore evenly over the address space than another
A hashing function that has a large number of collisions is said to
exhibit primary clustering
It is better to have a slightly more expensive hashing function for datathat need to be stored on auxiliary storage
Another method for reducing collisions is reducing the packing factor
number of records stored
total number of storage locationsPacking factor =
storage
c o l l i s i o n s
8/20/2019 Estruturas de dados II LISCH EISCH
29/38
9/24/2014 29
Direct File OrganizationCollision Resolution
Collision resolution with links
Collision resolution without links Static positioning of records
Dynamic positioning of records
Collision resolution with pseudolinks
8/20/2019 Estruturas de dados II LISCH EISCH
30/38
9/24/2014 30
Direct File Organization
Collision resolution with links
If multiple synonyms occur for aparticular home address, we forma chain of synonym records
Disadvantage extra storage isneeded
Collision resolution without links
We can use implied links byapplying a convention , or set ofrules for deciding where to gonext
A simple convention is to look atthe next location in memory
Advantage NO extra storage isneeded
8/20/2019 Estruturas de dados II LISCH EISCH
31/38
9/24/2014 31
Direct File OrganizationCoalesced Hashing
Occurs when we attempt to insert a record with a homeaddress that is already occupied by a record from a chain with
a different home address
The two chains with records having different home addresses
coalesce or grow together
X ,D, Y were inserted
8/20/2019 Estruturas de dados II LISCH EISCH
32/38
9/24/2014 32
Direct File OrganizationCoalesced Hashing
(Eg.) Hash (key) = key mod 11
27, 18, 29, 28, 39, 13, 16
42 & 17 added
Average # of probes 1.8
8/20/2019 Estruturas de dados II LISCH EISCH
33/38
9/24/2014 33
Direct File OrganizationCoalesced Hashing
Discussion Packing factor of the final table = 9/11 (82%)
One method of reducing coalescing is to reduce the packing factor
It would be advisable to place the most frequently accessed records early inthe insertion process
Deleting a record is complicated
If coalescing has occurred,
a simple deletion procedure is
to move a record later in the
probe chain into the position of
the deleted record
Final table after deleting 39 ---------->
8/20/2019 Estruturas de dados II LISCH EISCH
34/38
9/24/2014 34
Direct File OrganizationCoalesced Hashing
Variants Table organization (whether or not a seperate
overflow area is used)
The manner of linking a colliding item into a chain
The manner of choosing unoccupied locations
Table Organization
Table primary area + overflow area
Adres factor = (primary area ) / (total table size)
Best performance when the adres factor is 0.86
8/20/2019 Estruturas de dados II LISCH EISCH
35/38
9/24/2014 35
Direct File OrganizationCoalesced Hashing
Variants Late Insertion Standart Colesced Hashing (LISCH)
New records are inserted at the end ofa probe chain
Lack of a cellar
Late Insertion Coalesced Hashing (LICH)
Uses a cellar
Eg. Keys: 27, 18, 29, 28, 39, 13, 16, 42, 17
hashing function: key mod 7
Average # of probes 1.3
(It was 1.8 for LISCH) In general, for a 90 percent packing factor,
using a cellar will reduce the number of
probes by about 6 percent compared
with LISCH
8/20/2019 Estruturas de dados II LISCH EISCH
36/38
9/24/2014 36
Direct File OrganizationCoalesced Hashing
Variants
Early Insertion Standart Colesced Hashing (EISCH) İnserts a new record into a position on the probe chain immediately after the record
srored at its home address
İnsertion of the record with key 17 according to EISCH algorithm: Hash (key) = key mod 11
8/20/2019 Estruturas de dados II LISCH EISCH
37/38
9/24/2014 37
Direct File OrganizationCoalesced Hashing
Variants
Random Early Insertion Standart Colesced Hashing (REISCH) Choosing a random unoccupied location for the new insertion
Gives only a 1% improvement over EISCH
Random Late Insertion Standart Colesced Hashing (RLISCH)
Bidirectional Late Insertion Standart Colesced Hashing (BLISCH) Choosing the overflow location for a collision insertion by alternating the
selection between the top and bottom of the table
Bidirectional Early Insertion Standart Colesced Hashing (BEISCH)
8/20/2019 Estruturas de dados II LISCH EISCH
38/38
9/24/2014 38
Direct File OrganizationCoalesced Hashing
Comparison