Thursday, April 03, 2014

Using Partitioned Primary Index II

Sponsored by PRISE Ltd.
www.prisetools.com

How to choose partitioned primary index (PPI)

This post is an expansion of my PPI basics post.

What is the difference between NPPI and PPI?

  • NPPI: Non partitioned primary index
    The good old regular PI. The rows are distributed by HASHAMP(HASHBUCKET(HASHROW(PI))), and ordered by HASHROW(PI), nothing special
  • PPI: Partitioned primary index
    Distribution is the same, but the ordering is different: <PartitionID><HASHROW(PI)>. The <PartitionID> is a value stored in each row, taking 2 or 8 bytes (see below).
The only difference is the storage order of the records (and the 2/8 byte overhead).
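
As a minimal sketch (hypothetical table and column names, just for illustration), the two variants differ only in the PARTITION BY clause:

-- NPPI: rows on each AMP are ordered by HASHROW(PI) only
CREATE TABLE sales_nppi
(
  sales_id   INTEGER,
  sales_date DATE,
  amount     DECIMAL(12,2)
)
PRIMARY INDEX (sales_id);

-- PPI: same distribution, but rows are stored in <PartitionID><HASHROW(PI)> order
CREATE TABLE sales_ppi
(
  sales_id   INTEGER,
  sales_date DATE,
  amount     DECIMAL(12,2)
)
PRIMARY INDEX (sales_id)
PARTITION BY RANGE_N(sales_date BETWEEN DATE '2014-01-01' AND DATE '2014-12-31'
                     EACH INTERVAL '1' DAY);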

What is PPI good for?

The partitioning feature - like in many other databases - usually solves performance issues by eliminating needless work in specific situations.
  • SELECT
    Eligible "where conditions" result serious partition-elimination, which means that usually only a small fraction of the table should be scanned instead of the whole one.
  • INSERT
    Consider the storage order of NPPI tables: the records are in "hash" order, so if I insert a series of records into a Teradata table, they will land in widely scattered data blocks. If the table is big enough, my new, say, 1,000 records will go into ~1,000 different data blocks, which means 1,000 expensive "random writes". However, if those 1,000 records go into a PPI table and share the same PartitionID, they will very likely end up in far fewer than 1,000 data blocks. In real-life situations we often write to contiguous data blocks with the much cheaper "sequential write".
  • DELETE
    Same as INSERT
  • BACKUP
    Teradata allows archiving only one or more selected partitions, saving a lot of time and tape. Older data in transaction tables usually does not change, therefore it is unnecessary to back it up every time.
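
For illustration, two statements against the hypothetical sales_ppi table sketched above; both benefit from the date-based partitioning:

-- SELECT: only the partition of the given day is scanned (partition elimination)
SELECT sales_id, amount
FROM   sales_ppi
WHERE  sales_date = DATE '2014-04-03';

-- DELETE: touches only the data blocks of the affected (January) partitions
DELETE FROM sales_ppi
WHERE  sales_date < DATE '2014-02-01';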

"Good-to-know"s

Costs of partitioning

Like all good things in the world, partitioning has its trade-offs as well:
  • Extra 2/8 bytes of allocated storage space per record
    Depends on the maximal number of partitions. See the "Number of partitions" chapter.
  • Slower "SAMPLE" scans
    Proper random sampling is more complex, since the physical storage order correlates with the partitioning value.
  • Extra sort operations / Sliding window joins
    Joining to a table that has an NPPI, or a PPI with a not exactly matching definition, results in a preparatory "sort" step or leads to a "sliding window merge join", which is technically N x M merge joins between the partitions of TableA and TableB (a small illustration follows this list).
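
A hedged illustration of the join cost (hypothetical tables): both tables below are assumed to have PRIMARY INDEX (sales_id), but only sales_ppi is partitioned, so their storage orders differ:

-- The PPI side is ordered by <PartitionID><rowhash>, the NPPI side by rowhash only.
-- The optimizer therefore either sorts one side into a spool, or runs a
-- sliding-window merge join: one merge join per sales_ppi partition window.
SELECT d.item_code, s.amount
FROM   sales_ppi         s
JOIN   sales_detail_nppi d ON d.sales_id = s.sales_id;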

Number of partitions

How many partitions should I have?
How many partitions do I have?
What does an empty partition look like?
These are all interesting questions; let's analyze the Teradata implementation.

A partition is not an object; it is just a calculated (and stored) value in the record, which determines the physical storage order of the record. A partition does not allocate space: an "empty partition" simply means that no record exists with that partition's PartitionID, nothing else.
How many partitions do I have in the table? As many different PartitionID values as occur in the existing records, which depends on the occurring values of the partitioning column.
How many partitions can I have in the table? It depends on the table definition. One must use the RANGE_N or the CASE_N function to define the PartitionID calculation, and that definition unambiguously determines how many different PartitionID values may occur. Versions up to V13.10 allow 65,535 partitions; from V14.00 we can have as many as 9.2 quintillion (8-byte PartitionID). The table definition cannot be altered to switch between the 2- and 8-byte layouts.
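
Two hedged definition sketches (hypothetical tables) showing how the RANGE_N / CASE_N expression itself fixes the number of possible PartitionID values:

-- RANGE_N: 365 daily ranges + NO RANGE + UNKNOWN = 367 possible PartitionIDs
CREATE TABLE t_range
(
  id      INTEGER,
  load_dt DATE
)
PRIMARY INDEX (id)
PARTITION BY RANGE_N(load_dt BETWEEN DATE '2014-01-01' AND DATE '2014-12-31'
                     EACH INTERVAL '1' DAY, NO RANGE, UNKNOWN);

-- CASE_N: 3 cases + NO CASE + UNKNOWN = 5 possible PartitionIDs
CREATE TABLE t_case
(
  id     INTEGER,
  amount DECIMAL(12,2)
)
PRIMARY INDEX (id)
PARTITION BY CASE_N(amount <  100,
                    amount < 1000,
                    amount >= 1000,
                    NO CASE, UNKNOWN);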

What is the drawback of having many partitions? The sliding-window merge join. Consider including the partitioning column in the PI if possible (otherwise PI-based filtering causes as many accesses as there are partitions).
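
A hedged sketch of including the partitioning column in the PI (hypothetical table); a PI access that supplies both columns can then hit a single partition on a single AMP:

CREATE TABLE trx_b
(
  transaction_id BIGINT,
  trx_date       DATE,
  amount         DECIMAL(12,2)
)
PRIMARY INDEX (transaction_id, trx_date)
PARTITION BY RANGE_N(trx_date BETWEEN DATE '2010-01-01' AND DATE '2014-12-31'
                     EACH INTERVAL '1' DAY, NO RANGE);

-- Single-AMP, single-partition access, since both PI columns are given
SELECT *
FROM   trx_b
WHERE  transaction_id = 123456789
AND    trx_date       = DATE '2014-04-03';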

What happens with the out-of-range records?

We have the NO RANGE and NO CASE clauses in the PPI definition. They designate a catch-all partition: records whose partitioning value falls outside the defined ranges or cases land there. It can be a hidden trap: if you forget to maintain the date partition definition of a transaction table, then from a certain moment on all new records go into this single partition. The partition keeps fattening, and the queries somehow keep slowing down...
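
A hedged way to spot this trap is the built-in PARTITION column (hypothetical table name reused from above):

-- Row counts per PartitionID; a steadily growing count in the NO RANGE
-- partition signals that the date ranges should be extended
SELECT PARTITION, COUNT(*)
FROM   trx_b
GROUP BY 1
ORDER BY 1;

-- The usual remedy (hedged: release-dependent restrictions apply, especially
-- if the NO RANGE partition is already populated) is to extend the ranges:
-- ALTER TABLE trx_b MODIFY PRIMARY INDEX
--   ADD RANGE BETWEEN DATE '2015-01-01' AND DATE '2015-12-31' EACH INTERVAL '1' DAY;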

Multi level partitioning

This is a good trick. One can define partitioning "hierarchically", which is simply a "Cartesian product" of the partitions at each level; the result is a single PartitionID. In the case of 2-byte partitioning, the size of this "Cartesian product" must stay below 65,535.

What is remarkable in the Teradata implementation of multi-level PPI? Even if you filter only on lower-level partitioning key(s), partition elimination will happen. How? It calculates all possible combinations and produces the list of PartitionIDs to be scanned. Excellent.
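
A small hedged sketch (hypothetical table): the two levels multiply into the total number of possible PartitionIDs, yet a filter on the lower-level key alone still eliminates partitions:

CREATE TABLE sales_mlppi
(
  sales_id   INTEGER,
  sales_date DATE,
  region_id  INTEGER
)
PRIMARY INDEX (sales_id)
PARTITION BY (
  RANGE_N(sales_date BETWEEN DATE '2013-01-01' AND DATE '2014-12-31'
          EACH INTERVAL '1' MONTH, NO RANGE),              -- 25 values
  RANGE_N(region_id BETWEEN 1 AND 50 EACH 10, NO RANGE)    -- 6 values, 25 x 6 = 150 combinations
);

-- Filtering only on the lower-level key: Teradata enumerates every
-- <month, region-range> combination in which region_id = 7 can reside
SELECT COUNT(*) FROM sales_mlppi WHERE region_id = 7;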

Partitioning granularity

The next good question is: how fine should I define partitioning?
It depends. Basically I'd distinguish two main cases:
  • "Temporal" (date) partitioning
    The best partition size is the day. Most of the filtering is done on day level, and we have ~365 days a year, which is not too many partitions over a table's lifetime. If we partition in monthly units, the partition elimination ranges are coarser, while we still get 12 partitions a year, which can also be too many in the case of a PPI-NPPI join (a sketch follows after this list).
  • All others
    It really depends: on the goal and on the value demographics. It's good to align the partitioning with the filtering pattern (the frequent, relevant 'where' condition fragments).
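
For the date case, a hedged example of the daily grain over several years (hypothetical table); still comfortably below the 2-byte limit:

-- 8 years of daily partitions: ~2,922 ranges + NO RANGE, well under 65,535
CREATE TABLE transactions_daily
(
  trx_id   BIGINT,
  trx_date DATE
)
PRIMARY INDEX (trx_id)
PARTITION BY RANGE_N(trx_date BETWEEN DATE '2008-01-01' AND DATE '2015-12-31'
                     EACH INTERVAL '1' DAY, NO RANGE);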
Hope it helped. Please ask if something is missing or confusing.

Sponsored by PRISE Ltd.
www.prisetools.com

6 comments:

  1. Greetings Akos,
    Apologies for the split post to facilitate readability.
    1. Can you please explain what the sliding window merge join and the single window merge join are, and their differences.
    2. Request you to go through the following link and suggest the cause.
    http://forums.teradata.com/forum/database/row-key-based-merge-join-vs-rowhash-match-scan-sliding-window-join
    I have neither the query nor the data demographics for the above. I know it's hard to analyse without exact information; any prediction of the cause would help me.
    3. In case of a PPI with 10,000 partitions, if I insert data into only 1 partition, will that single partition alone be in the transient journal, or will the entire partition data be in the transient journal, causing much more impact on CPU?
    4. Say I declare an MLPPI with columns in the order a,b,c,d,e. If I query with where-clause columns b,d, will partition elimination take place? If so, what is the basis for the partitionid in the rowkey for the partitioned columns?
    -- Thanks, Cheeli

    1. Hi Cheeli, my answers:

      1. I think I cannot explain better than written here (pages 386 and 389):
      http://tunweb.teradata.ws/tunstudent/TeradataUserManuals/SQL_Reference_--_Request_And_Transaction_Processing.pdf

      2. It is really strange to me. I'd really need to have the system in front of me to reproduce the execution and examine it.
      The main thing is the radical drop in I/O, even though we have an additional sort step.
      Maybe the sorted table is small enough to fit into memory in a couple of slices (this is what the sliding-window stuff indicates), and that saves I/O operations. Theoretically a row-key based merge join should not read any data block more than once from either of the tables, but
      I would have to see the exact table definitions and data demographics.
      One guess: if divcd (and the other PPI columns) are not included in the PI, the matching records may reside in different blocks in both of the tables, therefore a data block may be read multiple times. In the second case they have been sorted into the proper order, and the multiple reads are eliminated. Again: I should see more details.

      3. The transient journal stores the "before images" of the records (only the rowid for newly inserted records) so that the
      "before-transaction" state of the table can be recovered in case of a rollback. Only the affected rows get into the journal; the number and granularity of partitions will not influence its size.

      4. Yes, definitely. If you filter on b and d, let's assume that B' ranges/cases of b and D' of d remain to be scanned after the filter.
      Let's have A,C,E possible number of ranges/cases in the a,c,e partition levels.
      You will get A x B' x C x D' x E number of partitions to be scanned after partition elimination.
      MLPPI creates the Cartesian product of the possible ranges/cases at each partitioning level, and associates a PartitionID with each possible combination.


  2. Following are the definitions of the indexes on the table.
    PRIMARY INDEX ( L_ORDERKEY )
    PARTITION BY (RANGE_N(l_shipdate BETWEEN DATE '1992-01-01' AND DATE '1998-12-01' EACH INTERVAL '1' MONTH, no range ),
    --range_n(l_quantity between 1.00 and 100.00 each 10.00),
    range_n(l_suppkey between 1 and 200 each 10, no range),
    range_n(L_RECEIPTDATE BETWEEN DATE '1992-01-01' AND DATE '1998-12-01' EACH INTERVAL '1' year, no range ))
    INDEX ( L_PARTKEY )
    INDEX ( L_SHIPDATE );

    I am confused about the number of partitions accessed by the following queries. Can you please give some insight on this?

    1. explain sel * from samples.itemppi_ppichk where l_shipdate='1995-01-01'
    EXCERPT FROM EXPLAIN PLAN:
    We do an all-AMPs RETRIEVE step from 168 partitions of
    samples.itemppi_ppichk with a condition of (
    "samples.itemppi_ppichk.L_SHIPDATE = DATE '1995-01-01'") into
    Spool 1 (group_amps), which is built locally on the AMPs. The
    input table will not be cached in memory, but it is eligible for
    synchronized scanning. The size of Spool 1 is estimated with low
    confidence to be 24 rows (3,216 bytes). The estimated time for
    this step is 0.03 seconds.

    2. explain sel * from samples.itemppi_ppichk where l_suppkey=10
    EXPLAIN PLAN:
    We do an all-AMPs RETRIEVE step from 680 partitions of
    samples.itemppi_ppichk with a condition of (
    "samples.itemppi_ppichk.L_SUPPKEY = 10") into Spool 1 (group_amps),
    which is built locally on the AMPs. The input table will not be
    cached in memory, but it is eligible for synchronized scanning.
    The size of Spool 1 is estimated with no confidence to be 5,989
    rows (802,526 bytes). The estimated time for this step is 0.10
    seconds.

    3. explain sel * from samples.itemppi_ppichk where l_suppkey=10 and L_RECEIPTDATE='1996-02-02'
    EXPLAIN PLAN:
    We do an all-AMPs RETRIEVE step from 85 partitions of
    samples.itemppi_ppichk with a condition of (
    "(samples.itemppi_ppichk.L_RECEIPTDATE = DATE '1996-02-02') AND
    (samples.itemppi_ppichk.L_SUPPKEY = 10)") into Spool 1
    (group_amps), which is built locally on the AMPs. The input table
    will not be cached in memory, but it is eligible for synchronized
    scanning. The size of Spool 1 is estimated with no confidence to
    be 4,492 rows (601,928 bytes). The estimated time for this step
    is 0.08 seconds.

    -- Thanks, Cheeli

    1. Hi Cheeli,

      First I answer this question, since it is simpler. You have 3-level partitioning, with the following possible numbers of partitions:
      L1: 84 (= 7 years * 12 months) + 1 (no range) = 85
      L2: 20 (that many suppkey ranges) + 1 (no range) = 21
      L3: 7 (number of years) + 1 (no range) = 8

      If you filter to one partition on a specific level, Teradata still has to scan all combinations of the remaining levels. For your cases:
      1. L1 (filtered) -> L1(1) x L2 x L3 = 1 x 21 x 8 = 168
      2. L2 filtered -> L1 x L2(1) x L3 = 85 x 1 x 8 = 680
      3. L2&L3 filtered -> L1 x L2(1) x L3(1) = 85 x 1 x 1 = 85

    2. So, no partition elimination?

  3. Again, thank you Akos
    -- Cheeli
