Thursday, April 03, 2014

Using Partitioned Primary Index II

Sponsored by PRISE Ltd.
www.prisetools.com

How to choose partitioned primary index (PPI)

This post is an expansion of my PPI basics post.

What is the difference between NPPI and PPI?

  • NPPI: Non partitioned primary index
    The good old regular PI. The rows are distributed by HASHAMP(HASHBUCKET(HASHROW(PI))), and ordered by HASHROW(PI), nothing special
  • PPI: Partitioned primary index
    Distribution is the same, but the ordering is different: <PartitionID><HASHROW(PI)>. The <PartitionID> is a value stored in each row, taking 2 or 8 bytes (see below).
The only difference is the storage order of the records (and the 2/8 byte overhead).
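
As a minimal sketch (hypothetical table and column names, just for illustration), the two variants differ only in the PARTITION BY clause:

-- NPPI: rows on each AMP are ordered by HASHROW(PI) only
CREATE TABLE sales_nppi
(
  sales_id   INTEGER,
  sales_date DATE,
  amount     DECIMAL(12,2)
)
PRIMARY INDEX (sales_id);

-- PPI: same distribution, but rows are stored in <PartitionID><HASHROW(PI)> order
CREATE TABLE sales_ppi
(
  sales_id   INTEGER,
  sales_date DATE,
  amount     DECIMAL(12,2)
)
PRIMARY INDEX (sales_id)
PARTITION BY RANGE_N(sales_date BETWEEN DATE '2014-01-01' AND DATE '2014-12-31'
                     EACH INTERVAL '1' DAY);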

What is PPI good for?

The partitioning feature - like in many other databases - usually solves performance issues by eliminating needless work in specific situations.
  • SELECT
    Eligible "where conditions" result serious partition-elimination, which means that usually only a small fraction of the table should be scanned instead of the whole one.
  • INSERT
    Consider the storage order of NPPI tables: the records are in "hash" order, so if I insert a series of records into a Teradata table, they will land in widely scattered data blocks. If the table is big enough, my new, say, 1,000 records will go into ~1,000 different data blocks, which means 1,000 expensive "random writes". However, if those 1,000 records go into a PPI table and share the same PartitionID, they will very likely end up in far fewer than 1,000 data blocks. In real-life situations we often write to contiguous data blocks with the much cheaper "sequential write".
  • DELETE
    Same as INSERT
  • BACKUP
    Teradata allows archiving only one or more selected partitions, saving a lot of time and tape. Older data in transaction tables usually does not change, therefore it is unnecessary to back it up every time.
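
For illustration, two statements against the hypothetical sales_ppi table sketched above; both benefit from the date-based partitioning:

-- SELECT: only the partition of the given day is scanned (partition elimination)
SELECT sales_id, amount
FROM   sales_ppi
WHERE  sales_date = DATE '2014-04-03';

-- DELETE: touches only the data blocks of the affected (January) partitions
DELETE FROM sales_ppi
WHERE  sales_date < DATE '2014-02-01';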

"Good-to-know"s

Costs of partitioning

Like all good things in the world, partitioning has its trade-offs as well:
  • Extra 2/8 bytes of allocated storage space per record
    Depends on the maximal number of partitions. See the "Number of partitions" chapter.
  • Slower "SAMPLE" scans
    Proper random sampling is more complex, since the physical storage order correlates with the partitioning value.
  • Extra sort operations / Sliding window joins
    Joining to a table that has an NPPI, or a PPI with a not exactly matching definition, results in a preparatory "sort" step or leads to a "sliding window merge join", which is technically N x M merge joins between the partitions of TableA and TableB (a small illustration follows this list).
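
A hedged illustration of the join cost (hypothetical tables): both tables below are assumed to have PRIMARY INDEX (sales_id), but only sales_ppi is partitioned, so their storage orders differ:

-- The PPI side is ordered by <PartitionID><rowhash>, the NPPI side by rowhash only.
-- The optimizer therefore either sorts one side into a spool, or runs a
-- sliding-window merge join: one merge join per sales_ppi partition window.
SELECT d.item_code, s.amount
FROM   sales_ppi         s
JOIN   sales_detail_nppi d ON d.sales_id = s.sales_id;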

Number of partitions

How many partitions should I have?
How many partitions do I have?
What does an empty partition look like?
These are all interesting questions; let's analyze the Teradata implementation.

A partition is not an object; it is just a calculated (and stored) value in the record, which determines the physical storage order of the record. A partition does not allocate space: an "empty partition" simply means that no record exists with that partition's PartitionID, nothing else.
How many partitions do I have in the table? As many different PartitionID values as occur in the existing records, which depends on the occurring values of the partitioning column.
How many partitions can I have in the table? It depends on the table definition. One must use the RANGE_N or the CASE_N function to define the PartitionID calculation, and that definition unambiguously determines how many different PartitionID values may occur. Versions up to V13.10 allow 65,535 partitions; from V14.00 we can have as many as 9.2 quintillion (8-byte PartitionID). The table definition cannot be altered to switch between the 2- and 8-byte layouts.
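
Two hedged definition sketches (hypothetical tables) showing how the RANGE_N / CASE_N expression itself fixes the number of possible PartitionID values:

-- RANGE_N: 365 daily ranges + NO RANGE + UNKNOWN = 367 possible PartitionIDs
CREATE TABLE t_range
(
  id      INTEGER,
  load_dt DATE
)
PRIMARY INDEX (id)
PARTITION BY RANGE_N(load_dt BETWEEN DATE '2014-01-01' AND DATE '2014-12-31'
                     EACH INTERVAL '1' DAY, NO RANGE, UNKNOWN);

-- CASE_N: 3 cases + NO CASE + UNKNOWN = 5 possible PartitionIDs
CREATE TABLE t_case
(
  id     INTEGER,
  amount DECIMAL(12,2)
)
PRIMARY INDEX (id)
PARTITION BY CASE_N(amount <  100,
                    amount < 1000,
                    amount >= 1000,
                    NO CASE, UNKNOWN);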

What is the drawback of having many partitions? The sliding-window merge join. Consider including the partitioning column in the PI if possible (otherwise PI-based filtering causes as many accesses as there are partitions).
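
A hedged sketch of including the partitioning column in the PI (hypothetical table); a PI access that supplies both columns can then hit a single partition on a single AMP:

CREATE TABLE trx_b
(
  transaction_id BIGINT,
  trx_date       DATE,
  amount         DECIMAL(12,2)
)
PRIMARY INDEX (transaction_id, trx_date)
PARTITION BY RANGE_N(trx_date BETWEEN DATE '2010-01-01' AND DATE '2014-12-31'
                     EACH INTERVAL '1' DAY, NO RANGE);

-- Single-AMP, single-partition access, since both PI columns are given
SELECT *
FROM   trx_b
WHERE  transaction_id = 123456789
AND    trx_date       = DATE '2014-04-03';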

What happens with the out-of-range records?

We have the NO RANGE and NO CASE clauses in the PPI definition. They designate a catch-all partition: records whose partitioning value falls outside the defined ranges or cases land there. It can be a hidden trap: if you forget to maintain the date partition definition of a transaction table, then from a certain moment on all new records go into this single partition. The partition keeps fattening, and the queries somehow keep slowing down...
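
A hedged way to spot this trap is the built-in PARTITION column (hypothetical table name reused from above):

-- Row counts per PartitionID; a steadily growing count in the NO RANGE
-- partition signals that the date ranges should be extended
SELECT PARTITION, COUNT(*)
FROM   trx_b
GROUP BY 1
ORDER BY 1;

-- The usual remedy (hedged: release-dependent restrictions apply, especially
-- if the NO RANGE partition is already populated) is to extend the ranges:
-- ALTER TABLE trx_b MODIFY PRIMARY INDEX
--   ADD RANGE BETWEEN DATE '2015-01-01' AND DATE '2015-12-31' EACH INTERVAL '1' DAY;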

Multi level partitioning

This is a good trick. One can define partitioning "hierarchically", which is simply a "Cartesian product" of the partitions at each level; the result is a single PartitionID. In the case of 2-byte partitioning, the size of this "Cartesian product" must stay below 65,535.

What is remarkable in the Teradata implementation of multi-level PPI? Even if you filter only on lower-level partitioning key(s), partition elimination will happen. How? It calculates all possible combinations and produces the list of PartitionIDs to be scanned. Excellent.
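
A small hedged sketch (hypothetical table): the two levels multiply into the total number of possible PartitionIDs, yet a filter on the lower-level key alone still eliminates partitions:

CREATE TABLE sales_mlppi
(
  sales_id   INTEGER,
  sales_date DATE,
  region_id  INTEGER
)
PRIMARY INDEX (sales_id)
PARTITION BY (
  RANGE_N(sales_date BETWEEN DATE '2013-01-01' AND DATE '2014-12-31'
          EACH INTERVAL '1' MONTH, NO RANGE),              -- 25 values
  RANGE_N(region_id BETWEEN 1 AND 50 EACH 10, NO RANGE)    -- 6 values, 25 x 6 = 150 combinations
);

-- Filtering only on the lower-level key: Teradata enumerates every
-- <month, region-range> combination in which region_id = 7 can reside
SELECT COUNT(*) FROM sales_mlppi WHERE region_id = 7;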

Partitioning granularity

The next good question is: how fine should I define partitioning?
It depends. Basically I'd distinguish two main cases:
  • "Temporal" (date) partitioning
    The best partition size is the day. Most of the filtering is done on day level, and we have ~365 days a year, which is not too many partitions over a table's lifetime. If we partition in monthly units, the partition elimination ranges are coarser, while we still get 12 partitions a year, which can also be too many in the case of a PPI-NPPI join (a sketch follows after this list).
  • All others
    It really depends: on the goal and on the value demographics. It's good to align the partitioning with the filtering pattern (the frequent, relevant 'where' condition fragments).
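
For the date case, a hedged example of the daily grain over several years (hypothetical table); still comfortably below the 2-byte limit:

-- 8 years of daily partitions: ~2,922 ranges + NO RANGE, well under 65,535
CREATE TABLE transactions_daily
(
  trx_id   BIGINT,
  trx_date DATE
)
PRIMARY INDEX (trx_id)
PARTITION BY RANGE_N(trx_date BETWEEN DATE '2008-01-01' AND DATE '2015-12-31'
                     EACH INTERVAL '1' DAY, NO RANGE);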
Hope it helped. Please ask if something is missing or confusing.

Sponsored by PRISE Ltd.
www.prisetools.com

6 comments:

  1. Greetings Akos,
    Apologies for the split post to facilitate readability.
    1. Can you please explain what the sliding window merge join and the single window merge join are, and their differences.
    2. Request you to go through the following link and suggest the cause.
    http://forums.teradata.com/forum/database/row-key-based-merge-join-vs-rowhash-match-scan-sliding-window-join
    I have neither the query nor the data demographics for the above. I know it's hard to analyse without exact information; any prediction of the cause would help me.
    3. In case of a PPI with 10,000 partitions, if I insert data into only 1 partition, will that single partition alone be in the transient journal, or will the entire partition data be in the transient journal, causing much more impact on CPU?
    4. Say I declare an MLPPI with columns in the order a,b,c,d,e. If I query with where-clause columns b,d, will partition elimination take place? If so, what is the basis for the partitionid in the rowkey for the partitioned columns?
    -- Thanks, Cheeli

    1. Hi Cheeli, my answers:

      1. I think I cannot explain better than written here (pages 386 and 389):
      http://tunweb.teradata.ws/tunstudent/TeradataUserManuals/SQL_Reference_--_Request_And_Transaction_Processing.pdf

      2. It is really strange to me. I'd really need to have the system in front of me to reproduce the execution and examine it.
      The main thing is the radical drop in I/O, even though we have an additional sort step.
      Maybe the sorted table is small enough to fit into memory in a couple of slices (this is what the sliding-window stuff indicates), and that saves I/O operations. Theoretically a row-key based merge join should not read any data block more than once from either of the tables, but
      I would have to see the exact table definitions and data demographics.
      One guess: if divcd (and the other PPI columns) are not included in the PI, the matching records may reside in different blocks in both of the tables, therefore a data block may be read multiple times. In the second case they have been sorted into the proper order, and the multiple reads are eliminated. Again: I should see more details.

      3. The transient journal stores the "before images" of the records (only the rowid for newly inserted records) so that the
      "before-transaction" state of the table can be recovered in case of a rollback. Only the affected rows get into the journal; the number and granularity of partitions will not influence its size.

      4. Yes, definitely. If you filter on b and d, let's assume that B' ranges/cases of b and D' of d remain to be scanned after the filter.
      Let's have A,C,E possible number of ranges/cases in the a,c,e partition levels.
      You will get A x B' x C x D' x E number of partitions to be scanned after partition elimination.
      MLPPI creates the Cartesian product of the possible ranges/cases at each partitioning level, and associates a PartitionID with each possible combination.


  2. Following are the definitions of the indexes on the table.
    PRIMARY INDEX ( L_ORDERKEY )
    PARTITION BY (RANGE_N(l_shipdate BETWEEN DATE '1992-01-01' AND DATE '1998-12-01' EACH INTERVAL '1' MONTH, no range ),
    --range_n(l_quantity between 1.00 and 100.00 each 10.00),
    range_n(l_suppkey between 1 and 200 each 10, no range),
    range_n(L_RECEIPTDATE BETWEEN DATE '1992-01-01' AND DATE '1998-12-01' EACH INTERVAL '1' year, no range ))
    INDEX ( L_PARTKEY )
    INDEX ( L_SHIPDATE );

    I am confused about the number of partitions accessed by the following queries. Can you please give some insight on this?

    1. explain sel * from samples.itemppi_ppichk where l_shipdate='1995-01-01'
    EXCERPT FROM EXPLAIN PLAN:
    We do an all-AMPs RETRIEVE step from 168 partitions of
    samples.itemppi_ppichk with a condition of (
    "samples.itemppi_ppichk.L_SHIPDATE = DATE '1995-01-01'") into
    Spool 1 (group_amps), which is built locally on the AMPs. The
    input table will not be cached in memory, but it is eligible for
    synchronized scanning. The size of Spool 1 is estimated with low
    confidence to be 24 rows (3,216 bytes). The estimated time for
    this step is 0.03 seconds.

    2. explain sel * from samples.itemppi_ppichk where l_suppkey=10
    EXPLAIN PLAN:
    We do an all-AMPs RETRIEVE step from 680 partitions of
    samples.itemppi_ppichk with a condition of (
    "samples.itemppi_ppichk.L_SUPPKEY = 10") into Spool 1 (group_amps),
    which is built locally on the AMPs. The input table will not be
    cached in memory, but it is eligible for synchronized scanning.
    The size of Spool 1 is estimated with no confidence to be 5,989
    rows (802,526 bytes). The estimated time for this step is 0.10
    seconds.

    3. explain sel * from samples.itemppi_ppichk where l_suppkey=10 and L_RECEIPTDATE='1996-02-02'
    EXPLAIN PLAN:
    We do an all-AMPs RETRIEVE step from 85 partitions of
    samples.itemppi_ppichk with a condition of (
    "(samples.itemppi_ppichk.L_RECEIPTDATE = DATE '1996-02-02') AND
    (samples.itemppi_ppichk.L_SUPPKEY = 10)") into Spool 1
    (group_amps), which is built locally on the AMPs. The input table
    will not be cached in memory, but it is eligible for synchronized
    scanning. The size of Spool 1 is estimated with no confidence to
    be 4,492 rows (601,928 bytes). The estimated time for this step
    is 0.08 seconds.

    -- Thanks, Cheeli

    1. Hi Cheeli,

      First I answer this question, since it is simpler. You have 3-level partitioning, with the following possible numbers of partitions:
      L1: 84 (= 7 years * 12 months) + 1 (no range) = 85
      L2: 20 (that many suppkey ranges) + 1 (no range) = 21
      L3: 7 (number of years) + 1 (no range) = 8

      If you filter to one partition on a specific level, Teradata still has to scan all combinations of the remaining levels. For your cases:
      1. L1 (filtered) -> L1(1) x L2 x L3 = 1 x 21 x 8 = 168
      2. L2 filtered -> L1 x L2(1) x L3 = 85 x 1 x 8 = 680
      3. L2&L3 filtered -> L1 x L2(1) x L3(1) = 85 x 1 x 1 = 85

    2. So, no partition elimination?

  3. Again, thank you Akos
    -- Cheeli
