Tuesday, December 17, 2013

Using Partitioned Primary Index

How to use partitioned primary index (PPI)

This post is about row partitioning and will not discuss columnar.

What is partitioning?

To explain it correctly, let's get back to the basics...
Each Teradata table (except NoPI tables) has a "Primary Index", aka. PI, which is not a physical index, but a logical construct: one or more columns of the table that provide the input for the hashing algorithm. The hash value determines two things:
  • Which AMP will store the record
  • The storing order of the records within the AMP
If the PI is non-partitioned, the records are stored in the order of the hash value of the PI.

If you use (row) partitioning, you define it on the Primary Index.
In this case Teradata associates a 2-byte (or 2/8-byte as of V14.10) "partition code" with each record*, and the storing order becomes <partition code>,<hash value> (together called the RowKey).
Partitions are therefore not sub-tables or other physical objects; they only influence the storing order of the records.

* This implies that no more than 64k (2 bytes) or ~9.2 quintillion (8 bytes) partitions can exist. For details read the documentation of the appropriate Teradata version.
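
To see these partition codes in practice, you can query the system-maintained PARTITION column (a quick sketch, using the tablePPI example defined in the next section):

-- Partition code assigned to each row (0 for non-partitioned tables);
-- grouping by it shows how many rows fall into each partition.
SELECT PARTITION AS partition_code, COUNT(*) AS row_cnt
FROM tablePPI
GROUP BY 1
ORDER BY 1;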

What is the difference between PPI and NUSI?

A NUSI (Non-Unique Secondary Index) can serve similar purposes, but it is a completely different structure.
A NUSI is a separate physical subtable whose rows are stored on the same AMPs as the base-table rows, but in a different (hash or value) order.
For details please read Teradata documentation.
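
For comparison, a value-ordered NUSI on the date column would be created as a separate index object, e.g. like this (a sketch, using the tablePI example from the next section):

-- The NUSI lives in its own subtable, ordered by the values of Trx_dt,
-- unlike PPI, which only changes the storing order of the base table itself.
CREATE INDEX (Trx_dt) ORDER BY VALUES (Trx_dt) ON tablePI;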

How to define?

Non partitioned table:
create table tablePI
(
  Trx_id Integer
, Trx_dt Date
)
PRIMARY INDEX (Trx_id)


Partitioned table:
create table tablePPI
(
  Trx_id Integer
, Trx_dt Date
)
PRIMARY INDEX (Trx_id, Trx_dt**)
PARTITION BY RANGE_N(Trx_dt BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY, NO RANGE, UNKNOWN)


Highlights
  • **The partitioning key (Trx_dt here) can be part of the PI or not. This is very important, see below.
  • Partitioning can be single-level or multi-level (MLPPI)***
  • RANGE_N or CASE_N functions can be used to determine the partition code
  • The RANGE_N function must have constant (literal) interval endpoints and partition lengths.
  • The NO RANGE and UNKNOWN partitions store the out-of-range and NULL-valued records, respectively
***MLPPI is the technique of defining multiple (nested) partitioning levels on the table. Logically it looks like sub-partitions, but in practice it only influences the calculation of the partition code values, which remain a single linear 2/8-byte value across the whole table.
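
As an illustration, a two-level MLPPI definition might look like this (a sketch only; the Region_id column and its categories are hypothetical, not from the original example):

-- Level 1: daily date ranges; Level 2: region categories.
-- The two levels are combined into one linear partition code per row.
CREATE TABLE tableMLPPI
(
  Trx_id    INTEGER
, Trx_dt    DATE
, Region_id INTEGER
)
PRIMARY INDEX (Trx_id)
PARTITION BY (
  RANGE_N(Trx_dt BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY, NO RANGE, UNKNOWN)
, CASE_N(Region_id = 1, Region_id = 2, NO CASE, UNKNOWN)
);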

Pros - Cons of using PPI

PPI is a very useful feature, but not a silver bullet to be used everywhere. Consider the trade-offs:
  • (+) Partition elimination
    Only the relevant partitions are scanned while accessing data
  • (+) Interval filtering is supported
  • (+) Accelerates INSERTs
    When loading increment data into a populated table, typically fewer data blocks are affected, since only a few partitions are involved (if the partitioning is date based)
  • (-) 2 or 8 bytes of extra space allocation per record
  • (-) Compression is not allowed on the partitioning key column
  • (-) The partitioning key inclusion problem (see below)
  • (-) Partition elimination works best with literals
    Subqueries may cause full table scans (see the Filtering section below)

Design aspects

RANGE_N or CASE_N

These functions are used to define the partitioning. RANGE_N maps date (or integer) intervals onto partitions, while CASE_N works like a CASE-WHEN-THEN expression, where the outcome determines the partition.

Typically RANGE_N is used when we partition a transaction table by its date or timestamp, while CASE_N is popular in special cases like categorizing. You can use more than one column in the logical expression, but take care: all of them must appear in the filter condition to enable partition elimination.
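
A CASE_N based definition might look like this (a sketch; the table and the Trx_amount thresholds are hypothetical):

-- Categorize rows into "small / medium / other / unknown" partitions.
-- Conditions are evaluated in order; the first true one wins.
CREATE TABLE tableCASE
(
  Trx_id     INTEGER
, Trx_amount DECIMAL(12,2)
)
PRIMARY INDEX (Trx_id)
PARTITION BY CASE_N(
  Trx_amount <  1000
, Trx_amount < 10000
, NO CASE
, UNKNOWN);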

RANGE_N: what interval size?

It depends on the granularity of the data, the granularity of the filtering, and how long an interval has to be stored in the table. Usually daily partitioning is ideal.

RANGE_N: interval extension or intervals in advance?

If we load transactional data into our partitioned table, the date column we use as the partitioning key is populated with later and later dates, while the partition range definition is finite.
Partition ranges can be added to the RANGE_N definition periodically (the exact possibilities depend on the version), or we can define partitions far in advance (365 partitions are required for a year; 65k partitions cover roughly 180 years, which is more than enough). Note that empty partitions do not allocate space.

One of the methods above should be applied, otherwise the NO RANGE partition will grow excessively, which causes performance degradation due to less effective partition elimination.
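
Extending the ranges periodically could be done like this (a sketch; the exact ALTER TABLE syntax and its restrictions, especially regarding a populated NO RANGE partition, depend on the Teradata version):

-- Add the next year's daily partitions to the existing RANGE_N definition.
ALTER TABLE tablePPI
MODIFY PRIMARY INDEX
ADD RANGE BETWEEN DATE '2014-01-01' AND DATE '2014-12-31' EACH INTERVAL '1' DAY;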

Partitioning Key: include in PI or not?

This is the tricky point.
The partitioning key is the column (or columns) that determines the partition, i.e. the one used in the RANGE_N/CASE_N definition. We can decide whether or not to include it in the Primary Index.

Let's take an example. We have a master-detail pair of tables, nicely "equi-PI"-ed for effective join:

CREATE TABLE ORDER_HEAD
(
  ORDER_NO INTEGER
, ORDER_DT DATE
) UNIQUE PRIMARY INDEX (ORDER_NO);

CREATE TABLE ORDER_ITEM
(
  ORDER_NO INTEGER
, ORDER_ITEM_NO INTEGER
, PROD_NO INTEGER
) PRIMARY INDEX (ORDER_NO);


We modify ORDER_HEAD's PI:
UNIQUE PRIMARY INDEX (ORDER_NO, ORDER_DT)

PARTITION BY RANGE_N(ORDER_DT BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN)

Should we include ORDER_DT or not? Which is better, what is the difference?
  • Not included
    The ORDER_HEAD and ORDER_ITEM tables will have the same AMP distribution, but a different physical order within the AMPs.
    Each join either requires sorting the selected ORDER_HEAD records in spool, or ORDER_ITEM is merge joined sequentially against each selected non-empty partition of ORDER_HEAD (called a sliding-window merge join).
  • Included
    The ORDER_HEAD and ORDER_ITEM tables will have different AMP distributions, so each join requires a redistribution. Why not use the same PI on ORDER_ITEM? Because ORDER_DT is not available in that table.
Neither of the above is acceptable in many cases. What should we do? In this case I would copy ORDER_DT into the ORDER_ITEM table as well, and use the same "included" version of the PI. It requires some extra space and load-time logic, but gives a great gain when accessing the data.
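
The proposed variant could look like this (a sketch of the idea, not production DDL):

-- ORDER_DT is denormalized into the item table so that both tables share
-- the same PI and the same partitioning, enabling RowKey based merge joins.
CREATE TABLE ORDER_ITEM
(
  ORDER_NO      INTEGER
, ORDER_ITEM_NO INTEGER
, PROD_NO       INTEGER
, ORDER_DT      DATE
) PRIMARY INDEX (ORDER_NO, ORDER_DT)
PARTITION BY RANGE_N(ORDER_DT BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY, NO RANGE, UNKNOWN);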

Use cases

Filtering

This select will eliminate all partitions except those three:
select * from ORDER_HEAD where order_dt between '2013-12-12' (date) and '2013-12-14' (date);

This select will generate an all-rows scan:
select * from ORDER_HEAD where cast( order_dt as char(7)) = '2013-12';

This select will also generate an all-rows scan* (sub-query):
select * from ORDER_HEAD  where order_dt in (select max(calendar_date) from sys_calendar.calendar  where year_of_calendar=2013 and month_of_year=5);
Why? The optimizer has to determine which partitions to access at the time it generates the execution plan, and at that time it cannot know the result of the sub-query. That is it.

* I got a proper comment asking me to double-check this statement. Indeed, this information is out-of-date. With recent versions of Teradata (V13.10..V14.10) I experienced 3 different results:
  • Full scan
    E.g. the sub-query contains a "group by"
  • Dynamic partition elimination
    The sub-query is simple; the plan contains an "enhanced by dynamic partition elimination" section
  • Plan-time partition elimination
    Literal condition or a very simple sub-query. Parse-time evaluation enables the optimizer to determine which partitions have to be scanned. Plan: "...We do an all-AMPs ... step from 3 partitions of...". I do not know exactly what decides between full scan, dynamic or plan-time elimination... Explanations are welcome.

Join

We join two tables, T1 and T2. The combinations below show what happens depending on whether each table is partitioned and whether the partitioning key (b) is included in the PI (cases not listed are symmetric):

  • T1 PI:(a) vs. T2 PI:(a)
    Join: T1.a=T2.a
    RowHash match merge join
  • T1 PI:(a) PART(b) vs. T2 PI:(a)
    Join: T1.a=T2.a
    T1 sorted by hash(a), or sliding-window merge join
  • T1 PI:(a) PART(b) vs. T2 PI:(a) PART(b)
    Join: T1.a=T2.a
    T1 & T2 sorted by hash(a), or sliding-window merge join (NxM partition combinations)
    Join: T1.a=T2.a and T1.b=T2.b
    T1 & T2 sorted by RowKey, RowKey based merge join
  • T1 PI:(a,b) PART(b) vs. T2 PI:(a)
    Join: T1.a=T2.a
    T1 redistributed and sorted by hash(a)
  • T1 PI:(a,b) PART(b) vs. T2 PI:(a) PART(b)
    Join: T1.a=T2.a
    T1 redistributed by hash(a), T2 sorted by hash(a), then merge join
    Join: T1.a=T2.a and T1.b=T2.b
    T2 redistributed and sorted by RowKey, RowKey based merge join
  • T1 PI:(a,b) PART(b) vs. T2 PI:(a,b) PART(b)
    Join: T1.a=T2.a and T1.b=T2.b
    RowKey based merge join
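
For example, with the "included" variant of ORDER_HEAD and ORDER_ITEM above, a join condition that contains both the PI columns and the partitioning key can take the RowKey based merge join path (a sketch):

-- Both the PI columns and the partitioning key appear in the join condition,
-- so matching partitions can be joined directly (RowKey based merge join),
-- and the literal date filter allows partition elimination on both sides.
SELECT h.ORDER_NO, h.ORDER_DT, i.PROD_NO
FROM ORDER_HEAD h
JOIN ORDER_ITEM i
  ON  h.ORDER_NO = i.ORDER_NO
  AND h.ORDER_DT = i.ORDER_DT
WHERE h.ORDER_DT BETWEEN DATE '2013-12-12' AND DATE '2013-12-14';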


Insert

Let's take a transaction table like ORDERS. In practice we load it periodically (e.g. daily) with the new increment, which is typically focused on a short interval of the transaction date/time. If the ORDERS table is not partitioned, the hashing algorithm spreads the new records evenly over the data blocks of the table, therefore Teradata has to modify far more data blocks than the increment itself would occupy.

But if the ORDERS table is partitioned, the physical order of the records is primarily determined by the partitioning key. This means that the increment will reside in very few partitions, close together, and the insert operation requires writing approximately the same number of blocks as the increment itself occupies.
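
A minimal sketch of such an increment load, reusing the partitioned ORDER_HEAD example from above (ORDER_STAGE is a hypothetical staging table):

-- The increment covers a single ORDER_DT, so it lands in one partition:
-- only the data blocks of that partition need to be written.
INSERT INTO ORDER_HEAD (ORDER_NO, ORDER_DT)
SELECT ORDER_NO, ORDER_DT
FROM ORDER_STAGE
WHERE ORDER_DT = DATE '2013-12-17';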

For more details on PPI please refer to the documentation of the appropriate Teradata version.

To be continued...

10 comments:

  1. Awesome content Akos, thank you for all your articles, especially those related to Teradata tuning!! Can you please post more content on DBQL!

    1. Thank you very much Cheeli, I plan to continue the DBQL thread. Do you have some specific questions, or are you interested in general?

  2. Akos, regarding your comment that partitioning will not be used on a sub-select: can you double-check that? I believe you should see "enhanced by dynamic partition elimination".

    1. Anonymous, thank you for your remark, I re-examined the case and modified the post. Thanks for your contribution to making the content more precise.

  3. Hi Akos,
    Thanks for the post. Can you explain the point "Full scan: e.g. sub-query contains a 'group by'"?

    1. Hi Agilan,

      Eg:
      sel * from TBL1 where Col1 in (sel ColA from TBL2 group by 1)

      This case did not result in "dynamic partition elimination" on our system (V13.10); instead it was processed as if TBL1 were not partitioned (all partitions scanned). Text in the explain:
      3) We do an all-AMPs RETRIEVE step from
      TBL1 by way of an all-rows scan with no residual conditions
      into Spool 7 (all_amps), which is built locally on the AMPs.

  4. This comment has been removed by the author.

  5. Hi Akos, nicely explained! I have a question:
    If I have a table with a DDL definition like the following:
    create table Tablename
    (order_no integer,
    order_date date)
    primary index (order_no)
    partition by RANGE_N(order_date between DATE '2007-01-01'
    and DATE '2007-02-01' each interval '1' day)
    then the rows in an AMP will be sorted by order_date. But as we do not specify NO RANGE,
    will the rows which are not in the specified range still be stored in one partition?
    Will they be arranged by rowhash, not by order_date?

    1. Hi Shubha,

      In this case you will get a "5728 Partitioning violation" error, and your insert will be refused.
      In the case of PPI, the records are always sorted by the composite key (called the RowKey).
      If you specify the NO RANGE partition, it will be a specific partition with a given ID, and it acts like all the other ones.

  6. excellent explanation akos,

    I have a table:

    create table Tablename
    (order_no integer,
    order_date date)
    primary index (order_date, order_no)
    partition by Range_N(order_date between DATE '2013-01-01'
    and DATE '2099-12-31' each interval '1' day)

    I have to extract and load the data into a new TD table on a daily basis, like an incremental load. My questions:
    a. I query the data from the source (the above table) on the basis of order_no
    b. What kind of table should I create to extract and load into?
    c. The table loaded in step b will be joined with different tables to get the data for other processes.
