Most useful Teradata interview questions -- PART - V
180.What are permanent journals in Teradata?
- Journals are used to capture information about tables in Teradata. Permanent journals capture the pre-transaction and/or post-transaction images of rows in journal-enabled tables.
- Journal tables are assigned PERM space; they can reside in the same database as the parent table or in a different database.
- They are mainly used for data protection and sometimes for disaster recovery (FALLBACK is better suited for that).
- A database's default permanent journal can be enabled or disabled with MODIFY DATABASE <databasename> AS DEFAULT JOURNAL TABLE = <databasename>.<jrnltbl> (or AS DROP DEFAULT JOURNAL TABLE to disable it).
- The ARCMAIN utility provides the ability to back up journal tables.
- Details about all journal tables present in a Teradata system can be found in the DBC.Journals view.
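As a sketch (the database, table, and journal names here are hypothetical), a permanent journal is typically defined when the database and its tables are created:

```sql
-- Hypothetical names; PERM size shown only for illustration.
CREATE DATABASE Sales_DB AS
    PERM = 1000000000,
    DEFAULT JOURNAL TABLE = Sales_DB.Sales_Jrnl;

-- The table opts in to journaling via its table-level options.
CREATE TABLE Sales_DB.Order_Table,
    FALLBACK,
    BEFORE JOURNAL,          -- capture pre-transaction images
    DUAL AFTER JOURNAL       -- keep two copies of post-transaction images
(
    Order_Id     INTEGER,
    Order_Total  DECIMAL(10,2)
)
PRIMARY INDEX (Order_Id);
```

Dropping the default journal later would use MODIFY DATABASE Sales_DB AS DROP DEFAULT JOURNAL TABLE = Sales_Jrnl; verify the exact syntax against your Teradata release before relying on it.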
181.What are the different types of journals in Teradata?
There are 3 different types of journals available in Teradata.
- 1. Transient Journal – maintains current-transaction history. Once a query completes successfully, its entries are deleted; if the transaction fails, the journal is used to roll the data back.
- 2. Permanent Journal – defined when a table is created (or altered). It can store BEFORE or AFTER images of rows, and DUAL copies can also be kept. The permanent journal maintains the complete change history of a table.
- 3. Down-AMP Recovery Journal (DARJ) – activates when an AMP that was supposed to process work goes down. The journal stores all entries destined for the down AMP; once that AMP is back up, the DARJ entries are applied to bring it back in sync with the current environment.
182.What are benefits of Permanent Journal?
- Permits capture of before images for database rollback.
- Permits capture of after images for database roll forward.
- Permits archiving change images during table maintenance.
- Reduces need for full-table backups.
- Provides a means of recovering NO FALLBACK tables.
Two trade-offs to keep in mind (costs rather than benefits):
- Requires additional disk space for change images.
- Requires user intervention for archive and recovery activity.
183.What is the purpose of Before Journal and After journal?
Both the before journal and the after journal in a permanent journal are used to perform selective or full recovery of data: before images support rollback, and after images support roll-forward.
184.Where does Teradata store the transient journal?
In PERM space, in the table DBC.TransientJournal.
This special table can grow beyond DBC's PERM limit, until the whole system runs out of PERM space.
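Because the transient journal lives in DBC, its footprint can be monitored like any other table. A sketch using the standard data dictionary view (verify view and column names on your release):

```sql
-- Per-AMP PERM usage of the transient journal.
SELECT  Vproc,
        CurrentPerm,
        PeakPerm
FROM    DBC.TableSizeV
WHERE   DatabaseName = 'DBC'
  AND   TableName    = 'TransientJournal'
ORDER BY Vproc;
```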
Collect Statistics
185.What is COLLECT STATISTICS in Teradata? What is it used for and how does it work?
- These statistics are used by the PE (Parsing Engine) to optimize the query plan, which can be viewed with the EXPLAIN command.
- When COLLECT STATISTICS is executed, the statistics are stored for the PE's use.
- The Optimizer plans an execution strategy for every SQL query submitted to it.
- For the Optimizer to consistently choose the optimum strategy, it must be provided with reliable, complete, and current demographic information regarding all of these factors. The best way to assure that the Optimizer has all the information it needs to generate optimum execution strategies is to COLLECT STATISTICS.
- Statistics tell the Optimizer how many rows there are per value.
- May improve performance of complex queries and joins.
- NUSI Bit Mapping requires collected statistics.
- Helpful in accessing a column or index with uneven value distribution.
- Stale statistics may mislead the Optimizer into poor decisions.
- Statistics remain valid across a reconfiguration of the system.
- COLLECT is resource intensive and should be done during off hours.
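The basic statement forms look like this (table and column names are illustrative):

```sql
-- Collect on a single column.
COLLECT STATISTICS ON Employee_Table COLUMN (Dept_No);

-- Collect on an index (e.g., a NUSI).
COLLECT STATISTICS ON Employee_Table INDEX (Last_Name);

-- Re-collect everything previously collected on the table.
COLLECT STATISTICS ON Employee_Table;
```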
186.When should the statistics be collected?
Here are some good guidelines on what to collect statistics on:
- All non-unique indices (NUSIs)
- Non-indexed columns used in joins
- The primary index of small tables
- Secondary indexes defined on the tables
- The primary index of a join index
- Secondary indices defined on any join index
- Join index columns that frequently appear in WHERE search conditions
- Any columns that frequently appear in WHERE search conditions or in the WHERE clause of joins
187.How to make sure the Optimizer chooses a NUSI over a Full Table Scan (FTS)?
- The optimizer prefers an FTS over a NUSI when there are no statistics defined on the NUSI columns.
- Without statistics it has no information about the NUSI subtables, so it scans the full table instead of using them.
- It is therefore always suggested to collect statistics whenever NUSI columns are defined on a table, and to verify that the index is being used by checking the EXPLAIN plan.
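For example (names hypothetical), after collecting on the NUSI you can confirm in the plan that the secondary index is chosen rather than an all-rows scan:

```sql
COLLECT STATISTICS ON Order_Table INDEX (Customer_Number);

-- Look for the NUSI in the plan output rather than an all-rows scan.
EXPLAIN
SELECT  Order_Id, Order_Total
FROM    Order_Table
WHERE   Customer_Number = 1001;
```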
188.What is collected by COLLECT STATISTICS?
The Teradata Parsing Engine (PE) is in charge of creating the PLAN for the AMPs to follow. The PE works best when Statistics have been collected on a table. Then it knows:
- The number of rows in the table
- The average row size
- Information on all indexes on which statistics were collected
- The range of values for the column(s) in which statistics were collected
- The number of rows per value for the column(s) in which statistics were collected
- The number of NULLs for the column(s) in which statistics were collected
189.What is the purpose of COLLECT STATISTICS?
- The optimizer uses this synopsis data to generate efficient table access and join plans
- When statistics are collected, Teradata does a full table scan and sorts the column or index values.
- You don't COLLECT STATISTICS on all columns and indexes because doing so consumes space and collection time without benefit.
- If you run an Explain on a query and the row estimate has No Confidence or Low Confidence, then that is a sign that no statistics were collected.
- The first time you collect statistics, you collect them at the index or column level. After that, you can simply collect at the table level, and all columns previously collected on are collected again. It is a mistake to collect statistics only once and never again; COLLECT STATISTICS each time a table's data changes by 10%.
190.What does collect statistics do to help the PE come up with a better plan?
- Access Path – The PE will easily choose any Primary Index access (UPI or NUPI), and it will also easily choose a Unique Secondary Index (USI), but statistics really help the PE decide whether to do a Full Table Scan, use a Non-Unique Secondary Index (NUSI), or AND multiple NUSIs together to perform a NUSI bitmap.
- Join Method – Collected statistics give Teradata a better idea whether to do a merge join, product join, hash join, or nested join.
- Join Geography – When two rows are joined they must physically be located on the same AMP. The only way this happens naturally is if the join column (PK/FK) is the Primary Index of both tables. Most of the time it is not, and Teradata must decide how to relocate rows so they are co-located on the same AMP: redistribute (rehash by the join column) one or both tables, or duplicate the smaller table across all AMPs.
- Join Order – All joins are performed two tables at a time. What is the best order to join the tables? When more than two tables are involved, this becomes very important.
It is the access path, join method, join geography, and join order that make statistics collection so vital to all Teradata systems.
Related Links
Most useful Teradata interview questions -- PART - I
Most useful Teradata interview questions -- PART - II
Most useful Teradata interview questions -- PART - III
Most useful Teradata interview questions -- PART - IV
Most useful Teradata interview questions -- PART - V
Most useful Teradata interview questions -- PART - VI
Most useful Teradata interview questions -- PART - VII
191.How do you know if statistics were collected on a table?
Syntax: HELP STATISTICS <Table Name>;
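For example (table name hypothetical; the exact output columns vary by release, but typically include the collection date/time, the number of unique values, and the column names):

```sql
HELP STATISTICS Employee_Table;
```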
192.Where does Teradata keep the collected statistics?
They are spread evenly across all AMPs, stored in three DBC tables:
- DBC.Indexes (for multi-column indexes only)
- DBC.TVFields (for all columns and single-column indexes)
- DBC.StatsTbl (Teradata V14 and beyond)
193.How to Recollect STATISTICS on a Table
Here is the syntax for re-collecting statistics on a table:
COLLECT STATISTICS ON <Tablename> ;
The first time you collect statistics, you do it for each individual column or index that you want to collect on. When a table changes its data by 10% due to Inserts, Updates, or Deletes you merely use the command above, and it re-collects on the same columns and indexes previously collected on.
194.What is a Random AMP Sample?
Teradata always does a Random AMP Sample.
The Parsing Engine will hash the Table ID for a table being queried and then use the Hash Map to determine which AMP will be assigned to do a Random AMP Sample for this table.
Remember that a Random AMP sample only applies to indexed columns and table row counts.
In the old days, Teradata never did a Random AMP Sample unless statistics were not collected, but these days Teradata always does a Random AMP Sample before placing the Table Header inside each AMP’s FSG Cache.
This allows Teradata to compare these statistics with collected statistics to determine if statistics are old and stale.
If the statistics are determined to be out of date, then the Random AMP Sample is used.
The Random Sample is kept in the Table Header in FSG Cache.
Teradata compares the collected statistics to a Random AMP Sample (obtained by sampling a single AMP before placing the Table Header in FSG Cache). This compare determines if the statistics will be used or if they should be replaced by the sample.
195.Multiple Random AMP Samplings
The PE does Random AMP Sampling based on the Table ID.
The Table ID is hashed, and that AMP is always selected as the sampled AMP for that table. This assures that no single AMP is tasked with sampling for too many tables, but if the table is badly skewed a single-AMP sample can mislead the PE.
So now, more than one AMP can be sampled when generating row counts for a query plan for much better estimations on row count, row size, and rows per value estimates per table.
In the DBS Control area, field 65 can now set the standard for how AMPs are sampled.
65. RandomAmpSampling – this field determines the number of AMPs to be sampled for getting the row estimates of a table.
The valid values are D, L, M, N or A.
D – The default is one AMP sampling (D is the default unless changed.)
L – Maximum of two AMPs sampling
M – Maximum of five AMPs sampling
N – Node Level sampling (all the AMPs in a node are sampled).
A – System Level sampling (all the AMPs in the system are sampled).
Multiple AMPs can now be used for the random AMP sample so a higher number of AMPs sampled will provide better estimates to counter skewed results. But, it can cause short running queries to run slower just so long running queries can run faster.
Random AMP Estimates for NUSI Secondary Indexes
The Random AMP reads a couple of cylinders and some data NUSI blocks and then does some simple math to estimate the Rows Per NUSI Value. The PE then knows how strong or weak the WHERE clause is using the NUSI and if it should even use the NUSI. This is the most important decision for the Parsing Engine. Should it just do a Full Table Scan or use the NUSI? That is the biggest reason the PE needs statistics. That is why you should always collect statistics on all NUSI indexes.
You really only need to collect statistics on a Unique Secondary Index column if there are a lot of SQL statements on non-equality conditions such as range queries.
There’s No Random AMP Estimate for Non-Indexed Columns
Teradata does not do a Random AMP Sample for non-indexed columns that are used in the WHERE clause of the SQL
Today’s Teradata systems always perform a random AMP sample even if tables have statistics. Then, they compare the random AMP sample with the statistics to determine if the statistics are stale.
A random AMP is selected for a random sample. Two things happen:
1) Indexes are sampled on the random AMP, and the PE estimates based on the total number of AMPs in the system.
2) If a column in the WHERE clause of the SQL is not an Index, the PE assumes that 10% of the rows will come back. If two columns are in the WHERE clause, then it assumes 7.5% of the rows will come back. If three columns are in the WHERE Clause, it assumes 5%.
196.When to COLLECT STATISTICS Using only a SAMPLE
You might consider Collecting Statistics with SAMPLE if:
- You are collecting statistics on a very large table.
- Collecting statistics has become a system performance or cost problem because the system is so busy.
Don't consider Collecting Statistics with SAMPLE if:
- The tables are small.
- You intend to replace all existing full-scan collected statistics.
- The column's data is badly skewed.
COLLECT STATISTICS can be very time consuming because it performs a full table scan and then performs a lot of statistical calculations. Because Collect Statistics runs infrequently and benefits query optimization, it is considered a necessary task. Without statistics, query performance will suffer. The bad news about sampled statistics is that they may not be as accurate, which could negatively affect the PE’s plans. In most cases, sampled statistics are better than no statistics. Don’t use Sample unless necessary!
Examples of COLLECT STATISTICS Using only a SAMPLE
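Representative examples in the V14 style used elsewhere in this section (names hypothetical):

```sql
-- Let the system pick the sample size...
COLLECT STATISTICS USING SYSTEM SAMPLE COLUMN (Order_Date) ON Order_Table;

-- ...or state the sample percentage explicitly.
COLLECT STATISTICS USING SAMPLE 10 PERCENT COLUMN (Order_Date) ON Order_Table;
```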
Sampled statistics are generally more accurate for data that is not skewed. For example, columns or indexes that are unique or nearly unique are not skewed. Because the PE needs to be aware of skewed data, you should not collect with sample on skewed data. That is why sampling is generally more appropriate for indexes than non-indexed column(s). If you recollect statistics on a Sample, it recollects with the same Sample!
197.How to Collect Statistics on a PPI Table on the Partition
Here is the syntax for collecting statistics on a PPI table.
COLLECT STATISTICS on <Tablename> COLUMN PARTITION;
Here is an actual example:
COLLECT STATISTICS ON Order_Table_PPI COLUMN PARTITION;
Three reasons to Collect on the Partition:
The Parsing Engine will have a better plan for PPI Tables.
This helps the most with Partition Elimination on Range Queries.
This is especially helpful when a table has a lot of empty partitions.
The Parsing Engine can use this information to better estimate the query cost when there are a significant number of empty partitions. If PARTITION statistics are not collected, empty partitions may cause the Parsing Engine to underestimate the number of rows in a partition. You shouldn’t use WITH SAMPLE to collect on Partitions.
198.Teradata V12, V13, and V14 Statistics Enhancements
Teradata V12 and V13 Statistics Enhancements
- In V12, Extrapolate Statistics is designed to more accurately provide a statistical estimate for date-range-based queries that specify a "future" date outside the bounds of the current statistics. This results in fewer re-collections.
- In V12, Stale Statistics Detection compares the Random AMP Sample with the collected statistics and determines whether they are stale and should not be used.
- In V13, Statistics can now be collected on Volatile Tables.
- In V13, PARTITION statistic capabilities have been added to Global Temporary Tables.
- In V13, Multi-Column statistics are now available on Hash and Join Indexes.
- In V13, Sample Statistics are available on Tables, Volatile Tables, Global Temporary Tables, Hash Indexes and Join Indexes, including the Partition Columns.
Teradata V14 Statistics Enhancements
- There is now a SUMMARY option to collect table-level statistics.
- The SYSTEM SAMPLE option allows the system to determine the sample percentage.
- Sampling options have been enhanced (e.g., SAMPLE n PERCENT).
- Statistics are stored in DBC.StatsTbl to reduce access contention and improve performance.
- New views: DBC.StatsV, DBC.ColumnStatsV, DBC.MultiColumnStatsV, and DBC.IndexStatsV.
- The SHOW STATISTICS statement reports detailed statistics in plain text or XML format.
- Internal PE enhancements to histogram structure and use, including:
  - Storing statistics data in their native Teradata data types without losing precision
  - Enhanced extrapolation methods for stale statistics
  - Maintaining statistics history
- Teradata V14 now allows you to determine a sampling percentage for sampled statistics. You can even collect/recollect either summary statistics or both full and summary statistics combined. You can now collect statistics on global temporary tables, and you can provide a name for statistics collected while also being able to specify the column ordering for multicolumn statistics. There is also a dedicated statistics cache that is designed to improve query optimization time.
To collect sample statistics using the system default sample:
COLLECT STATISTICS USING SYSTEM SAMPLE COLUMN (Product_ID) ON Sales_Table;
To collect sample statistics by scanning 15 percent of the rows and use 100 intervals:
COLLECT STATISTICS USING SAMPLE 15 PERCENT AND MAXINTERVALS 100 COLUMN (Product_ID) AS Product_Stats ON Sales_Table;
Teradata V14 Summary Statistics
New in Teradata 14.0, table-level statistics known as “summary statistics” are collected whenever column or index statistics are collected. Summary statistics do not cause their own histogram to be built, but rather they create important facts about the table undergoing collection that are held in the new DBC.StatsTbl. Here are some of the items in “summary statistics”:
- Row count
- Average block size
- Block-level compression metrics
- Temperature
SHOW SUMMARY STATISTICS VALUES ON Employee_Table;
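Summary statistics can also be refreshed on their own, without recollecting any column histograms (table name illustrative):

```sql
-- Refresh only the table-level facts (row count, block size, etc.).
COLLECT SUMMARY STATISTICS ON Employee_Table;
```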
Teradata V14 MaxValueLength
COLLECT STATISTICS
USING MAXVALUELENGTH 50
COLUMN ( Product_Name)
ON Product_Table ;
Before V14, whenever you collected statistics, Teradata only placed the first 16 bytes of a value in the statistics; long values were cut off. Now the default is 25 bytes, but you can use the MAXVALUELENGTH keyword (example above) to specify the length you want.
MAXVALUELENGTH lets you expand the length of the values contained in the histogram for that statistic. The new default length is 25 bytes when previously it was 16. If needed, you can specify well over 1000 bytes for a maximum value length. The 16-byte limit on value sizes in earlier releases was always padded to 16 bytes, but now the length can be longer, however no padding is done.
Teradata V14 MaxIntervals
COLLECT STATISTICS
USING MaxIntervals 500
COLUMN ( Last_Name)
ON Employee_Table ;
Before V14, whenever you collected statistics Teradata did a full table scan on the values, sorted them, and then placed them into 200 intervals. Now, the default is 250 intervals, but you can specify (example above) the number of intervals you desire.
Each statistics interval highlights its single most popular value, and the number of rows that carry that value are recorded. The rest of the values in the interval are estimated. By increasing the number of intervals, the optimizer can accurately get a better row count for a greater number of the most popular values. A larger number of intervals can be useful if you have widespread skew on a column or index you are collecting statistics on and you want more individual high-row-count values to be represented in the histogram. The range is 0 – 500 for MaxIntervals.
Teradata V14 Sample N Percent
COLLECT STATISTICS
USING Sample 20 Percent
COLUMN ( Last_Name)
ON Employee_Table ;
Previously, USING SAMPLE always defaulted to the system parameter; now you can explicitly state the percentage you want for each column or index.
SAMPLE n PERCENT allows you to specify sampling at the individual statistics collection level, rather than at the system level. Now, different levels of statistics sampling to different columns and indexes can be performed. The better you get at knowing your data and the queries upon them, the more you can specifically use the sampling to better help the PE.
Teradata V14.10 Statistics Collection Improvements
Expressions can now be specified in COLLECT STATS statements.
Expressions include: Substr(), MOD, CONCATENATION, Format Conversion expressions like Uppercase, Lowercase, Data type Conversions, CASE-Expressions, BEGIN and END expressions of PERIOD types.
COLLECT STATISTICS USING SYSTEM SAMPLE COLUMN
(EXTRACT(MONTH FROM Order_Date), Order_Total) AS Month_Plus_Total ON Order_Table ;
It provides automatic downgrading of full statistics to sampled statistics when a column is eligible and the system determines that full statistics can safely be avoided.
Teradata will detect the spool space required before attempting the statistics collection. If it is determined that there will not be enough space, then Teradata will:
- Build a new layer of local statistics on one of the AMPs, and then base the global statistics on those local statistics.
It provides statistics re-collection performance improvements to reduce statistics re-collection overhead by automatically skipping the recollection process when not needed.
Above are the features of Teradata V14.10 Statistics Collection Improvements.
Teradata V14.10 Statistics Collection Threshold Examples
TIME and CHANGE based THRESHOLD option can be used to reduce statistics re-collection overhead by automatically skipping the recollection process when not necessary.
For example, if the user sets a TIME THRESHOLD for specific statistics collection to be 5 days, and issues a collect stats after 3 days, the system will not collect statistics.
For example, if a user sets a CHANGED based THRESHOLD of 10%, and the data has not changed by 10%, the system will not collect statistics. Insert and Updates are taken into consideration to determine data growth.
COLLECT STATISTICS USING THRESHOLD 10 PERCENT AND THRESHOLD 5 DAYS
COLUMN (Customer_Number) ON Customer_Table ;
If a COLLECT STATISTICS statement is issued less than 5 days after the above statement, or if the data has not changed by 10%, the collection will not take place.
Here is SQL that will override the above statement:
COLLECT STATISTICS USING NO THRESHOLD
FOR CURRENT COLUMN (Customer_Number) ON Customer_Table ;
Teradata V14.10 AutoStats feature
Teradata V14.10 AutoStats identifies and collects needed missing statistics and detects stale statistics for refreshing. If statistics are not being used by the optimizer, they will be removed.
A new repository of a system supplied database named TDSTATS stores metadata for all stats collections. This is created by a new process called DIPSTATS.
External stored procedures (XSPs) perform the stats management process.
There have been DBQL enhancements to log Optimizer statistics recommendations and usage with a dedicated DBQL logging option. DBQL will analyze query plans that have been logged to recommend new stats or identify unused stats.
Teradata recommends that customers NOT replace all of their existing stats-management procedures, but instead begin experimenting with AutoStats and build on the experience.
Above are the features of Teradata V14.10 AutoStats.
When Teradata collects statistics it creates a histogram. Each interval of the histogram records:
1. Highest Sorted Value in the Interval
2. Most Frequent Value in the Interval
3. Rows with the Most Frequent Value
4. Other Values in the Interval
5. Number of Rows of other Values
Related Links
Most useful Teradata interview questions -- PART - I
Most useful Teradata interview questions -- PART - II
Most useful Teradata interview questions -- PART - III
Most useful Teradata interview questions -- PART - IV
Most useful Teradata interview questions -- PART - V
Most useful Teradata interview questions -- PART - VI
Most useful Teradata interview questions -- PART - VII