Posted to dev@hive.apache.org by "Carter Shanklin (JIRA)" <ji...@apache.org> on 2017/06/01 15:48:04 UTC

[jira] [Created] (HIVE-16802) Standard Table Sampling

Carter Shanklin created HIVE-16802:
--------------------------------------

Summary: Standard Table Sampling
Key: HIVE-16802
URL: https://issues.apache.org/jira/browse/HIVE-16802
Project: Hive
Issue Type: Sub-task
Reporter: Carter Shanklin

Hive's table sampling implementation has two major issues:
1. It makes assumptions about file layout. The main sampling approach requires tables to be bucketed. Bucketed tables are becoming less common because the need to bucket has decreased over time, so many users cannot benefit from sampling.
2. The syntax is non-standard.

The SQL standard defines a TABLESAMPLE operator that requires a sampling method and a probability p, expressed as a percentage. The number of output records is approximately N * p/100, where N is the number of records in the table. Two sampling methods are defined: BERNOULLI and SYSTEM.

With the BERNOULLI sampling method, each record is evaluated as an independent Bernoulli trial.
The SYSTEM sampling method only controls the size of the output set; there is no independence guarantee between rows. SYSTEM sampling is commonly done at the block or page level: if a block is selected, all records from that block are returned. Hive's current sampling methods are effectively types of SYSTEM sampling.
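
The difference between the two methods can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not Hive code: the function names, the in-memory list standing in for a table, and the block_size parameter are all assumptions made for the example.

```python
import random


def bernoulli_sample(rows, p, rng):
    """Keep each row independently with probability p/100 (one trial per row)."""
    return [row for row in rows if rng.random() * 100 < p]


def system_sample(rows, p, rng, block_size=4):
    """Keep or drop whole blocks of rows; one trial per block, not per row.

    block_size is a hypothetical stand-in for a storage block or page.
    """
    sample = []
    for start in range(0, len(rows), block_size):
        if rng.random() * 100 < p:
            # The block was selected, so every row in it is returned.
            sample.extend(rows[start:start + block_size])
    return sample


rows = list(range(1000))          # stand-in for a table with N = 1000 records
rng = random.Random(0)
bernoulli_rows = bernoulli_sample(rows, 50, rng)
system_rows = system_sample(rows, 50, rng)
```

Both samples contain roughly N * p/100 rows on average, but in the SYSTEM sketch rows from the same block always appear or disappear together, which is why it offers no independence guarantee between rows.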

The standard also allows you to seed the PRNG used for the trials with the REPEATABLE clause. The same input table with the same p value and the same REPEATABLE value produces the same output.
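
A minimal sketch of the repeatability property, again in Python and not taken from Hive: the seed argument is a hypothetical stand-in for the REPEATABLE value, and the list stands in for the table.

```python
import random


def bernoulli_sample(rows, p, seed=None):
    """Bernoulli sampling with an optional seed playing the role of REPEATABLE."""
    # A fresh generator per call; seed=None falls back to system entropy,
    # so an unseeded call is not expected to be repeatable.
    rng = random.Random(seed)
    return [row for row in rows if rng.random() * 100 < p]


rows = list(range(100))
first = bernoulli_sample(rows, 50, seed=1234)
second = bernoulli_sample(rows, 50, seed=1234)
print(first == second)  # same table, same p, same seed
```

The sketch relies on the rows being visited in the same order on every run. In a distributed engine that ordering is not automatic, so a real implementation would need some other way to make the per-row trials deterministic.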

Some examples:
select * from t tablesample bernoulli ( 50 );
select * from t tablesample system ( 30 ) repeatable (1234);

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)