Posted to issues@impala.apache.org by "Alexander Behm (JIRA)" <ji...@apache.org> on 2017/05/10 17:27:04 UTC
[jira] [Created] (IMPALA-5300) Implement TABLESAMPLE
Alexander Behm created IMPALA-5300:
--------------------------------------
Summary: Implement TABLESAMPLE
Key: IMPALA-5300
URL: https://issues.apache.org/jira/browse/IMPALA-5300
Project: IMPALA
Issue Type: New Feature
Components: Frontend
Affects Versions: Impala 2.8.0
Reporter: Alexander Behm
Assignee: Alexander Behm
Priority: Critical
Implement the TABLESAMPLE clause, which can be applied to base table references in queries as well as in the COMPUTE STATS statement.
Examples:
{code}
SELECT * FROM T TABLESAMPLE 10 PERCENT
COMPUTE STATS T TABLESAMPLE 20 PERCENT
{code}
Syntax inspired by SQL Server:
https://technet.microsoft.com/en-us/library/ms189108(v=sql.105).aspx
{code}
TABLESAMPLE <number> PERCENT [REPEATABLE (<seed>)]
{code}
*Implementation details*
* The given percentage refers to the percentage of bytes in the table.
* The sampling will be coarse-grained (file level), so whole files are either included or excluded.
* Impala will randomly select files until the desired percentage of bytes has been reached.
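The selection described above can be sketched roughly as follows. This is only an illustration in Python, not the proposed frontend code: the {{DataFile}} type, the {{sample_files}} function, and the file names are hypothetical, and treating the seed as a stand-in for REPEATABLE (<seed>) is an assumption.

```python
import random
from typing import List, NamedTuple, Optional


class DataFile(NamedTuple):
    path: str        # hypothetical file name
    size_bytes: int  # file length in bytes


def sample_files(files: List[DataFile], percent: float,
                 seed: Optional[int] = None) -> List[DataFile]:
    """Randomly pick whole files until at least `percent` of the total bytes is covered."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    total_bytes = sum(f.size_bytes for f in files)
    target_bytes = total_bytes * percent / 100.0
    # Assumption: a fixed seed plays the role of REPEATABLE (<seed>).
    rng = random.Random(seed)
    shuffled = list(files)
    rng.shuffle(shuffled)
    selected: List[DataFile] = []
    selected_bytes = 0
    for f in shuffled:
        # Stop as soon as the byte target has been reached.
        if selected_bytes >= target_bytes:
            break
        selected.append(f)
        selected_bytes += f.size_bytes
    return selected


# Ten hypothetical files of 100 bytes each; a 20 percent sample picks two of them.
files = [DataFile(f"part-{i}.parq", 100) for i in range(10)]
sample = sample_files(files, percent=20, seed=42)
print(len(sample), sum(f.size_bytes for f in sample))
```

Because whole files are taken, the sample can overshoot the requested percentage when file sizes are uneven. In this sketch the overshoot is at most the size of the last file picked.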
*Accepted limitations*
* Computing stats on a coarse-grained sample necessarily means a loss of precision, with no guarantee of statistical significance.
* There is no guarantee that a sample covers all partitions.
* NDVs (numbers of distinct values) may be very inaccurate for sorted files.
* NDVs may be very inaccurate for an unfortunate selection of files.
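To illustrate why sorted files are a problem for NDVs, here is a small Python sketch with made-up data. The layouts, the {{ndv}} helper, and the equal file sizes are all assumptions for illustration. It only counts the distinct values visible in one sampled file and says nothing about how Impala would extrapolate NDVs from a sample.

```python
def ndv(values):
    """Exact number of distinct values in a list."""
    return len(set(values))


NUM_FILES = 10

# Hypothetical column: 1000 rows, 100 distinct values, 10 rows per value, sorted.
rows = sorted(i % 100 for i in range(1000))
rows_per_file = len(rows) // NUM_FILES

# Sorted layout: each file holds a contiguous run of the sorted column.
sorted_files = [rows[k * rows_per_file:(k + 1) * rows_per_file]
                for k in range(NUM_FILES)]

# Interleaved layout: rows are dealt round-robin across the files.
interleaved_files = [rows[k::NUM_FILES] for k in range(NUM_FILES)]

# With equally sized rows, any single file is a 10 percent sample by bytes.
print(ndv(rows))                  # 100 distinct values in the full column
print(ndv(sorted_files[0]))       # 10: the sorted file sees a narrow value range
print(ndv(interleaved_files[0]))  # 100: the interleaved file sees every value
```

The same 10 percent sample therefore observes very different distinct counts depending on how the data is laid out across files, and a file-level sampler cannot tell the two cases apart from the sample alone.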
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)