Posted to issues@impala.apache.org by "Alexander Behm (JIRA)" <ji...@apache.org> on 2017/05/10 17:27:04 UTC

[jira] [Created] (IMPALA-5300) Implement TABLESAMPLE

Alexander Behm created IMPALA-5300:
--------------------------------------

             Summary: Implement TABLESAMPLE
                 Key: IMPALA-5300
                 URL: https://issues.apache.org/jira/browse/IMPALA-5300
             Project: IMPALA
          Issue Type: New Feature
          Components: Frontend
    Affects Versions: Impala 2.8.0
            Reporter: Alexander Behm
            Assignee: Alexander Behm
            Priority: Critical


Implement the TABLESAMPLE clause, which can be applied to base table references in queries as well as in the COMPUTE STATS statement.

Examples:
{code}
SELECT * FROM T TABLESAMPLE 10 PERCENT
COMPUTE STATS T TABLESAMPLE 20 PERCENT
{code}

Syntax inspired by SQL Server:
https://technet.microsoft.com/en-us/library/ms189108(v=sql.105).aspx
{code}
TABLESAMPLE <number> PERCENT [REPEATABLE (<seed>)]
{code}

*Implementation details*
* The given percentage refers to the percentage of bytes in the table, not the percentage of rows.
* The sampling will be coarse-grained (file level).
* Impala will randomly select files until the desired percentage of bytes has been reached.
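
A minimal sketch of the file-level selection described above, written in Python for illustration only. The function name {{sample_files}}, the (path, size) tuple layout, and the use of a seeded shuffle to model REPEATABLE are assumptions, not the actual Impala frontend code:

```python
import random
from typing import List, Optional, Tuple

# Hypothetical representation of a table's files: (path, size in bytes).
FileDesc = Tuple[str, int]


def sample_files(files: List[FileDesc], percent: float,
                 seed: Optional[int] = None) -> List[FileDesc]:
    """Randomly pick whole files until at least `percent` of the bytes are covered."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")

    total_bytes = sum(size for _, size in files)
    target_bytes = total_bytes * percent / 100.0

    # A fixed seed gives the same file order, hence the same sample (REPEATABLE).
    rng = random.Random(seed)
    shuffled = list(files)
    rng.shuffle(shuffled)

    selected: List[FileDesc] = []
    selected_bytes = 0
    for path, size in shuffled:
        if selected_bytes >= target_bytes:
            break
        # Files are taken whole, so the sample can overshoot the target.
        selected.append((path, size))
        selected_bytes += size
    return selected
```

Because files are selected whole, the bytes actually sampled can exceed the requested percentage, most noticeably when a table consists of a few large files.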

*Accepted limitations*
* Computing stats on a coarse-grained sample necessarily means a loss of precision, with no guarantee of statistical significance.
* There is no guarantee that a sample covers all partitions.
* NDVs may be very inaccurate for sorted files.
* NDVs may be very inaccurate for an unfortunate selection of files.
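
To illustrate the sorted-file limitation, here is a toy Python simulation. The data, file count, and helper names are made up for illustration and say nothing about how Impala actually estimates NDVs. It compares the number of distinct values seen in one sampled file when the column is sorted across files with the same sample when values are spread randomly:

```python
import random
from typing import List


def split_into_files(values: List[int], num_files: int) -> List[List[int]]:
    """Cut a column into equally sized 'files' in storage order."""
    size = len(values) // num_files
    return [values[i * size:(i + 1) * size] for i in range(num_files)]


def ndv(files: List[List[int]]) -> int:
    """Count distinct values across the given files."""
    return len({v for f in files for v in f})


# Toy column: 1000 distinct values, each repeated 100 times.
values = [i % 1000 for i in range(100_000)]

rng = random.Random(0)
shuffled = list(values)
rng.shuffle(shuffled)

sorted_files = split_into_files(sorted(values), 10)
shuffled_files = split_into_files(shuffled, 10)

# Sample one of ten files (about 10 percent of the bytes).
sample_idx = rng.sample(range(10), 1)
sorted_ndv = ndv([sorted_files[i] for i in sample_idx])
shuffled_ndv = ndv([shuffled_files[i] for i in sample_idx])

print("true NDV:", ndv(sorted_files))
print("sampled NDV, sorted files:", sorted_ndv)
print("sampled NDV, shuffled files:", shuffled_ndv)
```

In this setup each sorted file holds a disjoint slice of the value range, so a single sampled file sees only a tenth of the distinct values, while the shuffled layout sees nearly all of them.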



