Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/11/01 22:54:58 UTC

[jira] [Updated] (SPARK-15867) TABLESAMPLE BUCKET semantics don't match Hive's

     [ https://issues.apache.org/jira/browse/SPARK-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-15867:
--------------------------------
    Target Version/s: 2.2.0  (was: 2.1.0)

> TABLESAMPLE BUCKET semantics don't match Hive's
> -----------------------------------------------
>
>                 Key: SPARK-15867
>                 URL: https://issues.apache.org/jira/browse/SPARK-15867
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0, 2.0.0
>            Reporter: Andrew Or
>
> {code}
> SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16)
> {code}
> In Hive, this selects the 3rd bucket out of every 16 buckets in the table. For example, if the table is clustered into 32 buckets, this samples the 3rd and the 19th buckets. (See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling)
> In Spark, however, we simply sample 3/16 of the input rows.
> We should either not support this syntax in Spark or implement it in a way that's consistent with Hive.
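
To make the mismatch concrete, here is a minimal Python sketch of the two interpretations described above. It is an illustration only, not code from Hive or Spark: the function names hive_selected_buckets and row_fraction are hypothetical, and the sketch assumes the table's bucket count is a multiple of y (the 32-bucket case from the description).

```python
def hive_selected_buckets(x, y, num_buckets):
    """Return the 1-based table buckets read for BUCKET x OUT OF y (Hive-style).

    Hypothetical helper. Assumes num_buckets is a multiple of y, as in the
    32-bucket example in the issue description.
    """
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if num_buckets % y != 0:
        raise ValueError("this sketch only handles num_buckets divisible by y")
    # Every y-th bucket, starting at bucket x.
    return list(range(x, num_buckets + 1, y))


def row_fraction(x, y):
    """Fraction of input rows sampled under the behavior reported for Spark."""
    return x / y


hive_buckets = hive_selected_buckets(3, 16, 32)
print(hive_buckets)             # [3, 19]
print(len(hive_buckets) / 32)   # 0.0625 (share of buckets read, Hive-style)
print(row_fraction(3, 16))      # 0.1875 (share of rows sampled, as reported for Spark)
```

Under these assumptions the Hive-style reading touches 2 of 32 buckets (1/16 of them), while the row-fraction reading samples 3/16 of the rows, so the two are not interchangeable even as rough approximations.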



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)