Posted to issues@spark.apache.org by "Yeachan Park (Jira)" <ji...@apache.org> on 2022/12/20 15:14:00 UTC

[jira] [Commented] (SPARK-26209) Allow for dataframe bucketization without Hive

    [ https://issues.apache.org/jira/browse/SPARK-26209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649824#comment-17649824 ] 

Yeachan Park commented on SPARK-26209:
--------------------------------------

Is there an update on this? We'd also be interested in this feature. AFAIK the file names already contain the bucket number. For things like bucket pruning, I'd expect a configuration option could let Spark take advantage of this by recomputing the same hash function it used to bucket the data in the first place, without needing a metastore.
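A rough sketch of the idea, assuming a local SparkSession and a hypothetical table name "bucketed_demo" (bucketBy, hash and pmod are real Spark APIs; the data and names here are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, hash, lit, pmod}

    val spark = SparkSession.builder().appName("bucket-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (42, "c")).toDF("id", "value")

    // Today, writing bucketed output goes through saveAsTable, i.e. a metastore:
    df.write.bucketBy(8, "id").sortBy("id").format("parquet").saveAsTable("bucketed_demo")

    // The bucket a row lands in is pmod(murmur3_hash(bucket columns), numBuckets),
    // and Spark's SQL hash() is the same Murmur3 hash, so a reader could in
    // principle recompute the bucket id from the data alone, without a metastore:
    df.select(col("id"), pmod(hash(col("id")), lit(8)).alias("bucket_id")).show()

The bucket number also appears in the file names themselves, e.g. part-00000-<uuid>_00003.c000.snappy.parquet, where _00003 is the bucket id -- presumably what bucket pruning could key off.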

> Allow for dataframe bucketization without Hive
> ----------------------------------------------
>
>                 Key: SPARK-26209
>                 URL: https://issues.apache.org/jira/browse/SPARK-26209
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output, Java API, SQL
>    Affects Versions: 3.1.0
>            Reporter: Walt Elder
>            Priority: Minor
>
> As a DataFrame author, I can elect to bucketize my output without involving Hive or the Hive Metastore (HMS), so that my Hive-less environment can benefit from this query-optimization technique.
>  
> https://issues.apache.org/jira/browse/SPARK-19256?focusedCommentId=16345397&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16345397 identifies this as a shortcoming of the umbrella feature provided via SPARK-19256.
>  
> In short, relying on Hive to store metadata *precludes* environments which don't have/use Hive from making use of bucketization features (a minimal illustration of the failure mode follows below).
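A minimal illustration of that limitation, reusing the df from the sketch above (the output path is hypothetical, and the exact error text varies slightly by Spark version):

    // Writing bucketed output straight to a path is rejected; bucketBy only
    // works together with saveAsTable, which in turn needs a metastore:
    df.write.bucketBy(4, "id").parquet("/tmp/bucketed_out")
    // => org.apache.spark.sql.AnalysisException:
    //    'save' does not support bucketBy right now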



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org