Posted to issues@spark.apache.org by "Felix Kizhakkel Jose (Jira)" <ji...@apache.org> on 2020/03/16 18:14:00 UTC

[jira] [Comment Edited] (SPARK-31162) Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing

    [ https://issues.apache.org/jira/browse/SPARK-31162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060408#comment-17060408 ] 

Felix Kizhakkel Jose edited comment on SPARK-31162 at 3/16/20, 6:13 PM:
------------------------------------------------------------------------

I have seen the following in the API documentation:

/**
 * Buckets the output by the given columns. If specified, the output is laid out on the file
 * system similar to Hive's bucketing scheme.
 *
 * This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark
 * 2.1.0.
 *
 * @since 2.0
 */
@scala.annotation.varargs
def bucketBy(numBuckets: Int, colName: String, colNames: String*): DataFrameWriter[T] = {
  this.numBuckets = Option(numBuckets)
  this.bucketColumnNames = Option(colName +: colNames)
  this
}

How can we specify that the output should be laid out similar to Hive's bucketing scheme?
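
For reference, a minimal sketch of the basic bucketBy call (the input path, column name, and table name below are hypothetical). Note that bucketBy must be paired with saveAsTable; writing bucketed output with a plain save() to a path is not supported:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketBy-sketch").getOrCreate()

// Hypothetical input path and column name, for illustration only.
val df = spark.read.parquet("/path/to/input")

df.write
  .bucketBy(8, "user_id")          // 8 buckets, hashed on user_id
  .sortBy("user_id")               // optional: sort rows within each bucket
  .format("parquet")
  .saveAsTable("bucketed_events")  // hypothetical table name

Nothing on this API selects the hash function; the layout always uses Spark's internal hash, which is what this issue asks to make configurable.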

 


> Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-31162
>                 URL: https://issues.apache.org/jira/browse/SPARK-31162
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core, SQL
>    Affects Versions: 2.4.5
>            Reporter: Felix Kizhakkel Jose
>            Priority: Major
>
> I couldn't find a configuration parameter to choose Hive hashing instead of Spark's default Murmur3 hash when performing Spark's bucketBy operation. Following the discussion with [~maropu] and [~hyukjin.kwon], they suggested opening a new JIRA.
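
For context on the current behavior: Spark derives the bucket id from its Murmur3-based hash, roughly pmod(hash(bucket columns), numBuckets), while Hive uses a different hash function, so bucketed files written by the two systems are not interchangeable. A minimal sketch that inspects the Murmur3 bucket id with the built-in hash function (the column name and bucket count are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val spark = SparkSession.builder().appName("bucket-id-sketch").getOrCreate()

// Toy data; user_id stands in for a hypothetical bucketing column.
val df = spark.range(10).withColumnRenamed("id", "user_id")

// Spark's built-in hash() is Murmur3; for bucketBy(n, "user_id") the
// assigned bucket id is (roughly) pmod(hash(user_id), n). A Hive-hash
// option would have to swap the hash expression used at this point.
val numBuckets = 8
df.select(
  col("user_id"),
  pmod(hash(col("user_id")), lit(numBuckets)).as("spark_bucket_id")
).show()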


