Posted to issues@spark.apache.org by "Felix Kizhakkel Jose (Jira)" <ji...@apache.org> on 2020/03/16 18:14:00 UTC
[jira] [Commented] (SPARK-31162) Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
[ https://issues.apache.org/jira/browse/SPARK-31162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060408#comment-17060408 ]
Felix Kizhakkel Jose commented on SPARK-31162:
----------------------------------------------
I have seen following in the API documentation:
/**
 * Buckets the output by the given columns. If specified, the output is laid out on the file
 * system similar to Hive's bucketing scheme.
 *
 * This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark
 * 2.1.0.
 *
 * @since 2.0
 */
@scala.annotation.varargs
def bucketBy(numBuckets: Int, colName: String, colNames: String*): DataFrameWriter[T] = {
  this.numBuckets = Option(numBuckets)
  this.bucketColumnNames = Option(colName +: colNames)
  this
}
How can we specify that the output should follow Hive's bucketing scheme (i.e. Hive's hash function)?
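For context, a typical bucketBy call looks like the sketch below (the table and column names are hypothetical, and a local Spark session is assumed). Note that as of the versions discussed here, bucketBy always uses Spark's Murmur3 hash; there is no documented parameter on this API to select Hive's hash function, which is what this ticket asks for.

```scala
import org.apache.spark.sql.SparkSession

object BucketByExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; saveAsTable with buckets needs a catalog.
    val spark = SparkSession.builder()
      .appName("bucketBy-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("user_id", "value")

    // Hypothetical table name. The layout is Hive-like, but the bucket
    // assignment uses Spark's Murmur3 hash, not Hive's hash, so the
    // resulting files are not readable as Hive bucketed tables.
    df.write
      .bucketBy(4, "user_id")
      .sortBy("user_id")
      .format("parquet")
      .saveAsTable("users_bucketed")

    spark.stop()
  }
}
```

This illustrates the API surface only; whether the files can interoperate with Hive depends on the hash function, which is the subject of this issue.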
> Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
> -----------------------------------------------------------------------------
>
> Key: SPARK-31162
> URL: https://issues.apache.org/jira/browse/SPARK-31162
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core, SQL
> Affects Versions: 2.4.5
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
> I couldn't find a configuration parameter to choose Hive hashing instead of Spark's default Murmur hash when performing Spark's BucketBy operation. Following a discussion with [~maropu] and [~hyukjin.kwon], it was suggested to open a new JIRA.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org