Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2021/04/06 17:29:00 UTC

[jira] [Commented] (SPARK-34910) JDBC - Add an option for different stride orders

    [ https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315740#comment-17315740 ] 

Apache Spark commented on SPARK-34910:
--------------------------------------

User 'hanover-fiste' has created a pull request for this issue:
https://github.com/apache/spark/pull/32068

> JDBC - Add an option for different stride orders
> ------------------------------------------------
>
>                 Key: SPARK-34910
>                 URL: https://issues.apache.org/jira/browse/SPARK-34910
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Jason Yarbrough
>            Priority: Trivial
>
> Currently, the JDBCRelation columnPartition function orders the strides in ascending order, starting from the lower bound and working its way towards the upper bound.
> I'm proposing leaving that as the default but adding an option (such as strideOrder) to JDBCOptions. Since it would default to the current behavior, existing code would keep working as expected. However, users whose data is skewed towards the upper bound might appreciate being able to have the strides in descending order, filling the first partition with the last stride and so on. Likewise, users with nondeterministic data skew or sporadic data density might benefit from a random ordering of the strides.
> I have the code written to implement this, and it establishes a pattern for adding further ordering algorithms later (such as counting the rows in each stride, ranking the strides, and ordering them from most dense to least). The two options I have coded so far are 'descending' and 'random.'
> The original idea was to create something closer to Spark's hash partitioner, but for JDBC and pushed down to the database engine for efficiency. However, that would require adding hashing algorithms for each dialect, and the cost of running those algorithms might outweigh the benefit. The approach proposed in this ticket avoids those complexities while still providing some of the benefit (in the case of random ordering).
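
For readers unfamiliar with the mechanism under discussion: JDBCRelation.columnPartition splits the range between lowerBound and upperBound into numPartitions strides and emits one WHERE-clause predicate per partition. The sketch below is a simplified, self-contained illustration of that idea plus the reordering the ticket proposes. It is not Spark's actual implementation (the real code also handles NULLs, date/timestamp columns, and bound clamping), and the strideOrder values simply mirror the ones named in the ticket.

    import scala.util.Random

    // Simplified sketch of JDBC stride predicates and the proposed reordering.
    // NOT Spark's real columnPartition; for illustration only.
    object StrideOrderSketch {

      // Build one WHERE-clause predicate per partition over [lower, upper).
      def buildPredicates(column: String, lower: Long, upper: Long,
                          numPartitions: Int): Seq[String] = {
        val stride = (upper - lower) / numPartitions
        (0 until numPartitions).map { i =>
          val lo = lower + i * stride
          val hi = lo + stride
          if (i == 0) s"$column < $hi"                       // first stride open below
          else if (i == numPartitions - 1) s"$column >= $lo" // last stride open above
          else s"$column >= $lo AND $column < $hi"
        }
      }

      // Reorder strides per the proposed option; "ascending" is today's behavior.
      def orderStrides(predicates: Seq[String], strideOrder: String): Seq[String] =
        strideOrder match {
          case "descending" => predicates.reverse         // skew near upper bound fills partition 0
          case "random"     => Random.shuffle(predicates) // sporadic/unknown skew
          case _            => predicates                 // default: current ascending order
        }

      def main(args: Array[String]): Unit = {
        val predicates = buildPredicates("id", lower = 0L, upper = 1000L, numPartitions = 4)
        orderStrides(predicates, "descending").zipWithIndex.foreach {
          case (p, i) => println(s"partition $i: WHERE $p")
        }
      }
    }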
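If the option were adopted as described, reader-level usage might look like the following. The option name "strideOrder" is the one suggested in the ticket but does not exist in any released Spark version, and the connection details are placeholders; the remaining options are Spark's standard JDBC partitioning options.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("stride-order-demo").getOrCreate()

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales") // placeholder connection details
      .option("dbtable", "orders")
      .option("partitionColumn", "order_id")
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")
      .option("strideOrder", "descending") // hypothetical: proposed in SPARK-34910, not released
      .load()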



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org