Posted to issues@spark.apache.org by "Takeshi Yamamuro (Jira)" <ji...@apache.org> on 2021/03/24 01:31:00 UTC

[jira] [Comment Edited] (SPARK-34844) JDBCRelation columnPartition function includes the first stride in the lower partition

    [ https://issues.apache.org/jira/browse/SPARK-34844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307496#comment-17307496 ] 

Takeshi Yamamuro edited comment on SPARK-34844 at 3/24/21, 1:30 AM:
--------------------------------------------------------------------

> It depends on how much data skew you have.

Yea, I agree with that. My point is that whether the proposed approach works well also depends on the data distribution.

> However, my question would be why skip using the lower bound anyway? It would make more sense to use what the user supplied.

IIUC we don't have a strong reason for that (the initial implementation has been in use for a long time: [https://github.com/apache/spark/commit/8f471a66db0571a76a21c0d93312197fee16174a]). I think there is no single best solution if we don't know the data distribution in advance.

Any idea about how a user supplies these boundaries for partitions?


> JDBCRelation columnPartition function includes the first stride in the lower partition
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-34844
>                 URL: https://issues.apache.org/jira/browse/SPARK-34844
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Jason Yarbrough
>            Priority: Minor
>
> Currently, columnPartition in JDBCRelation contains logic that adds the first stride into the lower partition. Because of this, the lower bound isn't used as the ceiling for the lower partition.
> For example, say we have data 0-10, 10 partitions, and the lowerBound is set to 1. The lower/first partition should contain anything < 1. However, in the current implementation, it would include anything < 2.
> A possible easy fix would be changing the following code on line 132:
> currentValue += stride
> To:
> if (i != 0) currentValue += stride
> Or include currentValue += stride within the if statement on line 131... although this creates a pretty bad looking side-effect.
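
The difference between the current behaviour and the proposed guard can be sketched as follows. This is a simplified stand-in modelled on the loop described in the issue, not the actual JDBCRelation source: the object name, the `whereClauses` helper, and the `skipFirstStride` flag are hypothetical, and `upperBound = 10` in the usage below is an assumption, since the issue only gives the lowerBound.

```scala
object ColumnPartitionSketch {
  // Hypothetical helper: builds one WHERE clause per partition.
  // skipFirstStride = false mimics the behaviour described in the issue;
  // skipFirstStride = true applies the proposed `if (i != 0)` guard.
  def whereClauses(
      column: String,
      lowerBound: Long,
      upperBound: Long,
      numPartitions: Int,
      skipFirstStride: Boolean): Seq[String] = {
    require(numPartitions > 1, "this sketch only covers the multi-partition case")
    // Assumed stride formula (integer division on each bound).
    val stride = upperBound / numPartitions - lowerBound / numPartitions
    var currentValue = lowerBound
    (0 until numPartitions).map { i =>
      // The first partition has no lower limit.
      val lBound = if (i != 0) Some(s"$column >= $currentValue") else None
      // Current behaviour always advances; the proposed fix skips i == 0.
      if (!skipFirstStride || i != 0) currentValue += stride
      // The last partition has no upper limit.
      val uBound = if (i != numPartitions - 1) Some(s"$column < $currentValue") else None
      (lBound, uBound) match {
        case (Some(l), Some(u)) => s"$l AND $u"
        case (None, Some(u))    => s"$u or $column is null"
        case (Some(l), None)    => l
        case (None, None)       => "" // unreachable because numPartitions > 1
      }
    }
  }
}
```

With the example from the description (10 partitions, lowerBound 1, and an assumed upperBound of 10), `ColumnPartitionSketch.whereClauses("id", 1L, 10L, 10, skipFirstStride = false).head` yields `id < 2 or id is null`, while passing `skipFirstStride = true` yields `id < 1 or id is null`. Note that under the guard every later boundary also shifts down by one stride, so the last partition starts at `id >= 9` instead of `id >= 10`.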


