Posted to issues@spark.apache.org by "Takeshi Yamamuro (Jira)" <ji...@apache.org> on 2020/08/19 03:39:00 UTC
[jira] [Comment Edited] (SPARK-32632) Bad partitioning in spark jdbc method with parameter lowerBound and upperBound

    [ https://issues.apache.org/jira/browse/SPARK-32632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180236#comment-17180236 ] 

Takeshi Yamamuro edited comment on SPARK-32632 at 8/19/20, 3:38 AM:
--------------------------------------------------------------------

Hm, but the documentation clearly describes this, and it is the expected behaviour:
{code:java}
Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.
{code}
[https://spark.apache.org/docs/3.0.0/sql-data-sources-jdbc.html]
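
To illustrate the documented behaviour, here is a minimal sketch in plain Scala (no Spark dependency) of how a stride derived from lowerBound and upperBound can turn into per-partition WHERE clauses. It is a simplified model, not Spark's actual source: the object name StridePartitionSketch and the helper whereClauses are hypothetical, and it ignores details such as Spark lowering numPartitions when the bound range is very small.

```scala
// Simplified model of stride-based JDBC partitioning (hypothetical names,
// not Spark's actual source).
object StridePartitionSketch {

  /** Builds one WHERE clause per partition for an integral partition column. */
  def whereClauses(
      column: String,
      lowerBound: Long,
      upperBound: Long,
      numPartitions: Int): Seq[String] = {
    require(numPartitions >= 2, "this sketch only covers two or more partitions")
    require(lowerBound < upperBound, "lowerBound must be smaller than upperBound")

    // The bounds are used only to compute the stride between partition edges.
    val stride = upperBound / numPartitions - lowerBound / numPartitions

    (0 until numPartitions).map { i =>
      val lo = lowerBound + i * stride
      val hi = lo + stride
      if (i == 0) {
        // First partition: no lower limit, and NULL keys are collected here.
        s"$column < $hi or $column is null"
      } else if (i == numPartitions - 1) {
        // Last partition: no upper limit.
        s"$column >= $lo"
      } else {
        s"$column >= $lo AND $column < $hi"
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Same arguments as in the report: column id, lowerBound 2, upperBound 5, 3 partitions.
    whereClauses("`id`", 2L, 5L, 3).foreach(println)
  }
}
```

With the arguments from the report, the stride is 1 and the sketch yields the same three clauses as the JDBCRelation log line in the issue description. The first and last partitions are deliberately open-ended so that no row is dropped. If only rows inside the bounds are wanted, they have to be filtered separately, for example with a where on the resulting DataFrame.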


was (Author: maropu):
Hm, but the documentation clearly describes this behaviour.
{code}
Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.
{code}
https://spark.apache.org/docs/3.0.0/sql-data-sources-jdbc.html


> Bad partitioning in spark jdbc method with parameter lowerBound and upperBound
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-32632
>                 URL: https://issues.apache.org/jira/browse/SPARK-32632
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Liu Dinghua
>            Priority: Major
>
> When I use the jdbc method
> {code:java}
> def jdbc( url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties)
> {code}
>  
>   I am confused by the partitions this method generates: the rows of the first partition are not limited by lowerBound, and the rows of the last partition are not limited by upperBound.
>   
>  For example, I use the method as follows:
>   
> {code:java}
> val data = spark.read.jdbc(url, table, "id", 2, 5, 3, buildProperties()).selectExpr("id", "appkey", "funnel_name")
> data.show(100, false)  
> {code}
>  
> The resulting partition info is:
>  20/08/05 16:58:59 INFO JDBCRelation: Number of partitions: 3, WHERE clauses of these partitions: `id` < 3 or `id` is null, `id` >= 3 AND `id` < 4, `id` >= 4
> The returned data is:
> ||id||appkey||funnel_name||
> |0|yanshi|test001|
> |1|yanshi|test002|
> |2|yanshi|test003|
> |3|xingkong|test_funnel|
> |4|xingkong|test_funnel2|
> |5|xingkong|test_funnel3|
> |6|donews|test_funnel4|
> |7|donews|test_funnel|
> |8|donews|test_funnel2|
> |9|dami|test_funnel3|
> |13|dami|test_funnel4|
> |15|xiaoai|test_funnel6|
>  
> I would expect the clause of the first partition to be " `id` >= 2 and `id` < 3 " because the lowerBound is 2, and the clause of the last partition to be " `id` >= 4 and `id` < 5 " because the upperBound is 5, but that is not what happens.
>  
>  
>   


