Posted to reviews@spark.apache.org by djvulee <gi...@git.apache.org> on 2017/01/22 05:00:33 UTC

[GitHub] spark pull request #16671: [SparkSQL] a better balance partition method for ...

GitHub user djvulee opened a pull request:

    https://github.com/apache/spark/pull/16671

    [SparkSQL] a better balance partition method for jdbc API

    ## What changes were proposed in this pull request?
    
    The partition method in `jdbc` uses an equal
    step over the column range, which can lead to skew between partitions. The new method
    introduces a balanced partition method based on the element counts when splitting,
    which can relieve the skew problem at a small extra query cost.
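    To illustrate the problem, here is a small sketch (illustrative names, not Spark's actual code) of how equal-step partitioning splits a column range, and how a sparse value range makes it skew:

```python
# Illustrative sketch (not Spark's actual code) of equal-step partitioning:
# the column range [lower, upper) is cut into num_partitions strides of
# equal width, regardless of how the values are distributed.

def equal_step_bounds(lower, upper, num_partitions):
    """Return (lo, hi) pairs; partition i covers [lo, hi)."""
    stride = (upper - lower) // num_partitions
    bounds, current = [], lower
    for i in range(num_partitions):
        hi = upper if i == num_partitions - 1 else current + stride
        bounds.append((current, hi))
        current = hi
    return bounds

# A sparse id space: 1000 low ids plus one huge outlier.
ids = list(range(1000)) + [1_000_000]
counts = [sum(lo <= x < hi for x in ids)
          for lo, hi in equal_step_bounds(min(ids), max(ids) + 1, 4)]
print(counts)  # → [1000, 0, 0, 1]: almost everything in partition 0
```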
    
    
    ## How was this patch tested?
    UnitTest and real data.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/djvulee/spark balancePartition

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16671.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16671
    
----
commit 88cdf294aa579f65b8272870d762548cf54349ce
Author: DjvuLee <li...@bytedance.com>
Date:   2017-01-20T09:53:57Z

    [SparkSQL] a better balance partition method for jdbc API
    
    The partition method in jdbc, when a column is specified, uses an equal
    step; this can lead to skew between partitions. The new method
    introduces a partition method based on the element counts when splitting,
    which keeps the elements balanced between partitions.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

Posted by djvulee <gi...@git.apache.org>.
Github user djvulee commented on the issue:

    https://github.com/apache/spark/pull/16671
  
    Yes, I agree with you: a sampling basis is the right choice, but it is not possible to achieve this through the `jdbc` API.




[GitHub] spark pull request #16671: [SPARK-19327][SparkSQL] a better balance partitio...

Posted by djvulee <gi...@git.apache.org>.
Github user djvulee closed the pull request at:

    https://github.com/apache/spark/pull/16671




[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

Posted by djvulee <gi...@git.apache.org>.
Github user djvulee commented on the issue:

    https://github.com/apache/spark/pull/16671
  
    Yes. I will leave this PR open for a few days to see if others are interested in it, and then close it.




[GitHub] spark issue #16671: [SparkSQL] a better balance partition method for jdbc AP...

Posted by djvulee <gi...@git.apache.org>.
Github user djvulee commented on the issue:

    https://github.com/apache/spark/pull/16671
  
    @gatorsmile can you take a look?




[GitHub] spark issue #16671: [SparkSQL] a better balance partition method for jdbc AP...

Posted by djvulee <gi...@git.apache.org>.
Github user djvulee commented on the issue:

    https://github.com/apache/spark/pull/16671
  
    Table 2 with about 5M rows, 200 partitions by SparkSQL.
    
    (The table uses MySQL sharding, and every partition will return 10K rows at most.)
    
    
    Old partition result (elements in each partition):
    
    >10000,49,54,53,60,59,48,61,52,57,60,69,58,57,50,52,51,66,58,45,59,52,61,56,67,51,45,49,70,49,58,59,61,53,50,53,47,50,46,53,55,53,62,55,48,58,52,62,62,37,65,59,58,55,61,59,46,53,49,49,61,72,60,46,50,51,45,47,55,63,64,63,55,47,65,57,60,60,51,45,48,77,58,57,59,39,50,62,55,57,49,63,51,38,49,66,62,58,53,54,50,54,52,69,51,49,61,60,64,49,52,50,54,58,48,51,50,49,41,68,54,45,65,62,44,52,64,58,47,51,65,47,37,42,39,44,51,65,56,54,69,51,61,63,51,52,47,55,58,66,47,54,53,53,60,66,66,68,64,66,55,58,64,55,50,57,46,56,39,60,57,63,40,51,56,58,44,46,46,44,42,52,52,44,53,46,55,57,68,57,62,48,47,52,59,58,49,44,52,47
    
    (Most of the data is in partition 0, but each partition returns 10K at most because of our sharding limit.)
    
    
    New partition result (elements in each partition):
    
    >2083,10000,10000,6932,9799,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,8150,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,7,9,70,2,10000,10000,10000,655,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,
 10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,40,76,145,38,86,176,369,696,1338,2776,5381
    
    
    count cost time: 0.8ms




[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

Posted by djvulee <gi...@git.apache.org>.
Github user djvulee commented on the issue:

    https://github.com/apache/spark/pull/16671
  
    Yes, this solution is not suitable for large tables, but I cannot find a better one; this is the best optimisation I can find.
    So just add it as an option, let users know what they are doing, and require it to be explicitly enabled.
    
    From my experience, the original equal-step method can lead to problems on real data. This conclusion can be drawn from the spark-user mailing list and our real scenarios. For example, users will use the `id` column to partition the table, because `id` is unique and indexed, but after many inserts and deletes the `id` range becomes very large, and the data becomes skewed when partitioned by `id`.
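    The count-based idea can be sketched as follows (a hypothetical helper, not the patch's actual implementation; in practice the counts would come from cheap aggregate queries, not from loading the rows):

```python
# Hypothetical helper, not the patch's actual implementation: choose split
# points so each range holds about the same number of rows, instead of
# cutting the value range into equal-width steps.

def balanced_bounds(values, num_partitions):
    ordered = sorted(values)
    n = len(ordered)
    bounds = []
    for i in range(num_partitions):
        lo = ordered[i * n // num_partitions]
        hi_idx = (i + 1) * n // num_partitions
        # max+1 as the final upper bound so every value is covered
        hi = ordered[hi_idx] if hi_idx < n else ordered[-1] + 1
        bounds.append((lo, hi))
    return bounds

# A sparse id space that equal steps would skew badly.
ids = list(range(1000)) + [1_000_000]
counts = [sum(lo <= x < hi for x in ids)
          for lo, hi in balanced_bounds(ids, 4)]
print(counts)  # → [250, 250, 250, 251]: roughly equal partitions
```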
    
    Very large tables are not so common, and if a large table uses sharding, this method may be acceptable.
    
    My personal opinion is: 
    >Giving users another option may be valuable, as long as we do not enable it by default.




[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16671
  
    It is still achievable if the SQL interface of the underlying database supports it. Currently, if the performance really matters, JDBC is not a good interface. Thus, most DBMS vendors provide their own Spark connectors. 




[GitHub] spark issue #16671: [SparkSQL] a better balance partition method for jdbc AP...

Posted by djvulee <gi...@git.apache.org>.
Github user djvulee commented on the issue:

    https://github.com/apache/spark/pull/16671
  
    Here is the real data test result:
    Table with 1.2 million rows, 50 partitions by SparkSQL.
    
    
    
    Old partition result (elements in each partition):
    
    >100061,100064,100059,100066,100065,100065,100066,100066,100063,100061,100066,100065,70747,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    
    
    New partition result (elements in each partition):
    >19543,19544,39083,39088,19544,19545,39085,19544,19542,19543,19545,39086,39087,19544,19545,39088,19544,19544,39088,19543,19545,39088,19544,19545,39088,19544,19544,39088,19544,19545,19543,19544,39086,19543,19545,39086,39086,19544,19545,39088,19544,19545,39088,19544,19544,39088,19544,19545,20701,0
    
    
    count cost time: 1.27s




[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

Posted by djvulee <gi...@git.apache.org>.
Github user djvulee commented on the issue:

    https://github.com/apache/spark/pull/16671
  
    Using the *predicates* parameter to split the table seems reasonable, but in my personal opinion it just pushes work that should be done by Spark onto users. Users need to know how to split the table uniformly in the first place, so they may need extra `count(*)` queries to explore the distribution of the table.
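    For reference, a minimal sketch of what the *predicates* route asks users to do by hand: compute boundaries somehow, then render them as WHERE-clause strings for the predicate-based `jdbc()` overload (the boundary values here are made up for illustration):

```python
# Sketch of the manual work the predicates route requires: given interior
# cut points (computed by the user, e.g. via count(*) probes), render one
# WHERE-clause string per partition. Values are made up for illustration.

def boundaries_to_predicates(column, splits):
    """splits are interior cut points; len(splits) + 1 predicates result."""
    preds = []
    prev = None
    for s in splits:
        if prev is None:
            preds.append(f"{column} < {s}")
        else:
            preds.append(f"{column} >= {prev} AND {column} < {s}")
        prev = s
    preds.append(f"{column} >= {prev}")  # open-ended last range
    return preds

predicates = boundaries_to_predicates("id", [250, 500])
print(predicates)
# → ['id < 250', 'id >= 250 AND id < 500', 'id >= 500']
```

    These strings would then be passed as the `predicates` argument of `jdbc()`, one partition per predicate.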




[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

Posted by djvulee <gi...@git.apache.org>.
Github user djvulee commented on the issue:

    https://github.com/apache/spark/pull/16671
  
    @HyukjinKwon One assumption behind this design is that the specified column has an index in most real scenarios, so the table scan cost is not very high.
    
    What I observed is that most large tables use sharding, so the count cost is acceptable; this is the reason
    why we spent less time on the 5M-row table than on the 1M-row table. If we use `repartition`, there is a bottleneck when loading data from the DB, plus the high cost of `repartition` itself.
    
    Anyway, this solution is indeed expensive and not a good one; maybe the best way is to use the Spark connectors provided by the DBMS vendors, as @gatorsmile suggested.




[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16671
  
    So far, the best workaround is the predicate-based JDBC API; otherwise, as I mentioned above, we need to use sampling to find the boundary of each block.
    
    > In one embodiment, a user may specify a block size, via an interface. Blocks may be generated at the time of table partitioning. For example, according to a sampling technique described below, a user may select a particular block size and then the utility can determine the average number of table rows per block based on the number of storage bytes per row. Block-by boundary values for that range of rows of that block are determined based on the selected amount of rows, and provided in a query statement generated to obtain the statistical value for the block. That is, select rows from each table may be sampled or range-based. The select rows (or columns) are aggregated to form one "block" from the database table. The "block" may include the whole table, but is typically select rows of the whole table.
    
    A few years ago, I implemented sampling-based logical table partitioning. See the link: https://www.google.com/patents/US20160275150. It works pretty well.
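    A minimal sketch of that sampling idea (purely illustrative, not the patent's or Spark's actual code): draw a sample of the partition column and use its quantiles as block boundaries, so no full scan is needed:

```python
# Purely illustrative sketch of sampling-based boundaries: sort a sample of
# the partition column and take its i/num_partitions quantiles as interior
# cut points, avoiding a full table scan.
import random

def sampled_split_points(sample, num_partitions):
    ordered = sorted(sample)
    n = len(ordered)
    # interior cut points at the sample quantiles
    return [ordered[i * n // num_partitions] for i in range(1, num_partitions)]

random.seed(0)  # deterministic demo data
population = [random.randint(0, 10_000) for _ in range(100_000)]
sample = random.sample(population, 1_000)
splits = sampled_split_points(sample, 4)
print(splits)  # roughly the 25th/50th/75th percentiles of the data
```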





[GitHub] spark issue #16671: [SparkSQL] a better balance partition method for jdbc AP...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16671
  
    Can one of the admins verify this patch?




[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16671
  
    FWIW, I am negative on this approach too. It does not look like a good solution to require full table scans to resolve skew between partitions.
    
    As said, it is not good for a large table. Then, if the data is expected to be not that large and we _should_ resolve the skew, why don't we just repartition?




[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16671
  
    The connectors from some DBMS vendors use the UNLOAD utility, which performs much better, and build the RDD inside the connector.
    
    Normally, JDBC is not a good option for fetching and writing large tables.

