Posted to reviews@spark.apache.org by djvulee <gi...@git.apache.org> on 2017/01/22 06:03:38 UTC

[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

Github user djvulee commented on the issue:

    https://github.com/apache/spark/pull/16671
  
    Yes, this solution is not suitable for very large tables, but I cannot find a better one; it is the best optimisation I could come up with.
    So let's just add it as an option that users must enable explicitly, so they know what they are doing.
    
    From my experience, the original equal-step method can cause problems on real data. This conclusion comes from the spark-user mailing list and from our own production scenario. For example, users partition the table by `id` because `id` is unique and indexed, but after many inserts and deletes the `id` range becomes very large and the data ends up skewed across the `id`-based partitions (see the sketch below).
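    For illustration, here is a minimal sketch of the existing equal-step JDBC partitioning described above (the connection URL, the table name `orders`, and the bounds are hypothetical, not taken from this PR). Spark splits `[lowerBound, upperBound)` into `numPartitions` equally sized `id` ranges, so if deletes have left the `id` space sparse, most partitions read almost nothing while a few read the bulk of the table:
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    object JdbcEqualStepSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("jdbc-equal-step-partitioning")
          .getOrCreate()
    
        // Equal-step range partitioning: Spark derives one WHERE clause per
        // partition, roughly "id < 6250000", "id >= 6250000 AND id < 12500000", ...
        // If live rows cluster in a narrow id band after many deletes, those
        // ranges are heavily skewed. (Assumes a JDBC driver on the classpath;
        // all connection details below are placeholders.)
        val df = spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://db-host:3306/mydb") // hypothetical URL
          .option("dbtable", "orders")                     // hypothetical table
          .option("partitionColumn", "id")                 // unique, indexed column
          .option("lowerBound", "1")
          .option("upperBound", "100000000")               // id range inflated by deletes
          .option("numPartitions", "16")
          .load()
    
        println(s"partitions: ${df.rdd.getNumPartitions}")
        spark.stop()
      }
    }
    ```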
    
    Very large tables are not that common, and if a large table is sharded, this method may still be acceptable.
    
    My personal opinion is: 
    >Giving users another choice may be valuable, as long as we do not enable it by default.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org