Posted to issues@spark.apache.org by "Shay Elbaz (Jira)" <ji...@apache.org> on 2019/12/31 19:37:00 UTC

[jira] [Created] (SPARK-30399) Bucketing is not compatible with partitioning in practice

Shay Elbaz created SPARK-30399:
----------------------------------

             Summary: Bucketing is not compatible with partitioning in practice
                 Key: SPARK-30399
                 URL: https://issues.apache.org/jira/browse/SPARK-30399
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
         Environment: HDP 2.7
            Reporter: Shay Elbaz


When using a Spark bucketed table, Spark uses as many partitions as the number of buckets for the map-side join (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static" tables, but is quite disastrous for _time-partitioned_ tables. In our use case, a daily-partitioned key-value table grows by 100GB every day, so after 100 days there are 10TB of data we want to join with. In this scenario we need thousands of buckets if we want every task to successfully *read and sort* all of its data in a map-side join. But then every daily increment would emit thousands of small files, leading to other big issues.
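For reference, here is a minimal sketch of the setup described above (the table and column names - "kv_events", "dt", "key" - and the staging path are hypothetical, not from our actual job):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// One daily increment (~100GB in our use case), tagged with its date partition.
val dailyIncrementDf = spark.read.parquet("/staging/2019-12-31")
  .withColumn("dt", lit("2019-12-31"))

dailyIncrementDf.write
  .mode("append")
  .partitionBy("dt")        // daily time partition
  .bucketBy(1000, "key")    // 1000 buckets are written inside *each* date partition
  .sortBy("key")
  .format("parquet")
  .saveAsTable("kv_events") // so every daily append adds ~1000 new files
{code}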

In practice, and in the hope of some hidden optimization, we set the number of buckets to 1000 and backfilled such a table with 10TB. When we tried to join it with the smallest input, every executor was killed by YARN for over-allocating memory in the sorting phase. Even without such failures, it would take every executor an unreasonable amount of time to locally sort all of its data.
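For completeness, a sketch of the failing join, reusing the hypothetical names from the previous snippet (the broadcast threshold is lowered only to make the sort-merge plan explicit):

{code:scala}
// The bucketed scan creates one task per bucket, and each task reads that
// bucket's files from ALL date partitions - roughly 10GB per task for a
// 10TB table with 1000 buckets - and then sorts them locally.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

val big    = spark.table("kv_events")   // 10TB, bucketed by "key"
val small  = spark.table("small_input") // hypothetical small lookup input
val joined = big.join(small, Seq("key"))

joined.write.parquet("/output/joined")
{code}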

A question on SO has remained unanswered for a while, so I thought I would ask here - is it by design that buckets cannot be used with time-partitioned tables, or am I doing something wrong?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org