Posted to commits@hudi.apache.org by "Alexey Kudinkin (Jira)" <ji...@apache.org> on 2023/02/07 02:49:00 UTC

[jira] [Created] (HUDI-5716) Fix Partitioners to avoid assuming that parallelism is always present

Alexey Kudinkin created HUDI-5716:
-------------------------------------

             Summary: Fix Partitioners to avoid assuming that parallelism is always present
                 Key: HUDI-5716
                 URL: https://issues.apache.org/jira/browse/HUDI-5716
             Project: Apache Hudi
          Issue Type: Bug
          Components: writer-core
            Reporter: Alexey Kudinkin
            Assignee: Alexey Kudinkin
             Fix For: 0.13.1


Currently, `Partitioner` implementations assume that a parallelism level is always defined.

This has not been an issue previously for the following reasons:
 * RDDs always have an inherent "parallelism" level, defined as the number of partitions they operate on. For a Dataset (SparkPlan), however, that is not necessarily the case (some SparkPlans might not report their output partitioning).
 * Additionally, our configs previously set a default parallelism level, which meant that we'd prefer it over the parallelism of the actual incoming dataset.

However, since we've recently removed the default parallelism value from our configs, we now need to fix the Partitioners so that they no longer assume parallelism is always present.
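
As a rough illustration of the kind of change this implies, here is a minimal Java sketch that treats parallelism as optional. It is not Hudi's actual code: the class `ParallelismSketch`, the method `resolveNumPartitions`, and the explicit `fallback` argument are all hypothetical names introduced for this example. The assumed precedence is an explicitly configured value first, then the partition count reported by the incoming data (if any), then a caller-supplied fallback.

```java
import java.util.OptionalInt;

// Hypothetical helper for illustration only; not part of Hudi's API.
final class ParallelismSketch {
    private ParallelismSketch() {}

    /**
     * Picks the number of partitions without assuming any source is present.
     *
     * @param configured parallelism from user config, empty if not set
     * @param incoming   partition count reported by the input, empty if unknown
     * @param fallback   value to use when neither of the above is available
     */
    static int resolveNumPartitions(OptionalInt configured, OptionalInt incoming, int fallback) {
        if (fallback <= 0) {
            throw new IllegalArgumentException("fallback must be positive");
        }
        // Prefer an explicit, valid user setting.
        if (configured.isPresent() && configured.getAsInt() > 0) {
            return configured.getAsInt();
        }
        // Otherwise use what the input reports, if it reports anything usable.
        if (incoming.isPresent() && incoming.getAsInt() > 0) {
            return incoming.getAsInt();
        }
        // Neither is known: fall back rather than fail or divide by zero.
        return fallback;
    }
}
```

With this shape, a caller whose input does not report its output partitioning would pass `OptionalInt.empty()` for `incoming` and still get a usable partition count, instead of relying on a value that may be absent.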



--
This message was sent by Atlassian Jira
(v8.20.10#820010)