
Posted to commits@hudi.apache.org by "Prashant Wason (Jira)" <ji...@apache.org> on 2020/10/22 06:07:00 UTC

[jira] [Updated] (HUDI-1351) Improvements required to hudi-test-suite for scalable and repeated testing

     [ https://issues.apache.org/jira/browse/HUDI-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-1351:
---------------------------------
    Status: In Progress  (was: Open)

> Improvements required to hudi-test-suite for scalable and repeated testing
> --------------------------------------------------------------------------
>
>                 Key: HUDI-1351
>                 URL: https://issues.apache.org/jira/browse/HUDI-1351
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>
> The hudi-test-suite has some shortcomings that would be good to fix:
> 1. When the same DAG is run repeatedly, the input and output directories must be cleaned manually between runs, which is cumbersome.
> 2. When running a long test, the input data generated by older DAG nodes is not deleted, which leads to a high file count on the HDFS cluster. The older files can be deleted once their data has been ingested.
> 3. When generating input data, if the number of insert/update partitions is less than Spark's default parallelism, a number of empty Avro files are created. This also leads to scalability issues on the HDFS cluster. Creating a large number of small Avro files is slower and less scalable than creating a single Avro file.
> 4. When generating data to be inserted, we cannot control which partition the data will be generated for or add a new partition. Hence we need a start_offset parameter to control the partition offset.
> 5. BUG: The correct number of insert partitions is not generated because the partition number is chosen as a random long.
> 6. BUG: Integer division is used inside Math.ceil in a couple of places, which is incorrect and yields 0. Math.ceil(5/10) == 0, not 1 as intended, because 5 and 10 are integers and 5/10 is evaluated as 0 before Math.ceil is applied.
>  
> 1. When generating input data, 
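
The integer-division problem in item 6 can be reproduced in isolation. The sketch below is illustrative only and is not taken from the hudi-test-suite code: the class name CeilDivisionDemo and its method names are hypothetical. It shows the faulty pattern next to two common fixes, assuming a non-negative numerator and a positive denominator.

```java
// Hypothetical demo class; none of these names come from hudi-test-suite.
public class CeilDivisionDemo {

    // Faulty pattern: numerator / denominator is integer division, so the
    // fractional part is already discarded before Math.ceil is applied.
    static int brokenCeil(int numerator, int denominator) {
        return (int) Math.ceil(numerator / denominator);
    }

    // Fix 1: promote one operand to double so the division keeps its fraction.
    static int ceilViaDouble(int numerator, int denominator) {
        return (int) Math.ceil((double) numerator / denominator);
    }

    // Fix 2: integer-only ceiling division. Assumes numerator >= 0 and
    // denominator > 0, and that numerator + denominator - 1 does not overflow.
    static int ceilViaIntegerMath(int numerator, int denominator) {
        return (numerator + denominator - 1) / denominator;
    }

    public static void main(String[] args) {
        System.out.println(brokenCeil(5, 10));         // prints 0
        System.out.println(ceilViaDouble(5, 10));      // prints 1
        System.out.println(ceilViaIntegerMath(5, 10)); // prints 1
    }
}
```

The cast in ceilViaDouble is the smallest change to an existing Math.ceil call. The integer-only form avoids floating point entirely, which may be preferable when the operands are counts such as records per partition.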



--
This message was sent by Atlassian Jira
(v8.3.4#803005)