Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/01/07 09:38:57 UTC

[GitHub] [hudi] pranotishanbhag opened a new issue #2414: [SUPPORT]

pranotishanbhag opened a new issue #2414:
URL: https://github.com/apache/hudi/issues/2414


   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)? Yes. I also looked at the Tuning Guide: https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org: I have subscribed to this. I also posted this on the Slack group, and the suggestion was to create a GitHub issue.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly: Not sure; it may be a config issue at my end.
   
   **Describe the problem you faced**
   
   After a bulk_insert of a ~4 TB dataset, the files produced are much smaller than the expected size, and a subsequent ~300 GB upsert spends most of its runtime in the "Getting small files from partitions" stage.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Submit a Spark job on EMR 5.30 or above.
   2. Bulk_insert a huge dataset (around 4 TB) with PARTITION_SORT/GLOBAL_SORT sort mode.
   3. Upsert a delta of around 300 GB.
   4. Observe that the job takes too much time in the step "Getting small files from partitions" (count at HoodieSparkSqlWriter.scala:389).
   
   **Expected behavior**
   
   I expect well-sized files of around 128 MB to be created during bulk_insert, with fewer small files, and subsequent upserts to be faster, spending little or no time handling small files.
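   
   For reference, a minimal sketch of the Hudi file-sizing options involved and their documented 0.6 defaults (the values below come from the Hudi configuration docs, not from this job):
   
   ```scala
   // Hudi 0.6 file-sizing knobs with their documented defaults (assumed context,
   // not settings taken from this report).
   val fileSizingOpts = Map(
     "hoodie.parquet.max.file.size"            -> "125829120", // ~120 MB target base file size
     "hoodie.parquet.small.file.limit"         -> "104857600", // files under ~100 MB are bin-packed on insert/upsert
     "hoodie.copyonwrite.record.size.estimate" -> "1024"       // per-record size estimate used before commit stats exist
   )
   ```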
   
   **Environment Description**
   
   * Hudi version : 0.6
   
   * Spark version : 2.4
   
   * Hive version : 2.5
   
   * Hadoop version : 2.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   I have a 4 TB dataset for which I used "bulk_insert" mode with the config "hoodie.bulkinsert.shuffle.parallelism" set to 15000. This job took about 2 hours, and the data is stored in S3, partitioned on 2 columns (uid and source). A few questions here:
   1) What is the right value to use for hoodie.bulkinsert.shuffle.parallelism? If I go by the rule of 500 MB per partition, the value is 5000, but with that my job was very slow. I then changed it to 15000 and it completed in 2 hours.
   I used partition_sort instead of the default global_sort for hoodie.bulkinsert.sort.mode, as I wanted the data sorted within each partition. With this the job was a few minutes faster than with global_sort, but I saw a variation in file sizes: with partition_sort, some files in a few partitions were very small (1-2 MB) and others were around 60-70 MB, whereas with global_sort most files were 60-70 MB. The default file size is supposed to be 120 MB. Why are we seeing smaller file sizes, and which sort mode is better to get the target file size of 120 MB?
   2) After the bulk_insert, I tried an upsert with a 300 GB delta. For this, I set hoodie.upsert.shuffle.parallelism to 600, following the Hudi tuning guide's rule of 500 MB per partition. With this the job completed in 1 hour 20 minutes, of which about 50 minutes were spent in the step "Getting small files from partitions (count at HoodieSparkSqlWriter.scala:389)". I assume this is because of small files, but I am not sure what I have done wrong. Could this be related to the smaller files created by the previous bulk_insert, or is there some setting I may be missing?
   We expect a delta of around 200-500 MB every 12 hours, and we run the job on EMR 5.30 for now (we will upgrade later, after the initial Hudi experiments). I am using Hudi version 0.6.
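   
   For reference, a minimal sketch of how the options described above would be passed on the write path; the table name, record key, precombine field, and S3 path are illustrative assumptions, not values from this job:
   
   ```scala
   // Hypothetical bulk_insert write: only the parallelism and sort-mode options
   // come from the description above; everything else is assumed.
   inputDf.write
     .format("org.apache.hudi")
     .option("hoodie.table.name", "events")                          // assumed
     .option("hoodie.datasource.write.recordkey.field", "record_id") // assumed
     .option("hoodie.datasource.write.partitionpath.field", "uid,source")
     .option("hoodie.datasource.write.keygenerator.class",
             "org.apache.hudi.keygen.ComplexKeyGenerator")           // multi-column partition path
     .option("hoodie.datasource.write.precombine.field", "ts")       // assumed
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .option("hoodie.bulkinsert.shuffle.parallelism", "15000")
     .option("hoodie.bulkinsert.sort.mode", "PARTITION_SORT")        // or GLOBAL_SORT
     .mode("append")
     .save("s3://my-bucket/hudi/events")                             // assumed path
   ```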
   
   **Stacktrace**
   
   No exception. The job is taking too much time at the step "Getting small files from partitions" (count at HoodieSparkSqlWriter.scala:389).
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on issue #2414: [SUPPORT]

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2414:
URL: https://github.com/apache/hudi/issues/2414#issuecomment-756578166


   Answer provided: https://apache-hudi.slack.com/archives/C4D716NPQ/p1609922670490200?thread_ts=1609360627.455000&cid=C4D716NPQ
   
   Regarding your first question, bulk_insert simply honors the parallelism you have provided and uses it to create separate files. Please look at https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-Whatperformance/ingestlatencycanIexpectforHudiwriting. If you want file sizing to be taken care of, use INSERT mode for the first commit. Regarding your second question, with sorting inside a partition (note: this is a Spark partition), records within a Spark partition that fall under different Hive partitions end up as separate files. With global sort mode, each Spark partition is likely to see records from the same Hive partition. Also, the Spark step name might be misleading; it could be the index lookup time. Try global sort mode with insert for the first commit and then run the upsert to observe the upsert time.
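   
   A minimal sketch of that suggested flow (insert for the first commit so Hudi handles file sizing, then upsert for the deltas); the field names, parallelism values, and base path are assumptions for illustration, not from this thread:
   
   ```scala
   // Shared write options; key, partition, and precombine fields are assumed.
   val common = Map(
     "hoodie.table.name"                           -> "events",
     "hoodie.datasource.write.recordkey.field"     -> "record_id",
     "hoodie.datasource.write.partitionpath.field" -> "uid,source",
     "hoodie.datasource.write.keygenerator.class"  -> "org.apache.hudi.keygen.ComplexKeyGenerator",
     "hoodie.datasource.write.precombine.field"    -> "ts"
   )
   val basePath = "s3://my-bucket/hudi/events" // assumed
   
   // First commit: INSERT, so Hudi bin-packs records toward the target base file size.
   initialDf.write.format("org.apache.hudi")
     .options(common)
     .option("hoodie.datasource.write.operation", "insert")
     .option("hoodie.insert.shuffle.parallelism", "5000") // illustrative value
     .mode("append").save(basePath)
   
   // Later deltas: UPSERT; index lookup and small-file handling happen in this step.
   deltaDf.write.format("org.apache.hudi")
     .options(common)
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.upsert.shuffle.parallelism", "600")
     .mode("append").save(basePath)
   ```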





[GitHub] [hudi] bvaradar closed issue #2414: [SUPPORT]

Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #2414:
URL: https://github.com/apache/hudi/issues/2414


   





[GitHub] [hudi] pranotishanbhag removed a comment on issue #2414: [SUPPORT]

Posted by GitBox <gi...@apache.org>.
pranotishanbhag removed a comment on issue #2414:
URL: https://github.com/apache/hudi/issues/2414#issuecomment-758042202


   Hi,
   
   I tried copy_on_write with insert mode for a 4.6 TB dataset, and it is failing with lost nodes (previously I tried bulk_insert, which worked fine). I tried tweaking the executor memory and also changed the config "hoodie.copyonwrite.insert.split.size" to 100k. It still failed. Later I added the config "hoodie.insert.shuffle.parallelism" and set it to a higher value, and I still see failures. Could you please tell me what is wrong here?
   
   Thanks,
   Pranoti





[GitHub] [hudi] pranotishanbhag commented on issue #2414: [SUPPORT]

Posted by GitBox <gi...@apache.org>.
pranotishanbhag commented on issue #2414:
URL: https://github.com/apache/hudi/issues/2414#issuecomment-758042202


   Hi,
   
   I tried copy_on_write with insert mode for a 4.6 TB dataset, and it is failing with lost nodes (previously I tried bulk_insert, which worked fine). I tried tweaking the executor memory and also changed the config "hoodie.copyonwrite.insert.split.size" to 100k. It still failed. Later I added the config "hoodie.insert.shuffle.parallelism" and set it to a higher value, and I still see failures. Could you please tell me what is wrong here?
   
   Thanks,
   Pranoti
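   
   A hedged sketch of the insert path being described; the only settings taken from the comment are the two named configs, while the table details, parallelism value, and path are illustrative assumptions:
   
   ```scala
   // Hypothetical COW insert exercising the two configs mentioned above;
   // `df`, field names, and the S3 path are assumed for illustration.
   df.write.format("org.apache.hudi")
     .option("hoodie.table.name", "events")                          // assumed
     .option("hoodie.datasource.write.recordkey.field", "record_id") // assumed
     .option("hoodie.datasource.write.partitionpath.field", "uid,source")
     .option("hoodie.datasource.write.precombine.field", "ts")       // assumed
     .option("hoodie.datasource.write.operation", "insert")
     .option("hoodie.copyonwrite.insert.split.size", "100000")       // 100k, as tried above (default 500000)
     .option("hoodie.insert.shuffle.parallelism", "5000")            // "a higher value" (illustrative)
     .mode("append")
     .save("s3://my-bucket/hudi/events")                             // assumed path
   ```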

