Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/01/09 19:32:19 UTC

[GitHub] [hudi] sam-wmt opened a new issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2

sam-wmt opened a new issue #2423:
URL: https://github.com/apache/hudi/issues/2423


   Job performance degraded over the course of 2-3 weeks and eventually started to suffer from significant timeout exceptions when dealing with the ADLS object storage.  The Azure storage team noted excessive sequential Create Dir operations from the workload and asked whether we could investigate what might be causing this within the Hudi libraries and what could be done about it.  Note that we're running only two workloads against this container, so our IO and operations/sec are well within the norm; the issues we're seeing are specifically with Delete and Create file operations.
   
   For a single batch of data we saw 65k create directory operations (30k of which timed out) issued in a very small window of time, which we believe caused the job/storage account to be put into a bad state.
   
   Below are some operation types being issued via our hudi workload across the day:
   (The image upload did not complete; the screenshot is attached in a follow-up comment.)
   
   
   **Runtime details:**
   Hudi Release: 0.6.0
   Spark: Azure Databricks runtime (lite) 2.4 Workers: Standard_D16s_v3 (16-cores each 64GB-Ram, 20 workers)
   Streaming Duration: We tried both 10-minutes and 30-minutes on the table
   Source: Kafka cluster 105 partitions, average ingestion rate of ~500/sec spikes of up to 4000/sec (~3KB records)
   Storage: Azure ADLSV2 / StorageV2 (general purpose v2), Standard/Hot Storage, Read-access geo-redundant storage (RA-GRS)
   
   **Table details:**
   Table Info: Merge On Read, Inline Compaction every 18 commits, 1 retained commit per key
   Table seeded via live stream; no Insert/Bulk Insert leveraged
   As Reported from CLI / Last ### compaction
   Row Count: 1,393,797,816 (slowly growing)
   Data Size:  542.9 GB
   File Count: 15,255
   Partitions: Randomly (evenly) distributed into 1024 partitions
   
   **Hudi Configuration:**
   Primary Options:
         .option(HoodieWriteConfig.UPSERT_PARALLELISM, String.valueOf(320))
         .option(HoodieWriteConfig.INSERT_PARALLELISM, String.valueOf(320))
         .option(HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP, String.valueOf(1))
         .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, String.valueOf(18))
         .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, String.valueOf(true))
         .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, String.valueOf(256 * 1024 * 1024))
         .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, String.valueOf(256 * 1024 * 1024))
         .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
   Additional Options:
   "hoodie.compaction.strategy" -> "org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy",
   "hoodie.bloom.index.prune.by.ranges" -> "false"
      


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] sam-wmt commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2

Posted by GitBox <gi...@apache.org>.
sam-wmt commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-759969185


   Thanks so much @bvaradar. We will test the above and report back on how the performance looks.  Based on the stats above, I would expect this to resolve the issues.  If we find this to be successful, would you be open to a PR that adds a configuration option to choose between mkdirs and getFileStatus?
   
   Kind Regards





[GitHub] [hudi] nsivabalan commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-768664594


   https://github.com/apache/hudi/pull/2501
   





[GitHub] [hudi] nsivabalan commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-772569933


   Closing this ticket (the PR has landed). If you find any other issues, do let us know. Thanks for helping improve Hudi :)





[GitHub] [hudi] nsivabalan edited a comment on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-768296704


   @bvaradar: do you think we need a config for this? By the way, we have many mkdirs() calls within Hudi (HoodieRowCreateHandle, SpillableMapBasedFileSystemView, HoodieTableMetaClient, etc.). Do you think we need to fix all of these places, and maybe guard the change with a flag?
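
One way to picture the "guard by a flag" idea is a single helper that every call site shares, with the existence check switched on by configuration. The sketch below is only an illustration under stated assumptions: it uses `java.nio.file` on the local filesystem as a stand-in for Hadoop's `FileSystem`, and the class name `PartitionDirs` and the property key `example.mkdirs.check.exists.first` are hypothetical, not Hudi names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

class PartitionDirs {
  // Hypothetical property name, used only for this illustration; not a Hudi config key.
  static final String CHECK_FIRST_KEY = "example.mkdirs.check.exists.first";

  // Single entry point that call sites could share instead of creating directories directly.
  // Returns true if a create call was actually issued.
  static boolean ensureDir(Path dir, Properties props) throws IOException {
    boolean checkFirst = Boolean.parseBoolean(props.getProperty(CHECK_FIRST_KEY, "false"));
    if (checkFirst && Files.isDirectory(dir)) {
      return false; // metadata lookup found the directory, so skip the create call
    }
    Files.createDirectories(dir); // does not fail if the directory already exists
    return true;
  }

  public static void main(String[] args) throws IOException {
    Path base = Files.createTempDirectory("partition-dirs-demo");
    Path partition = base.resolve("partition=0001");

    Properties props = new Properties();
    props.setProperty(CHECK_FIRST_KEY, "true");

    System.out.println("first call issued create:  " + ensureDir(partition, props));
    System.out.println("second call issued create: " + ensureDir(partition, props));
  }
}
```

Defaulting the flag to `false` keeps the existing behavior unless a user opts in; whether the check should instead be the default or the only behavior is the open question in this thread.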





[GitHub] [hudi] christoph-wmt commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2

Posted by GitBox <gi...@apache.org>.
christoph-wmt commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-768386541


   @bvaradar while this one code path has already made a huge difference, I think it's worth approaching this elsewhere as well.
   We've observed that a successful call takes, on average, ~2200ms for CreatePathDir vs ~70ms for GetFileProperties on ADLS / Azure.
   So making this configurable and minimizing create operations as much as possible would be huge for Azure users.
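
As a rough back-of-the-envelope illustration of those averages, the sketch below multiplies them by the 65k create-directory operations reported for a single batch in the original issue. It assumes every call hits an already-existing directory and that the averages apply uniformly; the totals are aggregate latency summed across parallel tasks, not wall-clock time. The class and method names are made up for the example.

```java
class CreateDirCostEstimate {
  // Aggregate latency in seconds for `calls` operations at `millisPerCall` each.
  static double totalSeconds(long calls, double millisPerCall) {
    return calls * millisPerCall / 1000.0;
  }

  public static void main(String[] args) {
    long calls = 65_000;      // create-directory operations reported for one batch
    double createMs = 2200.0; // observed average for CreatePathDir
    double statMs = 70.0;     // observed average for GetFileProperties

    double create = totalSeconds(calls, createMs);
    double stat = totalSeconds(calls, statMs);
    System.out.printf("CreatePathDir:     %.0f s aggregate%n", create);
    System.out.printf("GetFileProperties: %.0f s aggregate%n", stat);
    System.out.printf("ratio:             %.1fx%n", create / stat);
  }
}
```

Under these assumptions the create path costs about 143,000 s in aggregate against about 4,550 s for the metadata lookup, roughly a 31x difference.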





[GitHub] [hudi] bvaradar commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-765213042


   @nsivabalan: Can you open a PR with the code changes in https://github.com/apache/hudi/issues/2423#issuecomment-758433327 to get it landed?








[GitHub] [hudi] sam-wmt commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2

Posted by GitBox <gi...@apache.org>.
sam-wmt commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-757356013


   <img width="748" alt="Screen Shot 2021-01-09 at 2 32 48 PM" src="https://user-images.githubusercontent.com/67726885/104107035-9774aa00-5287-11eb-9f8d-a43214fe1266.png">
   Adding screenshot of operation type stats for 1 day of the workload.





[GitHub] [hudi] sam-wmt commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2

Posted by GitBox <gi...@apache.org>.
sam-wmt commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-768373302


   Happy to submit a PR with or without a config, depending on whether you think this should be the only behavior, the default behavior, or an optional behavior.  We've seen drastic improvements: our Azure storage accounts and containers that were in an unhealthy state have recovered nicely after this patch.  Please let me know and I can submit the PR.
   
   @nsivabalan , @bvaradar 








[GitHub] [hudi] nsivabalan closed issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #2423:
URL: https://github.com/apache/hudi/issues/2423


   








[GitHub] [hudi] bvaradar edited a comment on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2

Posted by GitBox <gi...@apache.org>.
bvaradar edited a comment on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-758433327


   Hudi does not synchronize on partition path creation. Instead, each executor task that is about to write a parquet file ensures the directory path exists by issuing an fs.mkdirs call. Added: https://issues.apache.org/jira/browse/HUDI-1523
   
   If mkdirs is a costly API, can you try this patch? It trades the unconditional mkdirs call for a getFileStatus() call (fs.exists is built on getFileStatus() in Hadoop's base FileSystem class) and only calls mkdirs when the path is missing:
   ```
   diff --git a/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java b/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
   index d148b1b8..11b3cb49 100644
   --- a/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
   +++ b/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
   @@ -105,7 +105,9 @@ public abstract class HoodieWriteHandle<T extends HoodieRecordPayload> extends H
      public Path makeNewPath(String partitionPath) {
        Path path = FSUtils.getPartitionPath(config.getBasePath(), partitionPath);
        try {
   -      fs.mkdirs(path); // create a new partition as needed.
   +      if (!fs.exists(path)) {
   +        fs.mkdirs(path); // create a new partition as needed.
   +      }
        } catch (IOException e) {
          throw new HoodieIOException("Failed to make dir " + path, e);
        }
   ```
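
To make the effect of the guard concrete, here is a minimal, self-contained sketch that counts how many calls of each kind are issued when many write handles target the same partition. It does not use Hadoop: `DirStore` and `CountingStore` are hypothetical stand-ins for `FileSystem`, and the partition path is made up. The check and the create are not atomic, which is tolerable here only because mkdirs on an existing directory is harmless.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class MkdirGuardDemo {
  // Original behavior: every write handle issues mkdirs unconditionally.
  static void ensureUnguarded(DirStore fs, String path) {
    fs.mkdirs(path);
  }

  // Patched behavior: pay for an existence check, create only if the path is missing.
  static void ensureGuarded(DirStore fs, String path) {
    if (!fs.exists(path)) {
      fs.mkdirs(path);
    }
  }

  // Simulates `tasks` write handles that all target the same partition path.
  static CountingStore simulate(int tasks, boolean guarded) {
    CountingStore fs = new CountingStore();
    for (int i = 0; i < tasks; i++) {
      if (guarded) {
        ensureGuarded(fs, "/table/partition=0001");
      } else {
        ensureUnguarded(fs, "/table/partition=0001");
      }
    }
    return fs;
  }

  public static void main(String[] args) {
    CountingStore before = simulate(100, false);
    CountingStore after = simulate(100, true);
    System.out.println("unguarded mkdirs calls: " + before.mkdirsCalls.get());
    System.out.println("guarded mkdirs calls:   " + after.mkdirsCalls.get()
        + " (exists calls: " + after.existsCalls.get() + ")");
  }
}

// Hypothetical stand-in for the two FileSystem calls used in the patch.
interface DirStore {
  boolean exists(String path);
  boolean mkdirs(String path);
}

// In-memory store that counts how many calls of each kind it receives.
class CountingStore implements DirStore {
  private final Set<String> dirs = ConcurrentHashMap.newKeySet();
  final AtomicInteger existsCalls = new AtomicInteger();
  final AtomicInteger mkdirsCalls = new AtomicInteger();

  public boolean exists(String path) {
    existsCalls.incrementAndGet();
    return dirs.contains(path);
  }

  public boolean mkdirs(String path) {
    mkdirsCalls.incrementAndGet();
    dirs.add(path);
    return true; // succeeds whether or not the directory already existed
  }
}
```

In this sequential simulation the guarded variant issues one create call and 100 existence checks, where the unguarded variant issues 100 create calls. With truly concurrent tasks, a few may still race past the check and each issue a create, so the guard reduces create traffic rather than eliminating duplicates.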

