Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/01/09 19:32:19 UTC
[GitHub] [hudi] sam-wmt opened a new issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2
sam-wmt opened a new issue #2423:
URL: https://github.com/apache/hudi/issues/2423
Job performance degraded over the course of 2-3 weeks, and the job eventually started to suffer from significant timeout exceptions when dealing with ADLS object storage. While working with us, the Azure storage team noted an excessive number of sequential Create Dir operations from the workload and asked whether we could investigate what might be causing this within the Hudi libraries and what could be done about it. The main point is that we are running only two workloads against this container, so our IO and operations/sec are well within the norm; where we are seeing issues is specifically with Delete and Create file operations.
For a single batch of data we saw 65k create directory operations (30k of which timed out) issued within a very small window of time, which we believe caused the job/storage account to be put into a bad state.
Below are some operation types being issued via our hudi workload across the day:
**Runtime details:**
Hudi Release: 0.6.0
Spark: Azure Databricks runtime (lite) 2.4 Workers: Standard_D16s_v3 (16-cores each 64GB-Ram, 20 workers)
Streaming Duration: We tried both 10-minutes and 30-minutes on the table
Source: Kafka cluster 105 partitions, average ingestion rate of ~500/sec spikes of up to 4000/sec (~3KB records)
Storage: Azure ADLSV2 / StorageV2 (general purpose v2, Standard/Hot Storage, Read-access geo-redundant storage (RA-GRS))
**Table details:**
Table Info: Merge On Read, Inline Compaction every 18 commits, 1 retained commit per key
Table seeded via live stream; no Insert/Bulk Insert leveraged
As Reported from CLI / Last ### compaction
Row Count: 1,393,797,816 (slowly growing)
Data Size: 542.9 GB
File Count: 15,255
Partitions: Randomly (evenly) distributed into 1024 partitions
**Hudi Configuration:**
Primary Options:
.option(HoodieWriteConfig.UPSERT_PARALLELISM, String.valueOf(320))
.option(HoodieWriteConfig.INSERT_PARALLELISM, String.valueOf(320))
.option(HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP, String.valueOf(1))
.option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, String.valueOf(18))
.option(HoodieCompactionConfig.INLINE_COMPACT_PROP, String.valueOf(true))
.option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, String.valueOf(256 * 1024 * 1024))
.option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, String.valueOf(256 * 1024 * 1024))
.option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
Additional Options:
"hoodie.compaction.strategy" -> "org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy",
"hoodie.bloom.index.prune.by.ranges" -> "false"
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] sam-wmt commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2
Posted by GitBox <gi...@apache.org>.
sam-wmt commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-759969185
Thanks so much @bvaradar. We will test the above and report back on how the performance looks. Based on the stats above, I would expect this to resolve the issues. If we find this to be successful, would you be open to a PR that adds a configuration option to choose between mkdirs and getFileStatus?
Kind Regards
[GitHub] [hudi] nsivabalan commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2
Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-768664594
https://github.com/apache/hudi/pull/2501
[GitHub] [hudi] nsivabalan commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2
Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-772569933
Closing this ticket (the PR has landed). If you find any other issues, do let us know. Thanks for helping improve Hudi :)
[GitHub] [hudi] nsivabalan edited a comment on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2
Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-768296704
@bvaradar : do you think we need a config for this? By the way, we have many mkdirs() calls within Hudi (HoodieRowCreateHandle, SpillableMapBasedFileSystemView, HoodieTableMetaClient, etc.). Do you think we need to fix all of these places and maybe guard the change with a flag?
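One way the flag-guarded idea could look is sketched below. This is only an illustration under stated assumptions: it uses java.nio.file instead of Hadoop's FileSystem so that it runs standalone, and the class name, the property key, and the helper methods are all hypothetical and not part of Hudi.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical helper: a single place where call sites could route directory creation.
final class GuardedMkdirs {
    // Hypothetical property key invented for this sketch; not a real Hudi config.
    static final String CHECK_EXISTS_KEY = "example.fs.mkdirs.check.exists.first";

    private final boolean checkExistsFirst;

    GuardedMkdirs(Properties props) {
        // Defaults to the original behavior (always issue the create call).
        this.checkExistsFirst =
            Boolean.parseBoolean(props.getProperty(CHECK_EXISTS_KEY, "false"));
    }

    // Returns true if a create call was issued, false if it was skipped.
    boolean mkdirs(Path dir) throws IOException {
        if (checkExistsFirst && Files.isDirectory(dir)) {
            return false;
        }
        Files.createDirectories(dir); // does not fail if the directory already exists
        return true;
    }

    // Issues two consecutive mkdirs for the same path and counts the create calls made.
    static int createCallsForTwoWrites(boolean flag) {
        Properties props = new Properties();
        props.setProperty(CHECK_EXISTS_KEY, String.valueOf(flag));
        GuardedMkdirs guarded = new GuardedMkdirs(props);
        try {
            Path dir = Files.createTempDirectory("guarded-mkdirs").resolve("partition=1");
            int calls = 0;
            if (guarded.mkdirs(dir)) calls++;
            if (guarded.mkdirs(dir)) calls++;
            return calls;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("flag off: " + createCallsForTwoWrites(false) + " create calls");
        System.out.println("flag on:  " + createCallsForTwoWrites(true) + " create calls");
    }
}
```

Routing every call site through one helper like this would keep the flag check in a single place instead of repeating it in each class that calls mkdirs().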
[GitHub] [hudi] christoph-wmt commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2
Posted by GitBox <gi...@apache.org>.
christoph-wmt commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-768386541
@bvaradar while this one code path has already made a huge difference, I think it's worth approaching this elsewhere as well.
We've observed, for successful calls on ADLS / Azure, an average of ~2200ms (CreatePathDir) vs. ~70ms (GetFileProperties).
So making this configurable and minimizing create operations as much as possible would be huge for Azure users.
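To put those averages in perspective, here is a rough back-of-the-envelope calculation. It assumes, purely for illustration, that the 65k create directory operations reported for a single batch earlier in this thread each cost the average latency quoted above, and it sums request time without modelling parallelism, retries, or timeouts. The class and method names are made up for this sketch.

```java
// Rough cumulative request-time estimate; not a measurement.
final class CreateDirCost {
    // Total request time in seconds for `calls` requests at `avgLatencyMs` each.
    static double cumulativeSeconds(long calls, double avgLatencyMs) {
        return calls * avgLatencyMs / 1000.0;
    }

    public static void main(String[] args) {
        long calls = 65_000;                                    // per-batch count from the issue report
        double createDir = cumulativeSeconds(calls, 2200.0);    // ~2200ms CreatePathDir
        double getProps = cumulativeSeconds(calls, 70.0);       // ~70ms GetFileProperties
        System.out.println("CreatePathDir total seconds:     " + createDir);
        System.out.println("GetFileProperties total seconds: " + getProps);
        System.out.println("ratio: " + (createDir / getProps));
    }
}
```

Under these assumptions the cumulative request time drops from about 143,000 seconds to about 4,550 seconds per batch, roughly a 31x reduction, when every call hits a directory that already exists.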
[GitHub] [hudi] bvaradar commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2
Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-765213042
@nsivabalan : Can you open a PR with the code changes in https://github.com/apache/hudi/issues/2423#issuecomment-758433327 so that it can be landed?
[GitHub] [hudi] sam-wmt commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2
Posted by GitBox <gi...@apache.org>.
sam-wmt commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-757356013
<img width="748" alt="Screen Shot 2021-01-09 at 2 32 48 PM" src="https://user-images.githubusercontent.com/67726885/104107035-9774aa00-5287-11eb-9f8d-a43214fe1266.png">
Adding screenshot of operation type stats for 1 day of the workload.
[GitHub] [hudi] sam-wmt commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2
Posted by GitBox <gi...@apache.org>.
sam-wmt commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-768373302
Happy to submit a PR with or without config depending on if you think this should be the only behavior, default behavior, or optional behavior. We've seen drastic improvements in our Azure storage accounts and containers which were in an unhealthy state have recovered nicely after this patch. Please let me know and I can submit the PR.
@nsivabalan , @bvaradar
[GitHub] [hudi] nsivabalan closed issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2
Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #2423:
URL: https://github.com/apache/hudi/issues/2423
[GitHub] [hudi] bvaradar edited a comment on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2
Posted by GitBox <gi...@apache.org>.
bvaradar edited a comment on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-758433327
Hudi does not synchronize on partition path creation. Instead, each executor task that is about to write to a parquet file ensures the directory path exists by issuing an fs.mkdirs call. Added: https://issues.apache.org/jira/browse/HUDI-1523
If mkdirs is a costly API, can you try this patch? It trades off the mkdirs call for a getFileStatus() call:
```
diff --git a/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java b/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
index d148b1b8..11b3cb49 100644
--- a/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
+++ b/hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
@@ -105,7 +105,9 @@ public abstract class HoodieWriteHandle<T extends HoodieRecordPayload> extends H
   public Path makeNewPath(String partitionPath) {
     Path path = FSUtils.getPartitionPath(config.getBasePath(), partitionPath);
     try {
-      fs.mkdirs(path); // create a new partition as needed.
+      if (!fs.exists(path)) {
+        fs.mkdirs(path); // create a new partition as needed.
+      }
     } catch (IOException e) {
       throw new HoodieIOException("Failed to make dir " + path, e);
     }
```
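The effect of the check-then-create pattern in the patch can be tried out with a small standalone sketch. It is an assumption-laden stand-in rather than Hudi code: it uses java.nio.file (Files.isDirectory and Files.createDirectories) in place of Hadoop's fs.exists and fs.mkdirs, and the class name, the counter, and the helper methods are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in for the patched makeNewPath(); uses java.nio, not Hadoop's FileSystem.
final class PartitionDirSketch {
    // Hypothetical instrumentation: counts create calls that were actually issued.
    static final AtomicInteger CREATE_CALLS = new AtomicInteger();

    static Path ensurePartitionDir(Path basePath, String partitionPath) {
        Path dir = basePath.resolve(partitionPath);
        try {
            // Metadata lookup first; only issue a create when the directory is missing.
            if (!Files.isDirectory(dir)) {
                CREATE_CALLS.incrementAndGet();
                // Idempotent: no exception if another task created the directory in the meantime.
                Files.createDirectories(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to make dir " + dir, e);
        }
        return dir;
    }

    // Simulates many writes to the same partition and returns the create calls issued.
    static int createCallsFor(int writes) {
        try {
            Path base = Files.createTempDirectory("hudi-sketch");
            CREATE_CALLS.set(0);
            for (int i = 0; i < writes; i++) {
                ensurePartitionDir(base, "partition=7");
            }
            return CREATE_CALLS.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("create calls issued: " + createCallsFor(1000));
    }
}
```

Because the create call is idempotent, a race between the existence check and the create is harmless: at worst a few tasks still issue a redundant create for a brand-new partition, while every later write to that partition only pays for the lookup.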