Posted to commits@hudi.apache.org by "jenu9417 (via GitHub)" <gi...@apache.org> on 2023/02/19 06:51:51 UTC

[GitHub] [hudi] jenu9417 opened a new issue, #7991: Higher number of S3 HEAD requests, while writing data to S3.

jenu9417 opened a new issue, #7991:
URL: https://github.com/apache/hudi/issues/7991

   **Problem**
   
   We are ingesting data from Kafka into an S3 bucket via the HoodieDeltaStreamer tool, running on EMR v6.9.0 (Hudi v0.12.1), with Hive sync of partitions to Glue enabled.
   Writing data to S3, partitioning, and syncing metadata to Glue are all working well.
   However, when we analyze the number of S3 requests, we see an abnormally high number of HEAD requests to S3, and we are not sure what exactly is driving them.
   When we analyzed the S3 access logs, some of the HEAD requests were:
   ```
   HEAD /data/testfolder/.hoodie/20230216054705316.deltacommit HTTP/1.1" 404 NoSuchKey
   HEAD /data/testfolder/.hoodie/metadata/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile HTTP/1.1" 404 NoSuchKey
   HEAD /data/testfolder/.hoodie/metadata/.hoodie/hoodie.properties HTTP/1.1" 200
   ```
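Lines like these can be tallied to see the per-operation breakdown. A minimal sketch, assuming the standard S3 server access log format where the operation field looks like `REST.HEAD.OBJECT` (the sample lines below are abridged and hypothetical):

```python
from collections import Counter

# Abridged, hypothetical S3 server access log lines; real lines carry
# more fields, but the operation token (e.g. REST.HEAD.OBJECT) is what
# we group on here.
log_lines = [
    'bucket [19/Feb/2023:06:51:51 +0000] REST.HEAD.OBJECT data/testfolder/.hoodie/20230216054705316.deltacommit 404',
    'bucket [19/Feb/2023:06:51:52 +0000] REST.HEAD.OBJECT data/testfolder/.hoodie/metadata/.hoodie/hoodie.properties 200',
    'bucket [19/Feb/2023:06:51:53 +0000] REST.GET.BUCKET - 200',
]

def tally_operations(lines):
    """Count requests per S3 operation type."""
    ops = Counter()
    for line in lines:
        for field in line.split():
            if field.startswith("REST."):
                ops[field] += 1
                break  # one operation token per log line
    return ops

print(tally_operations(log_lines))
```

Running this over the full access log for one write window gives the HEAD/GET/PUT/LIST ratios discussed below.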
   
   Can someone please help me understand what these requests are? Why are they made? Is there a way to optimize or reduce them? For context, out of 100 API requests, roughly 2/3 are HEAD requests.
   
   1) Please help us understand the HEAD requests and how we can reduce them
   
   Also, another query:
   2) We use CustomKeyGenerator to format the partition value using timestamp-based conversion. This works when we write directly to S3. But when we enable Hive sync with the same partition format ('datecreated:TIMESTAMP,tenant:SIMPLE'), it throws an error saying that such partitioning is not supported in the Hive CREATE TABLE command (due to the use of ':', I guess). Is there a workaround for this? Is it possible to use CustomKeyGenerator partition values for Hive as well?
   
   
   **To Reproduce**
   
   Use HoodieDeltaStreamer tool with Kafka as source and S3 as the target sink.
   
   Command We Run:
   
   ```
   spark-submit --jars /usr/lib/spark/jars/spark-avro.jar,/usr/lib/hudi/hudi-utilities-bundle.jar --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field datecreated --table-type MERGE_ON_READ --target-table testfolder --target-base-path s3a://bucket/data/testfolder/ --source-limit 1000 --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=s3a://bucket/config/schema.avsc --hoodie-conf auto.offset.reset=earliest --hoodie-conf group.id=test-group --hoodie-conf bootstrap.servers=127.0.0.1:9092 --hoodie-conf hoodie.deltastreamer.source.kafka.topic=test --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator --hoodie-conf hoodie.datasource.write
 .recordkey.field=sid --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true --hoodie-conf hoodie.datasource.write.partitionpath.field='datecreated:TIMESTAMP,tenant:SIMPLE' --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat="yyyy-MM-dd'T'HH:mm:ss.SSSZ" --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyyMMddHH --hoodie-conf hoodie.datasource.hive_sync.assume_date_partitioning=false --hoodie-conf hoodie.datasource.hive_sync.database=testinghudi --hoodie-conf hoodie.datasource.hive_sync.table=testfolder --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor --hoodie-conf hoodie.datasource.hive_sync.partition_fields='datecreated,tenant' --enable-sync
   ```
   
   
   **Environment Description**
   
   * Hudi version : 0.12.1
   
   * Spark version : 3.3.0
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.3
   
   * EMR version : 6.9.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No. Running on EMR using HoodieDeltaStreamer tool
   
   
   **Additional context**
   
   Found another issue along similar lines, but it had already been closed without concluding on a solution.
   
   https://github.com/apache/hudi/issues/2252
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1453999940

   We fixed an issue where Hive sync unnecessarily loaded the archived timeline: https://github.com/apache/hudi/pull/7561
   With 0.13.0, this should no longer be the case.
   
   




[GitHub] [hudi] xushiyan commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1580232229

   @jenu9417 Did you see improvements in the new version? Did you try with the master code? We have also made some improvements there recently.




[GitHub] [hudi] njalan commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "njalan (via GitHub)" <gi...@apache.org>.
njalan commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1741792198

   I also face the same issue: there are hundreds of S3 LIST calls within 40 seconds for a single table when using Spark Structured Streaming to write to Hudi. I am using Spark 3.0 with Hudi 0.9 / Hudi 0.13.1, and both versions have the same issue.




[GitHub] [hudi] nsivabalan commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1454007979

   Can you clarify something: what exactly is your Hudi table base path?
   `/data/testfolder`
   Is it `data` or is it `/data/testfolder`?
   Hudi will not do any list operations on the parent of the Hudi table base path.
   But if you have other non-Hudi folders within the Hudi table base path, it could try to list those folders.
   It also depends on whether you have the metadata table enabled or not. If you can clarify what the base path is, and for which directories you see the high number of LIST calls, we can go from there.
   
   




[GitHub] [hudi] umehrot2 commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "umehrot2 (via GitHub)" <gi...@apache.org>.
umehrot2 commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1437510365

   @jenu9417 Can you check if you have a lot of archived commits in your timeline i.e `.hoodie/archived` folder ? Like @danny0405 mentioned, the above issue has been identified as a regression. To confirm whether this is the same issue you are facing, you can try by turning off the Hive Sync (Glue sync) and checking if you still observe the surge in HEAD requests.
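One quick way to check the archived timeline size is to list the keys under the table's `.hoodie/archived/` prefix and count them. A hypothetical sketch operating on an already-fetched key list (in practice the keys would come from an S3 listing call, e.g. boto3's `list_objects_v2`; the key names below are illustrative only):

```python
def count_archived_commits(keys, base_path="data/testfolder"):
    """Count objects under the table's .hoodie/archived/ prefix."""
    prefix = f"{base_path}/.hoodie/archived/"
    return sum(1 for k in keys if k.startswith(prefix) and not k.endswith("/"))

# Hypothetical key listing for illustration:
keys = [
    "data/testfolder/.hoodie/hoodie.properties",
    "data/testfolder/.hoodie/archived/.commits_.archive.1_1-0-1",
    "data/testfolder/.hoodie/archived/.commits_.archive.2_1-0-1",
]
print(count_archived_commits(keys))  # 2
```

A large count here would be consistent with the archived-timeline regression described above.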




[GitHub] [hudi] HEPBO3AH commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "HEPBO3AH (via GitHub)" <gi...@apache.org>.
HEPBO3AH commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1714635347

   Linking a similar issue: [[SUPPORT] Is this the expected number of S3 calls?](https://github.com/apache/hudi/issues/9612). In our case there is an absurd number of HEAD calls being made during queries using AWS Athena.




[GitHub] [hudi] kazdy commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1440117725

   Hi @jenu9417 , 
   
   I had the same issue with meta sync, and got patched Hudi 0.12.1 jars from AWS Support/ EMR team yesterday. Ask them for it :) 




[GitHub] [hudi] jenu9417 commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "jenu9417 (via GitHub)" <gi...@apache.org>.
jenu9417 commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1451312385

   @kazdy 
   
   Thanks for the suggestion. Will contact EMR team.
   
   Also @yihua @danny0405 @umehrot2 
   
   Any updates on the high number of LIST operations, even when metadata sync is disabled?
   I'm seeing a high number of LIST and HEAD operations on the older version 0.7.0 as well (though not at the scale of 0.12.1).
   
   Can somebody please help me understand, or point me to resources on, why there is such a high number of LIST and HEAD requests?
   Do we have documentation on what exactly happens during a write request and the various operations performed during it?




[GitHub] [hudi] jenu9417 commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "jenu9417 (via GitHub)" <gi...@apache.org>.
jenu9417 commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1477331774

   @nsivabalan Any updates here?




[GitHub] [hudi] nsivabalan commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1542738112

   Yes, this is already in 0.12.3 (if you are asking about https://github.com/apache/hudi/pull/7561)




[GitHub] [hudi] danny0405 commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1436228970

   Maybe it is related to this fix: https://github.com/apache/hudi/pull/7561, which fixes a regression introduced in release 0.12.x




[GitHub] [hudi] jenu9417 commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "jenu9417 (via GitHub)" <gi...@apache.org>.
jenu9417 commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1438812225

   @umehrot2 
   I checked the `.hoodie/archived` folder. There are no files present under it.
   I also tried running with Hive sync turned off (by omitting the --enable-sync flag in the command).
   The number of requests came down significantly, reduced by roughly 95%.
   I then filtered the API requests for the prefix `/data/testfolder` while ingesting 1000 records (900 inserts + 100 updates).
   With Hive Sync Enabled:
   ```
   HEAD -  799
   GET -  86
   PUT - 359
   DELETE - 78
   LIST - 1271   (Happening in the bucket at the same time. Not for the same prefix)
   ```
   
   Without Hive Sync:
   ```
   HEAD -  35
   GET -  8
   PUT - 3
   DELETE - 7
   LIST - 1076  (Happening in the bucket at the same time. Not for the same prefix)
   ```
   
   Here all other request types have reduced except LIST. The LIST requests are not happening for the target prefix (/data/testfolder) but for the entire bucket (like /data), and they occur at the same time as the write. We have verified that there are no other writes happening to this bucket. There are other prefixes/tables inside the same bucket which contain data but have no active reads (like /data/newtestfolder/).
   Could Hudi be trying to list all the files under the parent prefix (/data)? Not sure, but could this be a reason?
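To confirm which directory the LIST calls actually target, the access-log entries can be grouped by their top-level prefix. A small sketch over hypothetical (operation, key) pairs; in practice these would be extracted from the full access log:

```python
from collections import Counter

# Hypothetical (operation, key) pairs from access-log entries.
requests = [
    ("REST.GET.BUCKET", "data/"),  # LIST against the parent prefix
    ("REST.GET.BUCKET", "data/"),
    ("REST.HEAD.OBJECT", "data/testfolder/.hoodie/hoodie.properties"),
]

def lists_by_prefix(reqs, depth=1):
    """Count LIST (GET Bucket) calls per top-level key prefix."""
    counts = Counter()
    for op, key in reqs:
        if op == "REST.GET.BUCKET":
            prefix = "/".join(key.split("/")[:depth]) + "/"
            counts[prefix] += 1
    return counts

print(lists_by_prefix(requests))  # Counter({'data/': 2})
```

If the counts concentrate on `/data/` rather than `/data/testfolder/`, that would support the parent-prefix-listing suspicion above.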
   
   a) What could be the reason for higher number of LIST operations happening? Is it possible to reduce them?
   
   b) Now that we have more or less established that Hive sync is the root cause of the problem, what could be the solution for us here? Any workaround? Would downgrading to a lower version help? Is there a particular stable EMR version you could suggest?
   
   c) How can we correlate the number of different API calls to the write operation? We are trying to measure the calls for writing 1 record to understand how it scales to 1000 records, but the current numbers do not show an obvious correlation. Is there any documentation or blog post that could be helpful here? We are evaluating the feasibility of using S3 as primary storage, and for that we need to understand this API call usage.




[GitHub] [hudi] nsivabalan commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1542742683

   Please close the github issue if we are good




[GitHub] [hudi] jenu9417 commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "jenu9417 (via GitHub)" <gi...@apache.org>.
jenu9417 commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1543243868

   @nsivabalan / @HEPBO3AH  Will check whether this issue is fixed in the new version.
   
   But I also wanted to understand the correlation between the various types of API calls (specifically LIST and HEAD) per write to one partition: for each write to one partition, how many GET, HEAD, PUT, and LIST operations happen? This will help us estimate costs for the project effectively.
   
   Can you please provide some insights here? Or any corresponding documentation?




[GitHub] [hudi] jenu9417 commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "jenu9417 (via GitHub)" <gi...@apache.org>.
jenu9417 commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1465110980

   @nsivabalan  Thanks for the update.
   
   `/data/testfolder` is the base path for the table. To clarify, below is the folder structure.
   ```
   /data/testfolder/
   /data/testfolder/.hoodie/
   /data/testfolder/.hoodie/.aux/
   /data/testfolder/.hoodie/.aux/.bootstrap/.fileids/
   /data/testfolder/.hoodie/.aux/.bootstrap/.partitions/
   /data/testfolder/.hoodie/.temp/
   /data/testfolder/.hoodie/.temp/20230303104616/
   /data/testfolder/.hoodie/archived/
   /data/testfolder/.hoodie/hoodie.properties
   ```
   
   There are no other non-Hudi folders present inside `/data/testfolder/`.
   And I'm seeing a lot of HEAD operations happening for `/data/` and `/data/testfolder/`.
   
   A few examples from the S3 access logs:
   
   LIST
   ```
   "GET /?prefix=repo%2Fsms_data_1_newtable_ind_mor%2F&delimiter=%2F&max-keys=2&encoding-type=url HTTP/1.1"
   "GET /?prefix=repo%2Fsms_data_1_newtable_ind_mor%2F.hoodie%2F&delimiter=%2F&max-keys=2&encoding-type=url HTTP/1.1"
   "GET /?prefix=repo%2Fsms_data_1_newtable_ind_mor%2F.hoodie%2F.aux%2F.bootstrap%2F.partitions%2F&delimiter=%2F&max-keys=2&encoding-type=url HTTP/1.1"
   "GET /?prefix=repo%2Fsms_data_1_newtable_ind_mor%2F&delimiter=%2F&max-keys=2&encoding-type=url HTTP/1.1"
   ```
   
   
   HEAD
   ```
   "HEAD /repo HTTP/1.1"
   "HEAD /repo/sms_data_1_newtable_ind_mor HTTP/1.1"
   "HEAD /repo/sms_data_1_newtable_ind_mor/.hoodie HTTP/1.1"
   "HEAD /repo/sms_data_1_newtable_ind_mor HTTP/1.1"
   ```
   Such requests repeat throughout the write operation.
   The major issue we face is the frequency of these API calls per write to one partition. We see around 100 LIST and 100 HEAD operations per write to one partition. Since LIST is the costlier operation, this volume of LIST calls per write makes the overall approach expensive.
   
   If we could understand the correlation between various types of API hits (specifically LIST and HEAD) per write to 1 partition, it will be helpful for us to decide.
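As an aside, the URL-encoded `prefix` parameters in the LIST log lines can be decoded to see exactly which directory each call scans, for example:

```python
from urllib.parse import unquote

# One of the prefix values from the LIST log lines above.
encoded = "repo%2Fsms_data_1_newtable_ind_mor%2F.hoodie%2F.aux%2F.bootstrap%2F.partitions%2F"
print(unquote(encoded))
# repo/sms_data_1_newtable_ind_mor/.hoodie/.aux/.bootstrap/.partitions/
```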




[GitHub] [hudi] HEPBO3AH commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "HEPBO3AH (via GitHub)" <gi...@apache.org>.
HEPBO3AH commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1540904661

   @jenu9417 Have you tried 0.13 to see if the fix is effective?
   @nsivabalan can this be ported to 0.12.x?




[GitHub] [hudi] Saksham-lumiq commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "Saksham-lumiq (via GitHub)" <gi...@apache.org>.
Saksham-lumiq commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1614206310

   @nsivabalan In Glue the latest supported version of Hudi is 0.12.1, so we can't switch to 0.13.1, and we can't disable Hive sync either. Is there any other way? Also, what is the change made in 0.13.0 that solves the higher HEAD request problem?




[GitHub] [hudi] jenu9417 commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

Posted by "jenu9417 (via GitHub)" <gi...@apache.org>.
jenu9417 commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1435921620

   Also, just noticed while analysing the S3 access logs:
   the number of BatchDeleteObject API calls is also far higher, even higher than the HEAD requests.
   ```
   BATCH.DELETE.OBJECT	data/
   BATCH.DELETE.OBJECT	data/testfolder/
   BATCH.DELETE.OBJECT	data/testfolder/.hoodie/
   BATCH.DELETE.OBJECT	data/testfolder/
   ```
   Such batch delete requests for the same paths are invoked repeatedly, roughly around 500 times for the ingestion of 1 record from Kafka to S3.
   Can you please help us understand why these HEAD and BatchDeleteObject requests are so high even for a 1-record ingestion?

