
Posted to commits@hudi.apache.org by "VitoMakarevich (via GitHub)" <gi...@apache.org> on 2023/01/23 14:23:20 UTC

[GitHub] [hudi] VitoMakarevich opened a new issue, #7734: [SUPPORT] Increased rate of object storage calls after upgrade from 0.11.0 to 0.12.1

VitoMakarevich opened a new issue, #7734:
URL: https://github.com/apache/hudi/issues/7734

   **Describe the problem you faced**
   
   Hello. We recently upgraded Hudi from 0.11.0 to 0.12.1 and have seen performance degradation since then. We have no clear reproduction yet, so for now we want to verify what we actually observe. One observation is that our S3 request rates grew significantly (by a few orders of magnitude). Only HEAD/GET counts increased; the rest (POST/LIST/DELETE) look the same, and so do the bytes downloaded. I'm now checking which calls are most frequent (we cannot compare against the old version because we did not collect data that granular before). I suspect a bloom-filter issue that causes the same data to be loaded over and over, but I'm not familiar enough with the internals to be sure. I also suspected failed tasks, but we have relatively few of them (as we did before).
   <img width="1345" alt="image" src="https://user-images.githubusercontent.com/15978165/214061278-59628cd8-9106-46c0-969c-4198fb33b877.png">
   Our spark settings are
   ```
           "hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
           "hoodie.datasource.write.recordkey.field" = "hkey"
           "hoodie.datasource.write.precombine.field" = "hkey"
           "hoodie.datasource.write.partitionpath.field" = "root_account_uuid"
           "hoodie.datasource.write.drop.partition.columns" = "true"
           "hoodie.datasource.write.hive_style_partitioning" = "true"
           "hoodie.finalize.write.parallelism" = "200"
           "hoodie.upsert.shuffle.parallelism" = "200"
           "hoodie.insert.shuffle.parallelism" = "200"
           "hoodie.bulkinsert.shuffle.parallelism" = "200"
           "hoodie.compact.inline" = "false"
           "hoodie.clean.automatic" = "true"
           "hoodie.cleaner.policy" = "KEEP_LATEST_BY_HOURS"
           "hoodie.cleaner.hours.retained" = "12"
           "hoodie.cleaner.commits.retained" = "180"
           "hoodie.metadata.cleaner.commits.retained" = "180"
           "hoodie.keep.min.commits" = "200"
           "hoodie.keep.max.commits" = "240"
           "hoodie.clustering.inline" = "false"
           "hoodie.clustering.inline.max.commits" = "4"
           "hoodie.clustering.plan.strategy.target.file.max.bytes" = "1073741824"
           "hoodie.clustering.plan.strategy.small.file.limit" = "629145600"
           "hoodie.metadata.enable" = "false"
           "hoodie.metadata.keep.min.commits" = "12"
           "hoodie.metadata.keep.max.commits" = "24"
           "hoodie.datasource.compaction.async.enable" = "false"
           "hoodie.write.markers.type" = "DIRECT"
           "hoodie.embed.timeline.server" = "true"
           "hoodie.index.type" = "BLOOM"
           "hoodie.bloom.index.update.partition.path" = "true"
           "hoodie.compact.inline.max.delta.seconds" = "7200"
           "hoodie.compact.inline.trigger.strategy" = "TIME_ELAPSED"
           "hoodie.copyonwrite.insert.split.size" = "50000"
           "hoodie.bloom.index.prune.by.ranges" = "true"
           "hoodie.memory.merge.max.size" = "8589934592"
           "hoodie.datasource.write.insert.drop.duplicates" = "false"
           "hoodie.metrics.on" = "true"
           "hoodie.metrics.reporter.type" = "JMX"
           "hoodie.datasource.hive_sync.partition_fields" = "root_account_uuid"
           "hoodie.datasource.hive_sync.mode" = "hms"
           "hoodie.datasource.hive_sync.enable" = "true"
           "hoodie.datasource.hive_sync.database" = "${glue_database}"
   ```
   
   Are you aware of any degradation like this?
   
   **To Reproduce**
   
   
   **Expected behavior** 
   These metrics should stay the same
   
   **Environment Description** We upgraded from EMR 6.7(hudi 0.11.0) to EMR 6.9(0.12.1)
   
   * Hudi version : 0.12.1
   
   * Spark version : 3.3.0
   
   * Hive version : - 
   
   * Hadoop version : 3.3.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on issue #7734: [SUPPORT] Increased rate of object storage calls after upgrade from 0.11.0 to 0.12.1

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on issue #7734:
URL: https://github.com/apache/hudi/issues/7734#issuecomment-1642882763

   Hey @VitoMakarevich : I guess we found the root cause. Can you update the ticket and close it out as applicable? 



[GitHub] [hudi] alexeykudinkin commented on issue #7734: [SUPPORT] Increased rate of object storage calls after upgrade from 0.11.0 to 0.12.1

Posted by "alexeykudinkin (via GitHub)" <gi...@apache.org>.

alexeykudinkin commented on issue #7734:
URL: https://github.com/apache/hudi/issues/7734#issuecomment-1402662656

   @VitoMakarevich would it be possible for you to provide us with the logs? Feel free to redact all the sensitive information. 
   
   This would allow us to greatly speed up the investigation here.



[GitHub] [hudi] yihua commented on issue #7734: [SUPPORT] Increased rate of object storage calls after upgrade from 0.11.0 to 0.12.1

Posted by "yihua (via GitHub)" <gi...@apache.org>.

yihua commented on issue #7734:
URL: https://github.com/apache/hudi/issues/7734#issuecomment-1402667042

   @VitoMakarevich if you're around today, we can also do a live debugging session.  Are you in the Hudi OSS channel?



[GitHub] [hudi] nsivabalan commented on issue #7734: [SUPPORT] Increased rate of object storage calls after upgrade from 0.11.0 to 0.12.1

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on issue #7734:
URL: https://github.com/apache/hudi/issues/7734#issuecomment-1404081333

   @VitoMakarevich : Can we sync up via the general Slack channel in the Apache Hudi workspace? I would like to get more clarity on the scenario. 
   If my understanding is right, there are two Hudi tables in play here, TableA and TableB. TableB is populated by running a snapshot query on TableA and filtering on top (from what you described, you are not leveraging incremental queries; curious to understand why, though). So, in this pipeline, you are seeing an uptick in GET and HEAD calls with 0.12.1 compared to 0.11.0 (without any metadata table). Do you happen to have separate dashboards for requests to TableA vs TableB? 
   
   You also have commits and cleaning going on, and you are not sure which one is playing a part here. Can you disable cleaning for a few commits and check whether you see a similar trend? 
   
   The commit of interest has 4GB of data ingested: 55k records (~80k record size), of which 38k are updates and the rest are inserts, i.e. ~70% updates. 
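
As a quick sanity check on the figures quoted above, the arithmetic can be sketched as follows. This is only an illustration: the variable names are hypothetical, and reading "4GB" as 4 * 10^9 bytes is an assumption (a binary reading would give a slightly larger per-record size).

```python
# Figures quoted in the comment above.
total_records = 55_000
updates = 38_000
ingested_bytes = 4 * 1000**3  # assumption: "4GB" read as 4 * 10^9 bytes

inserts = total_records - updates
update_ratio = updates / total_records
avg_record_bytes = ingested_bytes / total_records

print(f"inserts: {inserts}")
print(f"update ratio: {update_ratio:.1%}")
print(f"average record size: {avg_record_bytes / 1000:.1f} KB")
```

Under these assumptions the insert count matches the 17k reported elsewhere in the thread, and the update share and per-record size come out close to the rounded ~70% and ~80k figures.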
   
   Let us know how we can sync up via slack to investigate this more. 
   



[GitHub] [hudi] yihua commented on issue #7734: [SUPPORT] Increased rate of object storage calls after upgrade from 0.11.0 to 0.12.1

Posted by "yihua (via GitHub)" <gi...@apache.org>.

yihua commented on issue #7734:
URL: https://github.com/apache/hudi/issues/7734#issuecomment-1402436857

   Hi @VitoMakarevich thanks for reporting this issue.  Do you query this table using any query engine other than Spark?  And do you see the surge of HEAD/GET requests from the ingestion/write job only?
   
   If you could enable S3 request logs by setting log4j.logger.com.amazonaws.request=DEBUG in log4j properties file and the following Spark configs, that would really help us understand where most requests originate.
   
   ```
   --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:/<path>/s3-debug.log4j.properties"   --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:/<path>/s3-debug.log4j.properties"
   ```
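
Once request logging is enabled, a small script can help summarize where the requests go. The following is a minimal sketch, not part of Hudi or the AWS SDK: it assumes each request shows up as a log line containing `Sending Request: <METHOD> <endpoint> <path>`, and the sample lines, bucket name, and function names are hypothetical. The pattern may need adjusting to the actual log output.

```python
import re
from collections import Counter

# Assumed shape of an AWS SDK request debug line; adjust to match your logs.
REQUEST_PATTERN = re.compile(
    r"Sending Request: (?P<method>[A-Z]+) (?P<endpoint>\S+) (?P<path>\S+)"
)


def count_requests(lines):
    """Count requests per HTTP method and per (method, object path)."""
    by_method = Counter()
    by_path = Counter()
    for line in lines:
        match = REQUEST_PATTERN.search(line)
        if match is None:
            continue  # not a request line
        by_method[match["method"]] += 1
        by_path[(match["method"], match["path"])] += 1
    return by_method, by_path


# Hypothetical sample lines, for illustration only.
sample_lines = [
    "DEBUG request: Sending Request: GET https://example-bucket.s3.amazonaws.com /table/part=a/file1.parquet Headers: (Range: bytes=0-7, )",
    "DEBUG request: Sending Request: HEAD https://example-bucket.s3.amazonaws.com /table/part=a/file1.parquet Headers: ()",
    "DEBUG request: Sending Request: GET https://example-bucket.s3.amazonaws.com /table/part=a/file1.parquet Headers: (Range: bytes=8-99, )",
    "INFO  SomeOtherLogger: unrelated line",
]

by_method, by_path = count_requests(sample_lines)
print(by_method.most_common())
print(by_path.most_common(5))
```

Sorting `by_path` by count is one way to spot objects that are fetched repeatedly within a single job run.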



[GitHub] [hudi] kasured commented on issue #7734: [SUPPORT] Increased rate of object storage calls after upgrade from 0.11.0 to 0.12.1

Posted by "kasured (via GitHub)" <gi...@apache.org>.

kasured commented on issue #7734:
URL: https://github.com/apache/hudi/issues/7734#issuecomment-1428977425

   As discussed the root cause looks to be the same as in https://github.com/apache/hudi/issues/7844



[GitHub] [hudi] VitoMakarevich commented on issue #7734: [SUPPORT] Increased rate of object storage calls after upgrade from 0.11.0 to 0.12.1

Posted by "VitoMakarevich (via GitHub)" <gi...@apache.org>.

VitoMakarevich commented on issue #7734:
URL: https://github.com/apache/hudi/issues/7734#issuecomment-1402495885

Thank you for looking into it! We have a few flows, but let me describe the one I'm debugging now. A Hudi table is populated by a streaming Spark job; a second (batch) job then runs every N hours and reads all updates since the previous offset by loading the Hudi snapshot and applying a filter (the same condition Hudi would use, from the previous offset to now). Our job has bloom index range pruning on, our target data file size is 128 MB, and we are running a CDC workload.
I debugged one particular run using the S3 logs, as you suggested (I had thought of doing the same initially). The run covered 2 commits (a clean and a commit). During that run (filtered by time), the job issued GET requests to 296 unique files (this probably includes every kind of file: markers/commitline/data/etc.), ~30k GET requests in total, for what was in fact 38k updates and 17k inserts. Since all GET calls are range requests, I calculated that of those 30k requests, 25k were smaller than 100 KB and 1.5k were 100-200 KB.
Let me know if you need any additional information. In the meantime I'll continue searching and will post here any details I consider important.
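
The size breakdown above can be reproduced along these lines. This is a minimal sketch, assuming the `Range` header values have already been extracted from the logs as strings of the form `bytes=start-end`; the function names, the bucket boundaries in KiB, and the sample values are hypothetical.

```python
import re
from collections import Counter

KB = 1024  # assumption: bucket boundaries measured in KiB
RANGE_PATTERN = re.compile(r"bytes=(\d+)-(\d+)")


def range_size(range_header):
    """Return the byte count of a closed 'bytes=start-end' range, or None."""
    match = RANGE_PATTERN.fullmatch(range_header.strip())
    if match is None:
        return None  # open-ended or malformed range
    start, end = int(match.group(1)), int(match.group(2))
    return end - start + 1  # HTTP byte ranges are inclusive on both ends


def bucket_label(size_bytes):
    """Map a request size to the buckets used in the comment above."""
    if size_bytes < 100 * KB:
        return "<100 KB"
    if size_bytes < 200 * KB:
        return "100-200 KB"
    return ">=200 KB"


def size_histogram(range_headers):
    """Count range requests per size bucket."""
    counts = Counter()
    for header in range_headers:
        size = range_size(header)
        counts["unparsed" if size is None else bucket_label(size)] += 1
    return counts


# Hypothetical sample values, for illustration only.
sample_ranges = ["bytes=0-7", "bytes=1000-60000", "bytes=0-153599",
                 "bytes=0-1048575", "bytes=100-"]
histogram = size_histogram(sample_ranges)
print(dict(histogram))
```

A histogram dominated by the smallest bucket, as reported here, would suggest many small reads rather than full data file scans.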


[GitHub] [hudi] VitoMakarevich commented on issue #7734: [SUPPORT] Increased rate of object storage calls after upgrade from 0.11.0 to 0.12.1

Posted by "VitoMakarevich (via GitHub)" <gi...@apache.org>.

VitoMakarevich commented on issue #7734:
URL: https://github.com/apache/hudi/issues/7734#issuecomment-1402552828

   Update: I noticed the Hudi log line `Read bloom filter from `, which looked suspicious to me, so I checked it: the line appears exactly once for each partition/file.
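
A check like this can be scripted. The sketch below is an assumption-laden illustration: it assumes the text after `Read bloom filter from ` identifies the partition/file, optionally followed by a timing suffix such as ` in 12 ms`, and the sample lines and function names are hypothetical.

```python
import re
from collections import Counter

PREFIX = "Read bloom filter from "
# Assumed optional timing suffix; strip it so repeated reads compare equal.
TIMING_SUFFIX = re.compile(r"\s+in \d+ ms$")


def bloom_reads_per_target(lines):
    """Count 'Read bloom filter from' log lines per partition/file."""
    counts = Counter()
    for line in lines:
        _, found, rest = line.partition(PREFIX)
        if found:
            counts[TIMING_SUFFIX.sub("", rest.strip())] += 1
    return counts


def repeated_reads(counts):
    """Return only the targets whose bloom filter was read more than once."""
    return {target: n for target, n in counts.items() if n > 1}


# Hypothetical sample lines, for illustration only.
sample_lines = [
    "INFO Read bloom filter from (part=a, file1) in 12 ms",
    "INFO Read bloom filter from (part=a, file2) in 9 ms",
    "INFO Read bloom filter from (part=a, file1) in 30 ms",
    "INFO unrelated line",
]

counts = bloom_reads_per_target(sample_lines)
print(repeated_reads(counts))
```

An empty result from `repeated_reads` corresponds to the observation above that each partition/file is read only once.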



[GitHub] [hudi] VitoMakarevich commented on issue #7734: [SUPPORT] Increased rate of object storage calls after upgrade from 0.11.0 to 0.12.1

Posted by "VitoMakarevich (via GitHub)" <gi...@apache.org>.

VitoMakarevich commented on issue #7734:
URL: https://github.com/apache/hudi/issues/7734#issuecomment-1403299851

   @alexeykudinkin I think it's possible, I'll ask our security team about which details should be redacted.
   @yihua Yes, it's possible. Just registered with the email `vitali.makarevich@instructure.com`



Re: [I] [SUPPORT] Increased rate of object storage calls after upgrade from 0.11.0 to 0.12.1 [hudi]

Posted by "VitoMakarevich (via GitHub)" <gi...@apache.org>.

VitoMakarevich closed issue #7734: [SUPPORT] Increased rate of object storage calls after upgrade from 0.11.0 to 0.12.1
URL: https://github.com/apache/hudi/issues/7734

