Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/05/07 13:58:41 UTC

[GitHub] [incubator-hudi] reste85 opened a new issue #1598: [SUPPORT] Slow upsert time reading from Kafka

reste85 opened a new issue #1598:
URL: https://github.com/apache/incubator-hudi/issues/1598


   Hi all,
   We're experiencing strange issues using DeltaStreamer with Hudi version 0.5.2.
   We're reading from a Kafka source, specifically a compacted topic with 50 partitions. We partition via a custom KeyResolver that essentially mirrors Kafka's key placement (murmur3hash(recordKey) mod number_of_partitions); see the sketch after this paragraph.
   During the first three runs everything goes smoothly (each run ingests 5 million records). On the fourth run, the process suddenly slows down dramatically.
   Looking at the job stages, we saw that countByKey is the step taking too long, with low cluster usage/load (is it shuffling?).
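   For reference, here is a minimal sketch of that kind of key resolver, written as a plain Spark Partitioner using Guava's murmur3 (class name and details are illustrative, not our actual implementation):
   ```java
   import java.nio.charset.StandardCharsets;

   import com.google.common.hash.Hashing;
   import org.apache.spark.Partitioner;

   // Illustrative partitioner: place each record by murmur3(recordKey) mod numPartitions,
   // mirroring Kafka's key-based placement described above.
   public class KafkaLikePartitioner extends Partitioner {

     private final int numPartitions;

     public KafkaLikePartitioner(int numPartitions) {
       this.numPartitions = numPartitions;
     }

     @Override
     public int numPartitions() {
       return numPartitions;
     }

     @Override
     public int getPartition(Object key) {
       int hash = Hashing.murmur3_32()
           .hashString(key.toString(), StandardCharsets.UTF_8)
           .asInt();
       // floorMod keeps the partition index non-negative even for negative hashes
       return Math.floorMod(hash, numPartitions);
     }
   }
   ```
   Applied to a keyed RDD it would look like `pairRdd.partitionBy(new KafkaLikePartitioner(50))`, matching the 50 partitions of the topic.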
   
   Here are the Hudi properties we're using:
   ```properties
   # Hoodie properties
   hoodie.upsert.shuffle.parallelism=5
   hoodie.insert.shuffle.parallelism=5
   hoodie.bulkinsert.shuffle.parallelism=5
   hoodie.embed.timeline.server=true
   hoodie.filesystem.view.type=EMBEDDED_KV_STORE
   hoodie.compact.inline=false
   hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
   hoodie.clean.automatic=true
   hoodie.combine.before.upsert=true
   hoodie.cleaner.fileversions.retained=1
   hoodie.bloom.index.prune.by.ranges=false
   hoodie.index.bloom.num_entries=1000000
   ```
   
   Last run (the one that is taking too long):
   <img width="1662" alt="Screenshot 2020-05-07 at 15 32 24" src="https://user-images.githubusercontent.com/14905251/81303030-749c5100-907b-11ea-84c6-59bb10d2f48d.png">
   
   <img width="1664" alt="Screenshot 2020-05-07 at 15 32 32" src="https://user-images.githubusercontent.com/14905251/81303064-81b94000-907b-11ea-8ac3-b0b75b2be443.png">
   
   
   
   First, second and third runs (which went very well):
   <img width="1674" alt="firsrun" src="https://user-images.githubusercontent.com/14905251/81303100-8ed62f00-907b-11ea-8a82-79a171f32b31.png">
   <img width="1657" alt="secondrun" src="https://user-images.githubusercontent.com/14905251/81303109-91388900-907b-11ea-9ea7-dbf9ca4cf33a.png">
   <img width="1667" alt="thirdrun" src="https://user-images.githubusercontent.com/14905251/81303117-939ae300-907b-11ea-9c37-857d34cf492d.png">
   
   
   Thank you in advance!
   
   





[GitHub] [incubator-hudi] reste85 commented on issue #1598: [SUPPORT] Slow upsert time reading from Kafka

Posted by GitBox <gi...@apache.org>.
reste85 commented on issue #1598:
URL: https://github.com/apache/incubator-hudi/issues/1598#issuecomment-628519978


   PS: from the logs I can see that offsets are picked up correctly from each partition and that the total number of messages DeltaStreamer is trying to read is around 13 million (which is good).





[GitHub] [incubator-hudi] reste85 commented on issue #1598: [SUPPORT] Slow upsert time reading from Kafka

Posted by GitBox <gi...@apache.org>.
reste85 commented on issue #1598:
URL: https://github.com/apache/incubator-hudi/issues/1598#issuecomment-628462752


   It seems like it gets stuck while reading from Kafka. We have 113 million records in our compacted topic, and on every run we try to read 50 million messages. The first two runs worked like a charm; the third one gets stuck (i.e. when it's trying to read fewer than 50 million messages). If I remove the option "spark.network.timeout=500000" from the spark-submit conf, I get:
   
   "java.lang.IllegalArgumentException: requirement failed: Failed to get records for compacted spark-executor-topic_consumer topic-changelog-11 after polling for 310000"
   
   I'm trying to follow this post: https://stackoverflow.com/questions/42264669/spark-streaming-assertion-failed-failed-to-get-records-for-spark-executor-a-gro
   
   Using these properties in the Kafka consumer:
   spark.streaming.kafka.consumer.poll.ms=310000
   request.timeout.ms=30000
   max.poll.interval.ms=25000
   
   Still getting the same error.
   





[GitHub] [incubator-hudi] reste85 closed issue #1598: [SUPPORT] Slow upsert time reading from Kafka

Posted by GitBox <gi...@apache.org>.
reste85 closed issue #1598:
URL: https://github.com/apache/incubator-hudi/issues/1598


   





[GitHub] [incubator-hudi] reste85 edited a comment on issue #1598: [SUPPORT] Slow upsert time reading from Kafka

Posted by GitBox <gi...@apache.org>.
reste85 edited a comment on issue #1598:
URL: https://github.com/apache/incubator-hudi/issues/1598#issuecomment-625312293


   Just a note:
   We had 16 million records in the topic. With the 0.5.2-incubating version, DeltaStreamer reads 5 million records per iteration. The first three runs were OK (so we correctly ingested 15 million records). The last run seemed stuck (for 1.8 hours): no resource usage, no network usage, etc. So I asked for some new data to be pushed into the topic, and the job suddenly completed.
   Does this mean that the computation needs at least some minimum amount of data in Kafka? Does this depend on how KafkaRDDs are designed?
   
   Thank you!





[GitHub] [incubator-hudi] vinothchandar commented on issue #1598: [SUPPORT] Slow upsert time reading from Kafka

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1598:
URL: https://github.com/apache/incubator-hudi/issues/1598#issuecomment-625633172


   Just seeing this. Will triage tomorrow :) thanks for reporting.








[GitHub] [incubator-hudi] reste85 commented on issue #1598: [SUPPORT] Slow upsert time reading from Kafka

Posted by GitBox <gi...@apache.org>.
reste85 commented on issue #1598:
URL: https://github.com/apache/incubator-hudi/issues/1598#issuecomment-629988384


   Hi guys,
   As discussed in chat, this is not related to Hudi itself. At first sight we thought the problem was due to this: https://issues.apache.org/jira/browse/KAFKA-4753. In fact, it can be explained by the fact that we're using transactional producers: with transactional producers, Kafka offsets have a different meaning, and some of them are taken up by transaction control markers. That offset usage is probably what makes our consumers hang on the "last run", because the ending offset is never reached by an actual data record; see the sketch after the links below. Have a look at:
   https://stackoverflow.com/questions/59763422/in-my-kafka-topic-end-of-the-offset-is-higher-than-last-messagess-offset-numbe
   https://issues.apache.org/jira/browse/KAFKA-8358
   https://stackoverflow.com/questions/56182606/in-kafka-when-producing-message-with-transactional-consumer-offset-doubled-up
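   
   To illustrate the effect, here's a minimal sketch (plain Java KafkaConsumer, hypothetical broker address and consumer group) comparing the end offset reported by the consumer with the offset of the last actual data record; with transactional producers the two can differ because commit/abort markers also occupy offsets:
   ```java
   import java.time.Duration;
   import java.util.Collections;
   import java.util.Properties;

   import org.apache.kafka.clients.consumer.ConsumerConfig;
   import org.apache.kafka.clients.consumer.ConsumerRecord;
   import org.apache.kafka.clients.consumer.KafkaConsumer;
   import org.apache.kafka.common.TopicPartition;

   public class EndOffsetVsLastRecord {
     public static void main(String[] args) {
       Properties props = new Properties();
       props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
       props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-inspector");        // hypothetical group
       props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
           "org.apache.kafka.common.serialization.StringDeserializer");
       props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
           "org.apache.kafka.common.serialization.StringDeserializer");
       // read_committed skips aborted records but still stops at the last stable offset
       props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

       TopicPartition tp = new TopicPartition("topic-changelog", 11); // partition from the error above

       try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
         consumer.assign(Collections.singletonList(tp));
         long endOffset = consumer.endOffsets(Collections.singletonList(tp)).get(tp);

         // Seek a little before the end and read whatever data records remain
         consumer.seek(tp, Math.max(0, endOffset - 10));
         long lastRecordOffset = -1;
         for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(5)).records(tp)) {
           lastRecordOffset = rec.offset();
         }

         // With transactional producers the commit/abort markers also take offsets,
         // so endOffset can be greater than lastRecordOffset + 1.
         System.out.printf("endOffset=%d, last data record offset=%d%n", endOffset, lastRecordOffset);
       }
     }
   }
   ```
   A gap like this would explain a consumer that keeps polling without ever reaching the ending offset it was given.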
   
   You can close the issue, thank you for your support!
   




