Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/09/26 15:28:56 UTC

[GitHub] [hudi] rnatarajan edited a comment on issue #2083: Kafka readStream performance slow [SUPPORT]

rnatarajan edited a comment on issue #2083:
URL: https://github.com/apache/hudi/issues/2083#issuecomment-699508859


   Sorry I did not update the ticket in the past week.
   
   1. Yes that is correct.
   2. I tried with both Spark Structured Streaming and DStream. Since our source is Debezium/Kafka, we had to use foreachRDD to convert a few fields (days since epoch to date, unix time to timestamp), and then we write with df.write.format("hudi").....<few Hudi options that we had documented in the ticket>.mode(SaveMode.Append).save("<path>"); a sketch of this is included right after this list.
   3. I am attaching screenshots that show that count takes the most time in the case of bulk_insert, and countByKey in the case of insert.
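   
   A rough sketch of that foreachRDD conversion and Hudi write is below. This is illustrative only: the column names, table name, record key/precombine fields, and path are placeholders, not the exact values from our job (the real options are the ones documented earlier in this ticket).
   
   ```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}
   import org.apache.spark.sql.functions._
   
   // `kafkaStream` is assumed to be the DStream returned by KafkaUtils.createDirectStream.
   // Column names, table name, and option values below are placeholders.
   kafkaStream.foreachRDD { rdd =>
     val spark = SparkSession.builder().getOrCreate()
     import spark.implicits._
   
     // Debezium payload values arrive as JSON strings on the Kafka topic.
     val df = spark.read.json(spark.createDataset(rdd.map(_.value())))
       // Debezium encodes dates as days since epoch and timestamps as epoch seconds.
       .withColumn("order_date", expr("date_add(to_date('1970-01-01'), order_date_days)"))
       .withColumn("event_ts", col("event_ts_seconds").cast("timestamp"))
   
     df.write
       .format("hudi")
       .option("hoodie.table.name", "orders")
       .option("hoodie.datasource.write.recordkey.field", "order_id")
       .option("hoodie.datasource.write.precombine.field", "event_ts")
       .option("hoodie.datasource.write.operation", "insert") // or "bulk_insert"
       .mode(SaveMode.Append)
       .save("<path>")
   }
   ```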
   
   Structured Streaming had higher throughput, but with triggers in Spark I cannot post granular details, so I am using DStream to illustrate the throughput issue.
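   
   The two runs below differ only in the Hudi write operation option; a sketch of that switch (option key is the standard Hudi one, everything else elided):
   
   ```scala
   // Only difference between the two runs reported below.
   val operation = "insert"         // first set of screenshots
   // val operation = "bulk_insert" // second set of screenshots
   
   df.write
     .format("hudi")
     .option("hoodie.datasource.write.operation", operation)
     // ...remaining Hudi options as documented earlier in this ticket...
     .mode(SaveMode.Append)
     .save("<path>")
   ```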
   
   With the operation mode set to insert, each DStream batch was about 434,000 records.
   Processing the first batch took about 1.3 minutes; after that the processing time drops to about 37s.
   Attached are the details that narrow the bottleneck down to countByKey.
   The screenshots show that the first batch spent 53s in countByKey, whereas subsequent batches drop to about 28s.
   The time spent in countByKey stays around 28s for each batch of 434,000 records.
   In this case DStream achieves a peak throughput of about 12K rows per second (roughly 434,000 records / ~37s per batch).
   
   ![insert_time_taken_by_each_batch](https://user-images.githubusercontent.com/2908985/94343943-77272500-ffe1-11ea-8c46-65c7b39959d3.png)
   ![insert_for_each_rdd](https://user-images.githubusercontent.com/2908985/94343947-7bebd900-ffe1-11ea-8b6b-a802873d7d03.png)
   ![insert_dag_details](https://user-images.githubusercontent.com/2908985/94343951-7f7f6000-ffe1-11ea-8f76-7caedb49ff5c.png)
   ![insert_first_batch_bottleneck](https://user-images.githubusercontent.com/2908985/94343955-81e1ba00-ffe1-11ea-9fd7-1fe0c5c0c85a.png)
   ![insert_expand_bottleneck_in_first_batch](https://user-images.githubusercontent.com/2908985/94343959-860dd780-ffe1-11ea-8d63-80836cb92ff7.png)
   ![insert_subsequent_batch_bottleneck](https://user-images.githubusercontent.com/2908985/94343961-8ad28b80-ffe1-11ea-8890-0b11720fb460.png)
   
   
   
   With the operation mode set to bulk_insert, each DStream batch was about 434,000 records.
   Processing the first batch took about 1.2 minutes; after that the processing time drops to about 34s.
   Attached are the details that narrow the bottleneck down to count.
   The screenshots show that the first batch spent 56s in count, whereas subsequent batches drop to about 32s.
   The time spent in count stays around 32s for each batch of 434,000 records.
   In this case DStream achieves a peak throughput of about 12K rows per second.
   
   
   ![bulk_insert_time_taken_by_each_batch](https://user-images.githubusercontent.com/2908985/94343967-90c86c80-ffe1-11ea-9b75-a19b170f3a74.png)
   ![bulk_insert_for_each_rdd](https://user-images.githubusercontent.com/2908985/94343968-93c35d00-ffe1-11ea-8d72-47a868f87195.png)
   ![bulk_insert_dag_details](https://user-images.githubusercontent.com/2908985/94343971-96be4d80-ffe1-11ea-9f49-370d38619452.png)
   ![bulk_insert_first_batch_bottleneck](https://user-images.githubusercontent.com/2908985/94343972-9920a780-ffe1-11ea-9ae1-506f9c41b4ae.png)
   ![bulk_insert_expand_bottleneck_in_first_batch](https://user-images.githubusercontent.com/2908985/94343974-9c1b9800-ffe1-11ea-8b2a-5d317bb4495d.png)
   ![bulk_insert_subsequent_batch_bottleneck](https://user-images.githubusercontent.com/2908985/94343976-9f168880-ffe1-11ea-9661-275e2425aced.png)
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org