Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/06/15 12:43:05 UTC

[GitHub] [hudi] vinothchandar commented on issue #1728: Processing time gradually increases while using spark structured streaming

vinothchandar commented on issue #1728:
URL: https://github.com/apache/hudi/issues/1728#issuecomment-644111413


   @harishchanderramesh can you paste the entire stack trace for the exception? We can look into why that's happening. cc @bvaradar, who is currently hardening something around this.
   
   So, on to your performance issue: I see that you use

   > .option("hoodie.index.type","GLOBAL_BLOOM") \

   This means every run compares incoming keys against all files in the dataset, which is problematic if you want runtime to be proportional to the input size rather than the dataset size. Is there a way you can use the non-global version of the index?
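   For illustration, switching to the non-global Bloom index is a one-line config change. A minimal PySpark sketch (table name, key/partition fields, and path are placeholders, not taken from the original issue):

   ```python
   # Hudi writer options. With the non-global "BLOOM" index, key lookups are
   # scoped to each record's partition instead of scanning the whole table,
   # so upsert cost tracks the input batch size rather than the dataset size.
   hudi_options = {
       "hoodie.table.name": "my_table",                        # placeholder
       "hoodie.datasource.write.recordkey.field": "uuid",      # placeholder
       "hoodie.datasource.write.partitionpath.field": "date",  # placeholder
       "hoodie.index.type": "BLOOM",  # non-global, instead of GLOBAL_BLOOM
   }

   # In the streaming job, the options would be applied to the writer, e.g.:
   # df.write.format("hudi").options(**hudi_options).mode("append").save(path)
   ```

   Note the trade-off: the non-global index only enforces key uniqueness within a partition, so it fits when a record's partition path never changes.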
   
   Also, a few things:
   - Once 0.5.3 is out this week, you can try it anyway, since it has a bunch of perf fixes (no silver bullets for this problem, though).
   - If you can paste your Spark UI for 1-2 batches, we can see which cost increases per run.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org