You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/16 18:00:30 UTC

[GitHub] [spark] davidnavas commented on issue #24616: [SPARK-27726] [Core] Fix performance of ElementTrackingStore deletes when using InMemoryStore under high loads

davidnavas commented on issue #24616: [SPARK-27726] [Core] Fix performance of ElementTrackingStore deletes when using InMemoryStore under high loads
URL: https://github.com/apache/spark/pull/24616#issuecomment-493171343
 
 
   Updated PR
   >>>
   At some point I thought about allowing "unordered views", which I think would also fix the main performance problem with the deletes (which is the copy + sort operation). But I kinda like your new interface (bar some naming nits).
   <<<
   
   Yes, that's where I started too -- let me share with you why I ended up with removeAll anyway...
   
   Setup:
   Define N as the number of entities for a particular type
   Define MAX as the maximum number of entities that the ElementStore is tasked to retain  (iirc, for stages, this is 1000)
   Define K as the number of elements being deleted in one go.
   
   So, first thing I did was to remove the sort if first==last on the view, which isn't correct in general, but certainly surfaces an unordered view.  The problem is that for N >> MAX, K ~= N, and the basic algorithm is still O(N^2) because we're filtering the whole list on each pass.
   
   Having got my head on straight, the second thing I did was to move the entire delete out of the stage loop.  Then I noticed that you had already done that for Tasks.  But I was annoyed by the implied sort that happens anyway, so I implemented a removeIf construct (attempt unordered view part 2).  This held up really well, until I performance tested with LevelDB.
   
   LevelDB works GREAT with the original implementation.  It's not so wonderful when N << MAX, because then the delete is O(N) instead of O(K).  I hate having to explain a performance regression on a performance PR, and that's why I moved to removeAll and separate implementations for each Store.  It's actually about half the speed of the removeIf construct for InMemory, probably because of all the wrappers being built around the indexValues (stage naturalKeys).  And of course, InMemoryStore still is O(N) and not O(K) on removal for N << MAX.  But it's 100x faster than LevelDB, and isn't anyway a regression in behavior.
   
   removeAll was definitely not my first go-to.  :>
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org