You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/04/01 16:16:03 UTC

[GitHub] [spark] holdenk commented on issue #24070: [SPARK-23961][PYTHON] Fix error when toLocalIterator goes out of scope

holdenk commented on issue #24070: [SPARK-23961][PYTHON] Fix error when toLocalIterator goes out of scope
URL: https://github.com/apache/spark/pull/24070#issuecomment-478645478
 
 
   So before I jump into the details of this PR, from the design side I've got some questions I'd like to try and understand better:
   
   1) This description mentions that it may save "unneeded work" but not eagerly queueing jobs that aren't needed, but in the situation where the jobs are needed it seems like we're adding unnecessary synchronization that could slow this down.
   - Have you tried the benchmarks with a slow per-partition computation to see if there is more impact there?
   - Is this slow down "worth it", 20% on DataFrame is nothing to walk away from if we can avoid it.
   - If we're worried about the driver program falling over from buffering  too much data in the JVM can we maybe queue the next partition while we are serving the current one to decrease the overhead?
   
   2) Are we solving this problem on the "right" side? Is it common for the the localiterator to go out of scope before being consumed?
   - If it is is a relatively infrequent occurrence, would it possible make sense to just make the Java code accept that the Python socket may go away? I think that might keep the "happy" (or more frequent) path at reasonable performance. On the other hand if folks are using toLocalIterator to partially consume the data and this is the expected path, then using exception handling might not make sense.
   
   I'm going to dig into the code, but I'd love to get a better understanding of the design as well.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org