Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/04/05 00:03:14 UTC

[GitHub] [spark] BryanCutler edited a comment on issue #24070: [SPARK-23961][PYTHON] Fix error when toLocalIterator goes out of scope

URL: https://github.com/apache/spark/pull/24070#issuecomment-480104456
 
 
   > Are we solving this problem on the "right" side? Is it common for the local iterator to go out of scope before being consumed?
   > If it is a relatively infrequent occurrence, would it possibly make sense to just make the Java code accept that the Python socket may go away?
   
   Thanks @holdenk, I totally agree. I started off with a fix that would catch the error on the Java side and just allow the connection to be broken without showing the error. Let me summarize the trade-offs that I noticed:
   
   **Catch/Ignore error in Java:**
   - No change to the protocol with Python; keeps things the same
   - This is the fastest way to iterate through all elements
   - Ignoring the error _might_ hide an actual error, but this is over a local socket, so once the connection is made I'm not sure how likely an error is
   - While Python is still consuming data from one partition, other jobs are started to collect the next partitions until writing to the socket blocks again. This is nice if all the data is used, but bad if it's not.
   
   **Synchronized Protocol (this PR):**
   - Allows the iterator to go out of scope and close the connection cleanly, without needing to catch/ignore errors.
   - Jobs are only triggered when the next element is needed, so the next partition is collected only after the last element of a partition is consumed. This matches the behavior of Scala's `toLocalIterator`.
   - Performance is a little slower because of the synchronization and because the next partition is no longer eagerly collected.
   - Makes the communication between Python and Java more complicated.
   
   I went with the second option here because I definitely didn't want to mask potential real errors, and triggering jobs only for data that is requested matches Scala and is good for some use cases. Also, while we don't want performance to suck, I think people would use `toLocalIterator` to work with constrained memory (1 partition at a time) and do a regular `collect()` if they want good performance.
   
   I'm not completely sure this is the best fix, so I'm happy to discuss :)
   
   Btw, from the JIRA for this, the user was calling `itertools.islice(..)` to take a slice of the local iterator, which wasn't fully consuming it. I would agree it is probably more common to consume the entire iterator, though.
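For reference, the partial-consumption situation from the JIRA is easy to reproduce with a plain generator standing in for `toLocalIterator` (the data and helper below are made up):

```python
import itertools

def local_iterator():
    # Stand-in for DataFrame.toLocalIterator(): yields rows one at a
    # time from two fake partitions (illustrative data only).
    for partition in ([1, 2, 3], [4, 5, 6]):
        yield from partition

rows = list(itertools.islice(local_iterator(), 2))
# Only the first two rows are pulled; the abandoned iterator then goes
# out of scope without being fully consumed, which is what triggered
# the error this PR fixes.
```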
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org