Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/08/01 00:19:13 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #25321: [SPARK-28153][PYTHON][BRANCH-2.4] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

URL: https://github.com/apache/spark/pull/25321
 
 
   ## What changes were proposed in this pull request?
   
   This PR backports https://github.com/apache/spark/pull/24958 to branch-2.4.
   
   This PR proposes to use `AtomicReference` so that parent and child threads can access the same file block holder.
   
   Python UDF expressions are turned into a plan, which then launches a separate thread to consume the input iterator. In that child thread, the iterator calls `InputFileBlockHolder.set` before the parent thread does, and the parent thread is later unable to read the value that was set.
   
   1. If this child thread happens to call `InputFileBlockHolder.set` first, before the parent's thread local has been initialized (which happens on the first call to `ThreadLocal.get()`), the child thread seems to call its own `initialValue` to initialize its copy.
   
   2. After that, the parent calls its own `initialValue` to initialize its copy at its first call of `ThreadLocal.get()`.
   
   3. The two threads now hold two different references, so an update made in the child is not reflected in the parent.
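
The three steps above can be reproduced outside Spark with a minimal Java sketch. It assumes the holder is backed by an `InheritableThreadLocal`, which this description does not state explicitly; the class name `LostUpdateDemo` and the file name are hypothetical stand-ins and not Spark code.

```java
class LostUpdateDemo {
    // Returns what the parent thread reads after the child has set a value.
    static String run() {
        // Hypothetical stand-in for the file block holder (assumption: an
        // InheritableThreadLocal whose default comes from initialValue()).
        InheritableThreadLocal<String> holder = new InheritableThreadLocal<String>() {
            @Override
            protected String initialValue() {
                return "";
            }
        };

        // The parent has NOT touched the holder yet, so the child inherits nothing.
        Thread child = new Thread(() -> holder.set("part-00000.txt"));
        child.start();
        try {
            child.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // The parent's first get() triggers its own initialValue(), so the
        // child's update is invisible here.
        return holder.get();
    }

    public static void main(String[] args) {
        System.out.println("parent sees: '" + run() + "'");
    }
}
```

In this sketch the parent reads the empty default rather than the file name set by the child, which mirrors the behavior described in step 3.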
   
   This PR fixes it by initializing the parent's thread local with an `AtomicReference` for the file status, so that it can be used in each task and a child thread's update is reflected in the parent.
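
The idea behind the fix can be illustrated with a small Java sketch. It is not the actual patch: it assumes the holder is an `InheritableThreadLocal` wrapping an `AtomicReference`, and the class name `SharedHolderDemo` and the file name are hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

class SharedHolderDemo {
    // Returns what the parent thread reads after the child has set a value.
    static String run() {
        // Hypothetical stand-in for the file block holder: the thread local
        // stores a mutable AtomicReference instead of the value itself.
        InheritableThreadLocal<AtomicReference<String>> holder =
            new InheritableThreadLocal<AtomicReference<String>>() {
                @Override
                protected AtomicReference<String> initialValue() {
                    return new AtomicReference<>("");
                }
            };

        // The parent initializes its thread local BEFORE creating the child,
        // so the child inherits a reference to the same AtomicReference.
        holder.get();

        // The child mutates the shared AtomicReference rather than replacing
        // its own thread-local entry.
        Thread child = new Thread(() -> holder.get().set("part-00000.txt"));
        child.start();
        try {
            child.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // Parent and child hold the same object, so the update is visible.
        return holder.get().get();
    }

    public static void main(String[] args) {
        System.out.println("parent sees: '" + run() + "'");
    }
}
```

The key design point in this sketch is that the child never calls `set` on the thread local itself; it only updates the object that both threads already share.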
   
   I also tried to explain this a bit more at https://github.com/apache/spark/pull/24958#discussion_r297203041.
   
   ## How was this patch tested?
   
   Manually tested, and a unit test was added.
   
   
