You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "bersprockets (via GitHub)" <gi...@apache.org> on 2023/07/22 20:08:23 UTC

[GitHub] [spark] bersprockets commented on pull request #41712: [SPARK-44132][SQL] Materialize `Stream` of join column names to avoid codegen failure

bersprockets commented on PR #41712:
URL: https://github.com/apache/spark/pull/41712#issuecomment-1646662544

   @steven-aerts 
   
   I took a closer look at
   
   https://github.com/apache/spark/blob/24960d84a9dac17728822f3e783335f221c49da3/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/JoinCodegenSupport.scala#L82
   
   This map operation is the culprit. The function passed to the map operation relies on mutable state that's manipulated outside of the function (`ctx.currentVars` and `ctx.INPUT_ROW`). If the map operation runs immediately, the state of those fields are appropriate for the function. If the map operation runs later (or runs immediately for the first element, and then completes sometime later), then the state of those fields may not be correct for proper operation of the function, and in fact are not correct for your example cases.
   
   I confirmed this by adding debug statements to the function indicating whether those two fields in `ctx` are as expected. In the `Stream` case, they are not as expected in the second join when processing the last element of `plan.output`. In the `IndexedSeq` case, the fields are always as expected.
   
   Personally, I would fix it here by changing the line to `plan.output.toIndexedSeq.zipWithIndex.map... etc.`. That would guarantee that all paths to this code (not just those coming from the Dataset APIs) would be handled properly. However, per [this comment](https://github.com/apache/spark/pull/35635#discussion_r836062422), a committer thought it better to fix this sort of issue in the public APIs. Either works for me.
   
   By the way you have conflict in JoinSuite, so you need to rebase.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org