You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/30 21:36:58 UTC

[GitHub] [arrow] michalursa commented on pull request #10845: ARROW-13268 [C++][Compute] Add ExecNode for semi and anti-semi join

michalursa commented on pull request #10845:
URL: https://github.com/apache/arrow/pull/10845#issuecomment-890166609


   1. The comment about UINT32_MAX probably needs to be updated
   
   2. It seems to me that ConsumeCachedProbeBatches is only called for a single thread index - the one for the thread that reaches completion of build_counter_.
   
   3. Because in StopProducing calls to Cancel() on two AtomicCounters are connected with ||, finished_.MarkFinished() can be called twice (first thread gets true from first counter Cancel() call and some other thread from the second Cancel() call). Also, shouldn't we always Cancel() both counters?
   
   4. I wonder what happens with an empty input on build side?
   
   5. I think we will only have one class for hash join. It should be fine to call it HashJoinNode and throw Status::NotImplemented() for join types outside of semi, anti-semi. JoinType enum could have elements from all other types as well. Also JoinType enum would be better as "enum class" although Arrow C++ probably has some policy about enums.
   
   6. I would rename build_side_complete_ to hash_table_built_ or hash_table_build_complete_. Currently I get it confused with build_counter_ checks, where one means all build side input batches consumed by local state hash tables, and the other means hash table merge is complete.
   
   7. Also it would be nice to tie these two conditions above to futures, so that a merge task and tasks to process cache probe side batches could be generated and scheduled to execute once these futures are complete. But the futures are not critical at this point, just something nice to have.
   
   8. Status returned from CacheProbeBatch is always OK()
   
   9. We probably don't support DictionaryArray in key columns in the code as it is right now, we should check and return Status::NotImplemented() when making hash join node (or make sure it works). Also there could be a scenario where one side of the join uses DictionaryArray while the other uses Array with the same underlying type for keys to compare.
   
   10. In BuildSideMerge() ARROW_DCHECK(state->grouper). Perhaps it is a copy-paste from group by node, but it would be good to have a comment about why it is not possible to have states 0 and 2 initialized but not 1. This is not obvious. And maybe it should just be relaxed to skip processing if the local thread state with a given index is not initialized.
   
   11. TotalReached() method added to AtomicCounter is not used anywhere.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org