You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/02 14:21:33 UTC

[GitHub] [arrow] nirandaperera commented on pull request #10845: ARROW-13268 [C++][Compute] Add ExecNode for semi and anti-semi join

nirandaperera commented on pull request #10845:
URL: https://github.com/apache/arrow/pull/10845#issuecomment-891068825


   >     2. It seems to me that ConsumeCachedProbeBatches is only called for a single thread index - the one for the thread that reaches completion of build_counter_.
   > 
   Yes, thanks @michalursa. I missed this! 
   
   >     3. Because in StopProducing calls to Cancel() on two AtomicCounters are connected with ||, finished_.MarkFinished() can be called twice (first thread gets true from first counter Cancel() call and some other thread from the second Cancel() call). Also, shouldn't we always Cancel() both counters?
   > 
   I see... The `GroupByNode` had this,
   ```c++
       if (input_counter_.Cancel()) {
         finished_.MarkFinished();
       } else if (output_counter_.Cancel()) {
         finished_.MarkFinished();
       }
   ``` 
   and I was wondering why both the cases had the same code path. I thought it can be combined in a single statement. 
   
   So, do you mean to say that `finished_.MarkFinished()` should be called if `build_counter_.Cancel() && out_counter_.Cancel()`?
   
   >     4. I wonder what happens with an empty input on build side?
   > 
   This was my thought process. `build_counter_` has -1 for total initially. So, until the `build_input` signals the `InputFinished` with 0, probe batches will be cached. And when it receives 0, it toggles `build_side_complete_` and probe batches will be queried against an empty hashmap. 
   We could actually return a NullArray in the `Grouper::Find` method, prematurely (if the hashmap is empty). WDYT?
   
   >     5. I think we will only have one class for hash join. It should be fine to call it HashJoinNode and throw Status::NotImplemented() for join types outside of semi, anti-semi. JoinType enum could have elements from all other types as well. Also JoinType enum would be better as "enum class" although Arrow C++ probably has some policy about enums.
   > 
   Sure! 
   
   >     6. I would rename build_side_complete_ to hash_table_built_ or hash_table_build_complete_. Currently I get it confused with build_counter_ checks, where one means all build side input batches consumed by local state hash tables, and the other means hash table merge is complete.
   > 
   Sure!
   
   >     7. Also it would be nice to tie these two conditions above to futures, so that a merge task and tasks to process cache probe side batches could be generated and scheduled to execute once these futures are complete. But the futures are not critical at this point, just something nice to have.
   > 
   I will think about this one! :-)
   
   >     8. Status returned from CacheProbeBatch is always OK()
   > 
   I'll make this void!
   
   >     9. We probably don't support DictionaryArray in key columns in the code as it is right now, we should check and return Status::NotImplemented() when making hash join node (or make sure it works). Also there could be a scenario where one side of the join uses DictionaryArray while the other uses Array with the same underlying type for keys to compare.
   > 
   Sure!
   
   >     10. In BuildSideMerge() ARROW_DCHECK(state->grouper). Perhaps it is a copy-paste from group by node, but it would be good to have a comment about why it is not possible to have states 0 and 2 initialized but not 1. This is not obvious. And maybe it should just be relaxed to skip processing if the local thread state with a given index is not initialized.
   > 
   Yes, it is a copy from the GroupBy impl.
   Ah! Good catch! that is something I didnt think about! Are we talking about a case like this?
   Ex: 4 threads, but only 1 input batch. So, before/while other batches being initialized, thread0 receives the batch and calls `BuildSideMerge()`. Now, other states could have null, and ideally we could `continue` the loop if that is the case (because it is guaranteed that those states wouldn't receive any more batches, because build_counter_ is already completed.)
   
   >     11. TotalReached() method added to AtomicCounter is not used anywhere.
   > 
   >     12. There is a problem with null key. I believe in hash join with equality condition it should be that "null != null" (and there is usually a separate comparison operator that treats nulls as equal), while in group by "null==null" when matching groups. We should have a comment about it and document it for the users (maybe we don't have documentation strings for exec nodes yet). If needed we would have to filter out null keys separately from Grouper.
   > 
   I see... but it looks like Pandas holds null/NaN/na as a valid key and if the users want to, they have to explicitly drop na values. 
   https://stackoverflow.com/questions/23940181/pandas-merging-with-missing-values
   I started a thread on this in Zulip https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Null.20values.20as.20keys
   
   
   Thank you very 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org