Posted to github@arrow.apache.org by "caiwanli (via GitHub)" <gi...@apache.org> on 2023/12/14 02:05:17 UTC

Re: [I] [C++] How to implement efficient Join between two RecordBatch? [arrow]

caiwanli commented on issue #39142:
URL: https://github.com/apache/arrow/issues/39142#issuecomment-1854993891

   My goal is to implement an in-memory query engine on top of the Arrow data format. Operators exchange data as RecordBatches (RB), which is straightforward for non-blocking operators. Join is harder: it needs random access to all of the data to build a hash table, and then probes that table to generate results. I am using the basic data structures provided by Arrow and do not want to use Acero, because I have my own thread management and resource control and do not want to add another layer just for the join. If operations on RecordBatches are efficient enough, I can implement the join myself. In fact, I have already implemented a radix hash join, but it is too slow.
   Most of the time is spent in the probing and result generation stage, because it involves merging two RecordBatches. Below is my merging code:
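   `// Appends the columns of left (except its join key) to right, in place.
   // right_col is kept for symmetry with the caller but is not used here.
   void join_two_record_batch(std::shared_ptr<arrow::RecordBatch>& left, 
                              int16_t left_col_nums, 
                              int16_t left_col, 
                              std::shared_ptr<arrow::RecordBatch>& right, 
                              int16_t right_col_nums, 
                              int16_t right_col) {
     auto schema = left->schema();
     int offset = right_col_nums;  // new columns go after the existing right columns
     for(int i = 0; i < left_col_nums; i++) {
       // Skip the left join key column (assumed intent), since right already carries the key.
       if(i == left_col) {
         continue;
       }
       auto field = schema->field(i);
       std::string field_name = "r." + field->name();
       auto column = left->column(i);
       // Each AddColumn call returns a new RecordBatch with one more column.
       right = right->AddColumn(offset, field_name, column).ValueOrDie();
       offset++;
     }
   }`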
   If you have better ideas, perhaps we can discuss them.
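
One direction that might be worth trying, sketched here under assumptions rather than taken from my implementation: have the probe phase record only pairs of matching row indices, and materialize each output column once at the end by gathering those rows. The sketch uses plain `std::vector<int64_t>` as a hypothetical stand-in for an Arrow column, and the names `Column`, `JoinIndices`, `hash_join_indices`, and `gather` are illustrative only. With Arrow, the gather step would presumably map to something like `arrow::compute::Take` on each column, and the output could be assembled in one `arrow::RecordBatch::Make` call instead of repeated `AddColumn` calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one int64 column of a RecordBatch.
using Column = std::vector<int64_t>;

// Matching row indices: left_rows[k] joins with right_rows[k].
struct JoinIndices {
  std::vector<int64_t> left_rows;
  std::vector<int64_t> right_rows;
};

// Build a hash table on the left key column, then probe it with the right
// key column. Only row indices are recorded; no column data is copied here.
JoinIndices hash_join_indices(const Column& left_keys, const Column& right_keys) {
  std::unordered_multimap<int64_t, int64_t> table;
  table.reserve(left_keys.size());
  for (std::size_t i = 0; i < left_keys.size(); ++i) {
    table.emplace(left_keys[i], static_cast<int64_t>(i));
  }

  JoinIndices out;
  for (std::size_t j = 0; j < right_keys.size(); ++j) {
    auto range = table.equal_range(right_keys[j]);
    for (auto it = range.first; it != range.second; ++it) {
      out.left_rows.push_back(it->second);
      out.right_rows.push_back(static_cast<int64_t>(j));
    }
  }
  return out;
}

// Materialize one output column by gathering the selected rows in one pass.
Column gather(const Column& column, const std::vector<int64_t>& rows) {
  Column out;
  out.reserve(rows.size());
  for (int64_t r : rows) {
    assert(r >= 0 && static_cast<std::size_t>(r) < column.size());
    out.push_back(column[static_cast<std::size_t>(r)]);
  }
  return out;
}
```

The idea is to call `hash_join_indices` once per pair of batches and then `gather` once per payload column, using `left_rows` for left columns and `right_rows` for right columns, so each output column is written exactly once. Whether this is faster than the `AddColumn` loop would need to be measured.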

