You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/20 03:20:11 UTC

[GitHub] [arrow] vibhatha commented on a change in pull request #12110: ARROW-15212: [C++] Handle suffix argument in joins

vibhatha commented on a change in pull request #12110:
URL: https://github.com/apache/arrow/pull/12110#discussion_r788317883



##########
File path: cpp/src/arrow/compute/exec/hash_join_node.cc
##########
@@ -275,30 +275,56 @@ Status HashJoinSchema::ValidateSchemas(JoinType join_type, const Schema& left_sc
 }
 
 std::shared_ptr<Schema> HashJoinSchema::MakeOutputSchema(
-    const std::string& left_field_name_prefix,
-    const std::string& right_field_name_prefix) {
+    const std::string& left_field_name_suffix,
+    const std::string& right_field_name_suffix) {
   std::vector<std::shared_ptr<Field>> fields;
   int left_size = proj_maps[0].num_cols(HashJoinProjection::OUTPUT);
   int right_size = proj_maps[1].num_cols(HashJoinProjection::OUTPUT);
   fields.resize(left_size + right_size);
 
-  for (int i = 0; i < left_size + right_size; ++i) {
-    bool is_left = (i < left_size);
-    int side = (is_left ? 0 : 1);
-    int input_field_id = proj_maps[side]
-                             .map(HashJoinProjection::OUTPUT, HashJoinProjection::INPUT)
-                             .get(is_left ? i : i - left_size);
+  std::unordered_multimap<std::string, int> left_field_map;

Review comment:
       You're making a very important point. One thing, I wanted to do was, to make sure I use the correct order, because I want to bind the column name with the corresponding iterator index. That's the main reason why I selected a `multimap` leaving space for multiple keys too. But again, thinking about this, do we allow duplicate column names in a schema? I might have missed this even though I wrote it like this. 
   
   `equal_range` needs to be there and I will correct it. 
   
   By the way, I also assumed O(m*n) must be heavy if we have columns with 1000s of columns, that's why the code required that map in the first place. Is there a better way to optimize it? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org