Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/01 14:04:21 UTC

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13281: ARROW-16685: Preserve order of columns in joins

jorisvandenbossche commented on code in PR #13281:
URL: https://github.com/apache/arrow/pull/13281#discussion_r886845637


##########
python/pyarrow/_exec_plan.pyx:
##########
@@ -259,13 +259,19 @@ def _perform_join(join_type, left_operand not None, left_keys,
         left_columns = []
     elif join_type == "inner":
         c_join_type = CJoinType_INNER
-        right_columns = set(right_columns) - set(right_keys)
+        right_columns = [
+            col for col in right_columns if col not in set(right_keys)

Review Comment:
   Not that it matters much, since the numbers are small anyway, but I was wondering about it: given that `right_keys` is typically a short list, converting it to a set only introduces more overhead.
   
   ```
   In [16]: right_columns = ["a", "b", "c", "d", "e", "f"]
   
   In [17]: right_keys = ["b", "c"]
   
   In [18]: [col for col in right_columns if col not in set(right_keys)]
   Out[18]: ['a', 'd', 'e', 'f']
   
   In [19]: %timeit [col for col in right_columns if col not in set(right_keys)]
   691 ns ± 3.92 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
   
   In [20]: %timeit [col for col in right_columns if col not in right_keys]
   353 ns ± 19.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
   ```
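
For reference, a minimal sketch of the two order-preserving variants. Part of the gap in the timings above is likely because the condition of a list comprehension is evaluated for every element, so `set(right_keys)` is rebuilt on each iteration. The helper names `drop_keys` and `drop_keys_hoisted_set` are hypothetical and only for illustration; they are not part of pyarrow.

```python
def drop_keys(columns, keys):
    """Return the columns that are not keys, preserving column order.

    Uses a plain list membership test, which is fine when `keys` is short.
    """
    return [col for col in columns if col not in keys]


def drop_keys_hoisted_set(columns, keys):
    """Same result, but builds the set once instead of once per column.

    Only likely to pay off when `keys` is long (hypothetical helper).
    """
    key_set = set(keys)  # built a single time, outside the comprehension
    return [col for col in columns if col not in key_set]


right_columns = ["a", "b", "c", "d", "e", "f"]
right_keys = ["b", "c"]

remaining = drop_keys(right_columns, right_keys)
remaining_hoisted = drop_keys_hoisted_set(right_columns, right_keys)
print(remaining)  # ['a', 'd', 'e', 'f']
```

Both variants keep the original column order, which is the point of the PR; which one is faster depends on how many keys there are, so it would be worth timing on realistic inputs before choosing.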


