You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/03 16:55:51 UTC

[GitHub] [arrow] rok opened a new pull request, #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

rok opened a new pull request, #13501:
URL: https://github.com/apache/arrow/pull/13501

   This is to resolve [ARROW-16695](https://issues.apache.org/jira/browse/ARROW-16695).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
rok commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1218472052

   Thanks for the explanation @michalursa! That makes a lot of sense!
   
   @westonpace could I ask you or someone in vicinity to this code for review? :)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
rok commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1215134706

   I'm not sure who best to ask for review here. I'm approaching this very naively and am afraid I'm missing something.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on a diff in pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
rok commented on code in PR #13501:
URL: https://github.com/apache/arrow/pull/13501#discussion_r949492718


##########
python/pyarrow/tests/test_exec_plan.py:
##########
@@ -28,6 +28,24 @@
 pytestmark = pytest.mark.dataset
 
 
+class UuidType(pa.PyExtensionType):

Review Comment:
   Oh yeah good call. I was actually not using `UuidType`. Switched to importing the type from `test_extension_type.py`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
rok commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1220076972

   Huh, [python and C++ on s390x get the physical type instead of extension](https://app.travis-ci.com/github/apache/arrow/jobs/580252967#L3789).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1235891475

   That seems reasonable.  Feel free to merge.  You're basically working on the physical type and saving off the extension type to reattach later?
   
   I think, longer term, we might want to consider doing this at the plan level.  So nodes never have to know about the existence of extension types, and we simply attach them, as needed, in the sink.  Might want to wait until we have a few compute functions for extension types first though so there is a bit more example.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
rok commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1235901940

   > That seems reasonable. Feel free to merge. You're basically working on the physical type and saving off the extension type to reattach later?
   
   Yeah exactly.
   
   > I think, longer term, we might want to consider doing this at the plan level. So nodes never have to know about the existence of extension types, and we simply attach them, as needed, in the sink. Might want to wait until we have a few compute functions for extension types first though so there is a bit more example.
   
   Huh, wouldn't there be cases where extensions would be relevant to the kernels? E.g. geographical coordinates in a certain projection (where extension would contain the projection) and [calculating haversine distances](https://en.wikipedia.org/wiki/Haversine_formula).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1236110244

   Benchmark runs are scheduled for baseline = 672431b5146e078406c9295e873705db553fa115 and contender = 50a7d15dfb4cbc4dd449ff2bb3ba2b1cde62a3ab. 50a7d15dfb4cbc4dd449ff2bb3ba2b1cde62a3ab is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/9289be81c2314a589413c98470792438...87366054dcfd410899b66d409a906fa2/)
   [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/925437749bdc48b7a7c5ae71d20ce739...42eb88648bfc466785802a298e9330bf/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/1afc426cbf48499882bf41aebbb9da73...acb93e064ac34038a0972267feef4ea5/)
   [Finished :arrow_down:0.53% :arrow_up:0.11%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/71c269e0a929455fbd157130287c779d...d0abab4450c84de9b273127fd85a70c1/)
   Buildkite builds:
   [Finished] [`50a7d15d` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1414)
   [Failed] [`50a7d15d` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1431)
   [Failed] [`50a7d15d` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1415)
   [Finished] [`50a7d15d` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1429)
   [Finished] [`672431b5` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1413)
   [Failed] [`672431b5` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1430)
   [Failed] [`672431b5` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1414)
   [Finished] [`672431b5` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1428)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1173135472

   https://issues.apache.org/jira/browse/ARROW-16695


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
rok commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1235824348

   @westonpace I had to introduce some changes to `RowEncoder` to get the `HashJoin` to work. Could you please confirm that I'm not doing something bad and I'll merge afterwards.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
rok commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1173187174

   @michalursa I'd like to enable joining tables where some of non-key columns are ExtensionTypes. Am I right to assume this should be possible? Naive approach in this PR segfaults.
   
   Trying a similar thing in c++ fails in [RadixRecordBatchSorter 
   (vector_sort.cc)](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L513) which could be fixed by disabling further checks (`VISIT_SORTABLE_PHYSICAL_TYPES`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1219885108

   Yes, thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1220088290

   > Huh, [python and C++ on s390x get the physical type instead of extension](https://app.travis-ci.com/github/apache/arrow/jobs/580252967#L3789).
   
   Probably because we fall back to the basic hash join implementation on big endian machines:
   
   ```
   #if ARROW_LITTLE_ENDIAN
       use_swiss_join = (filter == literal(true)) && !schema_mgr->HasDictionaries() &&
                        !schema_mgr->HasLargeBinary();
   #else
       use_swiss_join = false;
   #endif
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
rok commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1220091865

   Got it! I'll adopt HashJoin as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13501:
URL: https://github.com/apache/arrow/pull/13501#discussion_r949449019


##########
cpp/src/arrow/compute/exec/hash_join_node_test.cc:
##########
@@ -1801,6 +1802,42 @@ TEST(HashJoin, UnsupportedTypes) {
   }
 }
 
+TEST(HashJoin, ExtensionTypes) {
+  auto ext_arr = ExampleUuid();
+  auto l_int_arr = ArrayFromJSON(int32(), "[1, 2, 3, 4]");
+  auto r_int_arr = ArrayFromJSON(int32(), "[1, 2, 3, 4]");
+  auto bin_arr = ArrayFromJSON(
+      fixed_size_binary(16),
+      "[null, \"abcdefghijklmno0\", \"abcdefghijklmno1\", \"abcdefghijklmno2\"]");
+  const bool parallel = false;
+  std::vector<FieldRef> l_keys{{"l_0"}};
+  std::vector<FieldRef> r_keys{{"r_0"}};
+
+  HashJoinNodeOptions join_options{JoinType::INNER,
+                                   {FieldRef("l_0")},
+                                   {FieldRef("r_0")},
+                                   {FieldRef("l_0"), FieldRef("l_1")},
+                                   {FieldRef("r_0"), FieldRef("r_1")},
+                                   {JoinKeyCmp::EQ}};
+
+  std::shared_ptr<Schema> output_schema =
+      schema({field("l_0", int32()), field("l_1", fixed_size_binary(16)),
+              field("r_0", int32()), field("r_1", uuid())});
+
+  Random64Bit rng(42);
+  ASSERT_OK_AND_ASSIGN(
+      auto batches,
+      HashJoinWithExecPlan(rng, parallel, join_options, output_schema,
+                           {l_int_arr, bin_arr}, {r_int_arr, ext_arr}, 4, 4));
+  ASSERT_OK_AND_ASSIGN(auto output_rows_test,
+                       TableFromExecBatches(output_schema, batches));
+  auto table =
+      arrow::Table::Make(output_schema, {l_int_arr, bin_arr, r_int_arr, ext_arr}, 4);

Review Comment:
   Minor nit: Can we call this `expected_table` or just `expected`?



##########
python/pyarrow/tests/test_exec_plan.py:
##########
@@ -28,6 +28,24 @@
 pytestmark = pytest.mark.dataset
 
 
+class UuidType(pa.PyExtensionType):

Review Comment:
   Given that this is also in `test_extension_type.py` do we want to make a common place for this and import instead of repeating?  I don't feel too strongly here as it isn't much repetition.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on a diff in pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
rok commented on code in PR #13501:
URL: https://github.com/apache/arrow/pull/13501#discussion_r949491953


##########
cpp/src/arrow/compute/exec/hash_join_node_test.cc:
##########
@@ -1801,6 +1802,42 @@ TEST(HashJoin, UnsupportedTypes) {
   }
 }
 
+TEST(HashJoin, ExtensionTypes) {
+  auto ext_arr = ExampleUuid();
+  auto l_int_arr = ArrayFromJSON(int32(), "[1, 2, 3, 4]");
+  auto r_int_arr = ArrayFromJSON(int32(), "[1, 2, 3, 4]");
+  auto bin_arr = ArrayFromJSON(
+      fixed_size_binary(16),
+      "[null, \"abcdefghijklmno0\", \"abcdefghijklmno1\", \"abcdefghijklmno2\"]");
+  const bool parallel = false;
+  std::vector<FieldRef> l_keys{{"l_0"}};
+  std::vector<FieldRef> r_keys{{"r_0"}};
+
+  HashJoinNodeOptions join_options{JoinType::INNER,
+                                   {FieldRef("l_0")},
+                                   {FieldRef("r_0")},
+                                   {FieldRef("l_0"), FieldRef("l_1")},
+                                   {FieldRef("r_0"), FieldRef("r_1")},
+                                   {JoinKeyCmp::EQ}};
+
+  std::shared_ptr<Schema> output_schema =
+      schema({field("l_0", int32()), field("l_1", fixed_size_binary(16)),
+              field("r_0", int32()), field("r_1", uuid())});
+
+  Random64Bit rng(42);
+  ASSERT_OK_AND_ASSIGN(
+      auto batches,
+      HashJoinWithExecPlan(rng, parallel, join_options, output_schema,
+                           {l_int_arr, bin_arr}, {r_int_arr, ext_arr}, 4, 4));
+  ASSERT_OK_AND_ASSIGN(auto output_rows_test,
+                       TableFromExecBatches(output_schema, batches));
+  auto table =
+      arrow::Table::Make(output_schema, {l_int_arr, bin_arr, r_int_arr, ext_arr}, 4);

Review Comment:
   Changed to `expected_table`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
rok commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1219831229

   Thanks for the review @westonpace! I've pushed proposed changes. Shall I merge if CI passes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] michalursa commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
michalursa commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1218423588

   > With the current changes in this PR, it fails in ` RowEncoder::Init`, which also would need to be updated to handle ExtensionTypes (delegating to the storage type).
   > 
   > What I don't fully understand here is why the "non-key field" needs to be included in creating the Encoder? The non-key fields are not used for building up the join values, so why is it important that we support that data type? (naively, I would assume that for such non-key fields, we only need to be able to "take" the resulting values needed for the result)
   
   The data on the build side of hash join is stored in a row oriented way, that is why there is a storage transformation that is not simply a Take. Storing hash table rows in a row oriented way is important for performance, especially when the entire hash table does not fit into CPU cache, because there is only an access to contiguous block of memory (potentially just one or a few cache lines next to each other) instead of multiple loads from separate addresses. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #13501:
URL: https://github.com/apache/arrow/pull/13501#issuecomment-1176430760

   What I don't fully understand here is why the "non-key field" needs to be included in creating the Encoder? The non-key fields are not used for building up the join values, so why is it important that we support that data type? 
   (naively, I would assume that for such non-key fields, we only need to be able to "take" the resulting values needed for the result)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok merged pull request #13501: ARROW-16695: [R][Python][C++] Extension types are not supported in joins

Posted by GitBox <gi...@apache.org>.
rok merged PR #13501:
URL: https://github.com/apache/arrow/pull/13501


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org