You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/21 14:20:33 UTC

[GitHub] [arrow] westonpace opened a new pull request, #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

westonpace opened a new pull request, #13679:
URL: https://github.com/apache/arrow/pull/13679

   The swiss join was correctly breaking up probe side batches but build size batches would get partitioned as-is before any breaking up happened.  That partitioning assumed 16-bit addressable indices and this failed if a build side batch was too large.
   
   Rather than break batches up in the hash-join node I went ahead and started breaking batches up in the source node.  This matches the morsel / batch model and is basically a small precursor for future scheduler changes.
   
   This will have some small end-user impact as the output for larger queries is going to be batched more finely.  However, we were already slicing batches up into 128Ki chunks in the scanner starting with 9.0.0 and so I don't think this is a significant difference.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jonkeane commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
jonkeane commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1192084825

   @ursabot please benchmark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13679:
URL: https://github.com/apache/arrow/pull/13679#discussion_r927865769


##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -144,10 +146,24 @@ class ARROW_EXPORT ExecPlan : public std::enable_shared_from_this<ExecPlan> {
   /// \brief Return the plan's attached metadata
   std::shared_ptr<const KeyValueMetadata> metadata() const;
 
+  /// \brief Should the plan use a legacy batching strategy
+  ///
+  /// This is currently in place only to support the Scanner::ToTable
+  /// method.  This method relies on batch indices from the scanner
+  /// remaining consistent.  This is impractical in the ExecPlan which
+  /// might slice batches as needed (e.g. for a join)
+  ///
+  /// However, it still works for simple plans and this is the only way
+  /// we have at the moment for maintaining implicit order.
+  bool UseLegacyBatching() const { return use_legacy_batching_; }

Review Comment:
   I would very much like to properly address ordering soon (and I think @icexelloss needs it as well).  Once we do that then this flag can go away.  The scanner will assign the initial "sort index" to batches and the sink will just resequence based on this at the end.  The "sort index" will be a property of the batch itself (and there will not be one index series per fragment) and so it will be very easy for the source node to renumber things as it slices batches.
   
   Renumbering today is possible, but not very straightforward.  The scanner adds a scalar column to the batch with the batch index.  So to renumber the batches I'd have to pull out the scalar column (which might not be present if the source is not a scanner), then reattach a scalar column to each sliced batch.  Another complication is that I would have to keep track of the current batch counter for each fragment.  All of this is possible, but enough of a change, that I'd rather not make it and instead focus on the better long term fix.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on a diff in pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
save-buffer commented on code in PR #13679:
URL: https://github.com/apache/arrow/pull/13679#discussion_r927874603


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -1816,7 +1816,7 @@ def test_positional_keywords_raises(tempdir):
 @pytest.mark.parquet
 @pytest.mark.pandas
 def test_read_partition_keys_only(tempdir):
-    BATCH_SIZE = 2 ** 17
+    BATCH_SIZE = 2 ** 15

Review Comment:
   Does this still need to be changed? Wouldn't the source break it up? 



##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -144,10 +146,24 @@ class ARROW_EXPORT ExecPlan : public std::enable_shared_from_this<ExecPlan> {
   /// \brief Return the plan's attached metadata
   std::shared_ptr<const KeyValueMetadata> metadata() const;
 
+  /// \brief Should the plan use a legacy batching strategy
+  ///
+  /// This is currently in place only to support the Scanner::ToTable
+  /// method.  This method relies on batch indices from the scanner
+  /// remaining consistent.  This is impractical in the ExecPlan which
+  /// might slice batches as needed (e.g. for a join)
+  ///
+  /// However, it still works for simple plans and this is the only way
+  /// we have at the moment for maintaining implicit order.
+  bool UseLegacyBatching() const { return use_legacy_batching_; }

Review Comment:
   > Once we do that then this flag can go away. The scanner will assign the initial "sort index" to batches and the sink will just resequence based on this at the end.
   
   How will this interact with group by/union/join?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1192084839

   Commit ef0a8bf949c82060402a43d775c1b0bb2f395992 already has scheduled benchmark runs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1192082049

   Commit ef0a8bf949c82060402a43d775c1b0bb2f395992 already has scheduled benchmark runs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1193250719

   Benchmark runs are scheduled for baseline = 3c7a0cad0e25ed66e4c555d9da49f320f803573c and contender = ee2e9448c8565820ba38a2df9e44ab6055e5df1d. ee2e9448c8565820ba38a2df9e44ab6055e5df1d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/f43c5e3114554b0282dfb1b68e249624...8a73a8c19db84f35bb8e2174d7af108a/)
   [Failed :arrow_down:1.43% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/023aa95d3d61476fba2bff60f7925aa8...e0e0612b63b248eabb9ec5c98a2822fe/)
   [Failed :arrow_down:0.27% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/84c83ef14f2d49b29462681b44085d84...1e73e861fb814aa4aa21a9416de9173a/)
   [Finished :arrow_down:1.14% :arrow_up:0.11%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f764f300f550404d8c1269f45c437813...054531ceeba84c109a1cb3d83f70726c/)
   Buildkite builds:
   [Failed] [`ee2e9448` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1187)
   [Finished] [`ee2e9448` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1199)
   [Finished] [`ee2e9448` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1181)
   [Finished] [`ee2e9448` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1201)
   [Failed] [`3c7a0cad` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1186)
   [Failed] [`3c7a0cad` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1198)
   [Failed] [`3c7a0cad` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1180)
   [Finished] [`3c7a0cad` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1200)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1192234425

   @save-buffer do you mind reviewing this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1192085504

   @ursabot please benchmark lang=R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace merged pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
westonpace merged PR #13679:
URL: https://github.com/apache/arrow/pull/13679


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1191784615

   https://issues.apache.org/jira/browse/ARROW-17115


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1192082032

   @ursabot please benchmark lang=R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13679:
URL: https://github.com/apache/arrow/pull/13679#discussion_r927992050


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -1816,7 +1816,7 @@ def test_positional_keywords_raises(tempdir):
 @pytest.mark.parquet
 @pytest.mark.pandas
 def test_read_partition_keys_only(tempdir):
-    BATCH_SIZE = 2 ** 17
+    BATCH_SIZE = 2 ** 15

Review Comment:
   The source breaks it up too much.  I could probably fix the expected number of chunks but this test is easier to understand if it is written write at the splitting boundary.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1191547601

   @ursabot please benchmark lang=C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1192244782

   CI failures (so far) look unrelated.  TPCH-19 failed one of the two conbench jobs but it appears there is some past flakiness with this query.  I'm trying to figure out exactly how much.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1192217616

   @ursabot please benchmark lang=R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1192085572

   Benchmark runs are scheduled for baseline = 5014dc5f78d51a0df7af144805af84a26e611899 and contender = 286acbb22fbd6c570d0b8d7f1060ba9d14ed9008. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/22d7225e39f9482ab54ecad81749ad81...c2b13c11f0a94bedb2d8b9a329b90efe/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/4ab2cbe1988647e48901b5eeea9ceb97...a94d74133ae34820b8025c0ca72304cd/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/94f8b6eada8047a0b646b9456c2675aa...0c8b537614444f008df5743b54e159c7/)
   [Skipped :warning: Only ['C++', 'Java'] langs are supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a9c67b378464469ab38fed59db50092c...cb5fb5c0a68b4532872d3bebabf57114/)
   Buildkite builds:
   [Scheduled] [`286acbb2` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1181)
   [Scheduled] [`286acbb2` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1162)
   [Scheduled] [`5014dc5f` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1175)
   [Scheduled] [`5014dc5f` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1180)
   [Scheduled] [`5014dc5f` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1161)
   [Scheduled] [`5014dc5f` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1178)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13679:
URL: https://github.com/apache/arrow/pull/13679#discussion_r927866606


##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -144,10 +146,24 @@ class ARROW_EXPORT ExecPlan : public std::enable_shared_from_this<ExecPlan> {
   /// \brief Return the plan's attached metadata
   std::shared_ptr<const KeyValueMetadata> metadata() const;
 
+  /// \brief Should the plan use a legacy batching strategy
+  ///
+  /// This is currently in place only to support the Scanner::ToTable
+  /// method.  This method relies on batch indices from the scanner
+  /// remaining consistent.  This is impractical in the ExecPlan which
+  /// might slice batches as needed (e.g. for a join)
+  ///
+  /// However, it still works for simple plans and this is the only way
+  /// we have at the moment for maintaining implicit order.
+  bool UseLegacyBatching() const { return use_legacy_batching_; }

Review Comment:
   > Also, ScanBatches is the underlying method that requires sequencing right?
   
   ToTable uses `ScanBatchesUnorderedAsync` today and then inserts the items into a priority queue when they come out of Acero.  `ScanBatches` is nearly unused but I think there is still one python method that binds to it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1192942528

   > How will this interact with group by/union/join?
   
   To start with these nodes would assign the batch index to -1.  This would signal that any existing ordering is being destroyed and there is no meaningful new ordering.  The sink wouldn't try to resequence and would just return things in whatever order it arrives.  This is probably fine for a while since users generally assume these things destroy ordering.
   
   If a user wants order and adds a sort node then the sort node will replace whatever the existing batch index is (even if it is -1) with a new sequence.
   
   In the future:
   
   A group_by could optionally (I think) emit items in ascending group ID order.  It would then re-sequence the ouptut.  So the first batch it emits would have index 0, the next would have index 1, and so on.
   
   If we add support for a merge_join we could optionally emit items in ascending key order.  The merge_join would emit batches with a new order sequence.
   
   Union is a bit of a tricky one and I imagine it would simply remain an order destroying node.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #13679:
URL: https://github.com/apache/arrow/pull/13679#discussion_r927750234


##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -144,10 +146,24 @@ class ARROW_EXPORT ExecPlan : public std::enable_shared_from_this<ExecPlan> {
   /// \brief Return the plan's attached metadata
   std::shared_ptr<const KeyValueMetadata> metadata() const;
 
+  /// \brief Should the plan use a legacy batching strategy
+  ///
+  /// This is currently in place only to support the Scanner::ToTable
+  /// method.  This method relies on batch indices from the scanner
+  /// remaining consistent.  This is impractical in the ExecPlan which
+  /// might slice batches as needed (e.g. for a join)
+  ///
+  /// However, it still works for simple plans and this is the only way
+  /// we have at the moment for maintaining implicit order.
+  bool UseLegacyBatching() const { return use_legacy_batching_; }

Review Comment:
   I'm not super enthused about this flag, but it seems it'll be hard to get around it. Is there a plan for being able to get rid of it? 
   
   Also, ScanBatches is the underlying method that requires sequencing right?



##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -144,10 +146,24 @@ class ARROW_EXPORT ExecPlan : public std::enable_shared_from_this<ExecPlan> {
   /// \brief Return the plan's attached metadata
   std::shared_ptr<const KeyValueMetadata> metadata() const;
 
+  /// \brief Should the plan use a legacy batching strategy
+  ///
+  /// This is currently in place only to support the Scanner::ToTable
+  /// method.  This method relies on batch indices from the scanner
+  /// remaining consistent.  This is impractical in the ExecPlan which
+  /// might slice batches as needed (e.g. for a join)
+  ///
+  /// However, it still works for simple plans and this is the only way
+  /// we have at the moment for maintaining implicit order.
+  bool UseLegacyBatching() const { return use_legacy_batching_; }

Review Comment:
   In principle is it not possible to renumber the batches? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1191547711

   Benchmark runs are scheduled for baseline = 1ae8dc747c6835069c652874124dae07c50a3bf4 and contender = ef0a8bf949c82060402a43d775c1b0bb2f395992. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/6f321d012cae4b43b08d885290d3e8af...cdb4ed2d71d1491b9524a7e65c2afe5e/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/832cf1a3dea143be95ca69e7b450bfb3...c57fceb5bc1149ac92403bd0087569bf/)
   [Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/335629f871fe4aa2983e227a574949e7...356c04d8df824ea49885122964e813b8/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/417c2743dee64406949e9e63a729afe2...c9bcc03250844bdfacbd5d515319d576/)
   Buildkite builds:
   [Scheduled] [`ef0a8bf9` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1174)
   [Scheduled] [`ef0a8bf9` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1173)
   [Failed] [`1ae8dc74` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1168)
   [Scheduled] [`1ae8dc74` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1173)
   [Scheduled] [`1ae8dc74` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1155)
   [Scheduled] [`1ae8dc74` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1172)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #13679: ARROW-17115: [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13679:
URL: https://github.com/apache/arrow/pull/13679#issuecomment-1192217663

   Benchmark runs are scheduled for baseline = 5014dc5f78d51a0df7af144805af84a26e611899 and contender = b9b8f1ec5dc4e91be433b21d6ef6e0816711328b. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/22d7225e39f9482ab54ecad81749ad81...9e3bf70a57f54e70ade79d880df39fe9/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/4ab2cbe1988647e48901b5eeea9ceb97...537756481d9b4479b3b8a1936d1de4aa/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/94f8b6eada8047a0b646b9456c2675aa...95c9432c07f0434381b9cb8c4d763523/)
   [Skipped :warning: Only ['C++', 'Java'] langs are supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a9c67b378464469ab38fed59db50092c...025a81869ae54290aa24651ba4243e2e/)
   Buildkite builds:
   [Scheduled] [`b9b8f1ec` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1183)
   [Scheduled] [`b9b8f1ec` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1164)
   [Failed] [`5014dc5f` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1175)
   [Failed] [`5014dc5f` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1180)
   [Failed] [`5014dc5f` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1161)
   [Finished] [`5014dc5f` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1178)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org