You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/02/11 01:03:16 UTC

[GitHub] [arrow] westonpace opened a new pull request, #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

westonpace opened a new pull request, #34137:
URL: https://github.com/apache/arrow/pull/34137

   In addition, it is now possible to bypass the I/O executor on the `record_batch_source`, `exec_batch_source`, and `array_vector_source`.
   
   It is now possible to create a source node from a `gen::Gen` generator.
   
   BREAKING CHANGE: The default executor for `record_batch_source`, `exec_batch_source`, and `array_vector_source` was (erroneously) the plan's CPU executor.  It now defaults properly to the I/O executor.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace merged pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace merged PR #34137:
URL: https://github.com/apache/arrow/pull/34137


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34137:
URL: https://github.com/apache/arrow/pull/34137#discussion_r1113713497


##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -146,6 +148,46 @@ class ARROW_EXPORT ExecNode {
 
   virtual Status Validate() const;
 
+  /// \brief the ordering of the output batches
+  ///
+  /// This does not guarantee the batches will be emitted by this node
+  /// in order.  Instead it guarantees that the batches will have their
+  /// ExecBatch::index property set in a way that respects this ordering.
+  ///
+  /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
+  /// know that all values of x in a batch with index N will be less than
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0)

Review Comment:
   I've updated the wording here to mention ordering within a batch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34137:
URL: https://github.com/apache/arrow/pull/34137#issuecomment-1426546742

   * Closes: #34136


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34137:
URL: https://github.com/apache/arrow/pull/34137#issuecomment-1426546756

   :warning: GitHub issue #34136 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34137:
URL: https://github.com/apache/arrow/pull/34137#discussion_r1113653542


##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -146,6 +148,46 @@ class ARROW_EXPORT ExecNode {
 
   virtual Status Validate() const;
 
+  /// \brief the ordering of the output batches
+  ///
+  /// This does not guarantee the batches will be emitted by this node
+  /// in order.  Instead it guarantees that the batches will have their
+  /// ExecBatch::index property set in a way that respects this ordering.
+  ///
+  /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
+  /// know that all values of x in a batch with index N will be less than
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0)
+  ///
+  /// Note that an ordering can be both Ordering::Unordered and Ordering::Implicit.
+  /// A node's output should be marked Ordering::Unordered if the order is
+  /// non-deterministic.  For example, a hash-join has no predictable output order.
+  ///
+  /// If the ordering is Ordering::Implicit then there is a meaningful order but that
+  /// odering is not represented by any column in the data.  The most common case for this
+  /// is when reading data from an in-memory table.  The data has an implicit "row order"
+  /// which is not neccesarily represented in the data set.
+  ///
+  /// A typical map node will not modify the ordering.  Nothing needs to be done
+  /// other than ensure the index assigned to output batches is the same as the
+  /// input batch that was mapped.
+  ///
+  /// Other nodes may introduce order.  For example, an order-by node will emit
+  /// a brand new ordering independent of the input ordering.
+  ///
+  /// Finally, as described above, such as a hash-join or aggregation may may
+  /// destroy ordering (although these nodes could also choose to establish a
+  /// new ordering based on the hash keys).
+  ///
+  /// Some nodes will require an ordering.  For example, a fetch node or an
+  /// asof join node will only function if the input data is ordered (for fetch
+  /// it is enough to be implicitly ordered.  For an asof join the ordering must
+  /// be explicit and compatible with the on key.)
+  ///
+  /// Nodes that maintain ordering should be careful to avoid introducing gaps
+  /// in the batch index.  This may require emitting empty batches in order to
+  /// maintain continuity.

Review Comment:
   Good question.  The dataset writer will discard empty batches without writing anything.  However, the sink node still respects empty batches.  For example, if one were doing `dataset.to_batches(...)` then they might see an empty batch.
   
   I'm fairly certain this is consistent with the current implementation and not a change in behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34137:
URL: https://github.com/apache/arrow/pull/34137#issuecomment-1426546923

   This PR should be complete but it builds on #34059 which has not yet merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34137:
URL: https://github.com/apache/arrow/pull/34137#discussion_r1113652757


##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -146,6 +148,46 @@ class ARROW_EXPORT ExecNode {
 
   virtual Status Validate() const;
 
+  /// \brief the ordering of the output batches
+  ///
+  /// This does not guarantee the batches will be emitted by this node
+  /// in order.  Instead it guarantees that the batches will have their
+  /// ExecBatch::index property set in a way that respects this ordering.
+  ///
+  /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
+  /// know that all values of x in a batch with index N will be less than
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0)
+  ///
+  /// Note that an ordering can be both Ordering::Unordered and Ordering::Implicit.
+  /// A node's output should be marked Ordering::Unordered if the order is
+  /// non-deterministic.  For example, a hash-join has no predictable output order.
+  ///
+  /// If the ordering is Ordering::Implicit then there is a meaningful order but that
+  /// odering is not represented by any column in the data.  The most common case for this
+  /// is when reading data from an in-memory table.  The data has an implicit "row order"
+  /// which is not neccesarily represented in the data set.
+  ///
+  /// A typical map node will not modify the ordering.  Nothing needs to be done

Review Comment:
   Ah, yes, this is probably an internal term.  I'll update this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34137:
URL: https://github.com/apache/arrow/pull/34137#discussion_r1113651753


##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -146,6 +148,46 @@ class ARROW_EXPORT ExecNode {
 
   virtual Status Validate() const;
 
+  /// \brief the ordering of the output batches
+  ///
+  /// This does not guarantee the batches will be emitted by this node
+  /// in order.  Instead it guarantees that the batches will have their
+  /// ExecBatch::index property set in a way that respects this ordering.
+  ///
+  /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
+  /// know that all values of x in a batch with index N will be less than
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0)

Review Comment:
   Yes, it should guarantee that the ordering exists within the batch as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34137:
URL: https://github.com/apache/arrow/pull/34137#issuecomment-1447853650

   This is ready to be merged?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #34137:
URL: https://github.com/apache/arrow/pull/34137#discussion_r1110484479


##########
cpp/src/arrow/compute/ordering.h:
##########
@@ -0,0 +1,120 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <string>
+#include <vector>
+
+#include "arrow/type.h"
+#include "arrow/util/compare.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace compute {
+
+enum class SortOrder {
+  /// Arrange values in increasing order
+  Ascending,
+  /// Arrange values in decreasing order
+  Descending,
+};
+
+enum class NullPlacement {
+  /// Place nulls and NaNs before any non-null values.
+  /// NaNs will come after nulls.
+  AtStart,
+  /// Place nulls and NaNs after any non-null values.
+  /// NaNs will come before nulls.
+  AtEnd,
+};
+
+/// \brief One sort key for PartitionNthIndices (TODO) and SortIndices
+class ARROW_EXPORT SortKey : public util::EqualityComparable<SortKey> {
+ public:
+  explicit SortKey(FieldRef target, SortOrder order = SortOrder::Ascending)
+      : target(std::move(target)), order(order) {}
+
+  bool Equals(const SortKey& other) const;
+  std::string ToString() const;
+
+  /// A FieldRef targetting the sort column.
+  FieldRef target;
+  /// How to order by this sort key.
+  SortOrder order;
+};
+
+class ARROW_EXPORT Ordering : public util::EqualityComparable<Ordering> {
+ public:
+  Ordering(std::vector<SortKey> sort_keys,
+           NullPlacement null_placement = NullPlacement::AtStart)
+      : sort_keys_(std::move(sort_keys)), null_placement_(null_placement) {}
+  /// true if data ordered by other is also ordered by this
+  ///
+  /// For example, if data is ordered by [a, b, c] then it is also ordered
+  /// by [a, b] but not by [b, c] or [a, b, c, d].
+  ///
+  /// [a, b].IsSuborderOf([a, b, c]) - true
+  /// [a, b, c].IsSuborderOf([a, b, c]) - true
+  /// [b, c].IsSuborderOf([a, b, c]) - false
+  /// [a, b, c, d].IsSuborderOf([a, b, c]) - false
+  ///
+  /// The implicit ordering is not a suborder of any other ordering and
+  /// no other ordering is a suborder of it.  The implicit ordering is not a
+  /// suborder of itself.
+  ///
+  /// The unordered ordering is a suborder of all other orderings but no
+  /// other ordering is a suborder of it.  The unordered ordering is a suborder
+  /// of itself.
+  ///
+  /// The unordered ordering is a suborder of the implicit ordering.
+  bool IsSuborderOf(const Ordering& other) const;
+
+  bool Equals(const Ordering& other) const;
+  std::string ToString() const;
+
+  bool is_implicit() const { return is_implicit_; }
+  bool is_unordered() const { return !is_implicit_ && sort_keys_.empty(); }
+
+  std::vector<SortKey> sort_keys() { return sort_keys_; }
+  NullPlacement null_placement() { return null_placement_; }
+
+  static const Ordering& Imiplicit() {

Review Comment:
   typo: Implicit



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34137:
URL: https://github.com/apache/arrow/pull/34137#discussion_r1113577301


##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -146,6 +148,46 @@ class ARROW_EXPORT ExecNode {
 
   virtual Status Validate() const;
 
+  /// \brief the ordering of the output batches
+  ///
+  /// This does not guarantee the batches will be emitted by this node
+  /// in order.  Instead it guarantees that the batches will have their
+  /// ExecBatch::index property set in a way that respects this ordering.
+  ///
+  /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
+  /// know that all values of x in a batch with index N will be less than
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0)

Review Comment:
   If there is an order based on a column "x", does that also guarantee something about the order _within_ each batch? (or only between batches as this paragraph explains)



##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -146,6 +148,46 @@ class ARROW_EXPORT ExecNode {
 
   virtual Status Validate() const;
 
+  /// \brief the ordering of the output batches
+  ///
+  /// This does not guarantee the batches will be emitted by this node
+  /// in order.  Instead it guarantees that the batches will have their
+  /// ExecBatch::index property set in a way that respects this ordering.
+  ///
+  /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
+  /// know that all values of x in a batch with index N will be less than
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0)
+  ///
+  /// Note that an ordering can be both Ordering::Unordered and Ordering::Implicit.
+  /// A node's output should be marked Ordering::Unordered if the order is
+  /// non-deterministic.  For example, a hash-join has no predictable output order.
+  ///
+  /// If the ordering is Ordering::Implicit then there is a meaningful order but that
+  /// odering is not represented by any column in the data.  The most common case for this
+  /// is when reading data from an in-memory table.  The data has an implicit "row order"
+  /// which is not neccesarily represented in the data set.
+  ///
+  /// A typical map node will not modify the ordering.  Nothing needs to be done

Review Comment:
   What is exactly meant with "map node"? (I don't find this term used anywhere else in the compute / Acero code or docs) Do we mean a node that uses typical element-wise scalar kernels? Also something like a filter node will always preserve ordering.



##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -146,6 +148,46 @@ class ARROW_EXPORT ExecNode {
 
   virtual Status Validate() const;
 
+  /// \brief the ordering of the output batches
+  ///
+  /// This does not guarantee the batches will be emitted by this node
+  /// in order.  Instead it guarantees that the batches will have their
+  /// ExecBatch::index property set in a way that respects this ordering.
+  ///
+  /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
+  /// know that all values of x in a batch with index N will be less than
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0)
+  ///
+  /// Note that an ordering can be both Ordering::Unordered and Ordering::Implicit.
+  /// A node's output should be marked Ordering::Unordered if the order is
+  /// non-deterministic.  For example, a hash-join has no predictable output order.
+  ///
+  /// If the ordering is Ordering::Implicit then there is a meaningful order but that
+  /// odering is not represented by any column in the data.  The most common case for this
+  /// is when reading data from an in-memory table.  The data has an implicit "row order"
+  /// which is not neccesarily represented in the data set.
+  ///
+  /// A typical map node will not modify the ordering.  Nothing needs to be done
+  /// other than ensure the index assigned to output batches is the same as the
+  /// input batch that was mapped.
+  ///
+  /// Other nodes may introduce order.  For example, an order-by node will emit
+  /// a brand new ordering independent of the input ordering.
+  ///
+  /// Finally, as described above, such as a hash-join or aggregation may may
+  /// destroy ordering (although these nodes could also choose to establish a
+  /// new ordering based on the hash keys).
+  ///
+  /// Some nodes will require an ordering.  For example, a fetch node or an
+  /// asof join node will only function if the input data is ordered (for fetch
+  /// it is enough to be implicitly ordered.  For an asof join the ordering must
+  /// be explicit and compatible with the on key.)
+  ///
+  /// Nodes that maintain ordering should be careful to avoid introducing gaps
+  /// in the batch index.  This may require emitting empty batches in order to
+  /// maintain continuity.

Review Comment:
   Just wondering, but assume you would do a filter operation that filters a large part of the data, and so you might end up with several empty batches, does that affect for example sink nodes like writing files? (do we skip empty batches there, or do we then potentially write empty files for them?) 



##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -146,6 +148,46 @@ class ARROW_EXPORT ExecNode {
 
   virtual Status Validate() const;
 
+  /// \brief the ordering of the output batches
+  ///
+  /// This does not guarantee the batches will be emitted by this node
+  /// in order.  Instead it guarantees that the batches will have their
+  /// ExecBatch::index property set in a way that respects this ordering.
+  ///
+  /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
+  /// know that all values of x in a batch with index N will be less than
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0)
+  ///
+  /// Note that an ordering can be both Ordering::Unordered and Ordering::Implicit.
+  /// A node's output should be marked Ordering::Unordered if the order is
+  /// non-deterministic.  For example, a hash-join has no predictable output order.
+  ///
+  /// If the ordering is Ordering::Implicit then there is a meaningful order but that
+  /// odering is not represented by any column in the data.  The most common case for this
+  /// is when reading data from an in-memory table.  The data has an implicit "row order"
+  /// which is not neccesarily represented in the data set.
+  ///
+  /// A typical map node will not modify the ordering.  Nothing needs to be done

Review Comment:
   OK, I searched for "map node" and not for "MapNode" ;) 
   And I see that indeed both Project and Filter node inherit from MapNode (it's still a term that is not used in our documentation though)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34137:
URL: https://github.com/apache/arrow/pull/34137#discussion_r1110528592


##########
cpp/src/arrow/compute/ordering.h:
##########
@@ -0,0 +1,120 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <string>
+#include <vector>
+
+#include "arrow/type.h"
+#include "arrow/util/compare.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace compute {
+
+enum class SortOrder {
+  /// Arrange values in increasing order
+  Ascending,
+  /// Arrange values in decreasing order
+  Descending,
+};
+
+enum class NullPlacement {
+  /// Place nulls and NaNs before any non-null values.
+  /// NaNs will come after nulls.
+  AtStart,
+  /// Place nulls and NaNs after any non-null values.
+  /// NaNs will come before nulls.
+  AtEnd,
+};
+
+/// \brief One sort key for PartitionNthIndices (TODO) and SortIndices
+class ARROW_EXPORT SortKey : public util::EqualityComparable<SortKey> {
+ public:
+  explicit SortKey(FieldRef target, SortOrder order = SortOrder::Ascending)
+      : target(std::move(target)), order(order) {}
+
+  bool Equals(const SortKey& other) const;
+  std::string ToString() const;
+
+  /// A FieldRef targetting the sort column.
+  FieldRef target;
+  /// How to order by this sort key.
+  SortOrder order;
+};
+
+class ARROW_EXPORT Ordering : public util::EqualityComparable<Ordering> {
+ public:
+  Ordering(std::vector<SortKey> sort_keys,
+           NullPlacement null_placement = NullPlacement::AtStart)
+      : sort_keys_(std::move(sort_keys)), null_placement_(null_placement) {}
+  /// true if data ordered by other is also ordered by this
+  ///
+  /// For example, if data is ordered by [a, b, c] then it is also ordered
+  /// by [a, b] but not by [b, c] or [a, b, c, d].
+  ///
+  /// [a, b].IsSuborderOf([a, b, c]) - true
+  /// [a, b, c].IsSuborderOf([a, b, c]) - true
+  /// [b, c].IsSuborderOf([a, b, c]) - false
+  /// [a, b, c, d].IsSuborderOf([a, b, c]) - false
+  ///
+  /// The implicit ordering is not a suborder of any other ordering and
+  /// no other ordering is a suborder of it.  The implicit ordering is not a
+  /// suborder of itself.
+  ///
+  /// The unordered ordering is a suborder of all other orderings but no
+  /// other ordering is a suborder of it.  The unordered ordering is a suborder
+  /// of itself.
+  ///
+  /// The unordered ordering is a suborder of the implicit ordering.
+  bool IsSuborderOf(const Ordering& other) const;
+
+  bool Equals(const Ordering& other) const;
+  std::string ToString() const;
+
+  bool is_implicit() const { return is_implicit_; }
+  bool is_unordered() const { return !is_implicit_ && sort_keys_.empty(); }
+
+  std::vector<SortKey> sort_keys() { return sort_keys_; }
+  NullPlacement null_placement() { return null_placement_; }
+
+  static const Ordering& Imiplicit() {

Review Comment:
   :facepalm: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34137:
URL: https://github.com/apache/arrow/pull/34137#issuecomment-1439198325

   > even if you have a query plan with an order_by node at the end before consuming as eg a table, you won't get the data as ordered by default, because for the ordering to be respected, you need to explicitly enable it with sequence_output?
   
   Correct.  Maybe the default should be true when the input to the sink node is ordered.  This would mean we only default to false if there is an aggregate or join.  Given the cost of this sequencing should generally be pretty reasonable I think it would be an ok default (and users could still disable it if they wanted).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34137:
URL: https://github.com/apache/arrow/pull/34137#discussion_r1113713671


##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -146,6 +148,46 @@ class ARROW_EXPORT ExecNode {
 
   virtual Status Validate() const;
 
+  /// \brief the ordering of the output batches
+  ///
+  /// This does not guarantee the batches will be emitted by this node
+  /// in order.  Instead it guarantees that the batches will have their
+  /// ExecBatch::index property set in a way that respects this ordering.
+  ///
+  /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
+  /// know that all values of x in a batch with index N will be less than
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0)
+  ///
+  /// Note that an ordering can be both Ordering::Unordered and Ordering::Implicit.
+  /// A node's output should be marked Ordering::Unordered if the order is
+  /// non-deterministic.  For example, a hash-join has no predictable output order.
+  ///
+  /// If the ordering is Ordering::Implicit then there is a meaningful order but that
+  /// odering is not represented by any column in the data.  The most common case for this
+  /// is when reading data from an in-memory table.  The data has an implicit "row order"
+  /// which is not neccesarily represented in the data set.
+  ///
+  /// A typical map node will not modify the ordering.  Nothing needs to be done

Review Comment:
   I changed this to "A filter or project node..."



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34137:
URL: https://github.com/apache/arrow/pull/34137#issuecomment-1435355922

   @lidavidm Sorry, I requested your review a bit too early.  I started working on the order by node and realized that the ordering I had in place didn't support the ability to specify null placement.  So I changed it to a proper class similar to SortNodeOptions.  The change is pretty minor but I'll leave up unless you want to take another look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34137:
URL: https://github.com/apache/arrow/pull/34137#issuecomment-1439130374

   > One more question: assume you would use something like `DeclarationToTable`, does that automatically use the ordering / batch indices if there is one, or do you still need to indicate this manually you want to use it? (like the generic `SinkNodeOptions` has a `sequence_delivery` parameter with a default of false, but I don't see that exposed through the DeclarationTo.. versions)
   
   Ah, I see there is a new variant of `DeclarationToTable` that uses QueryOptions as parameter, and that options struct has a `sequence_output` keyword. So I assume that answers my question.
   
   Follow-up question on this: the default for `sequence_output` is false, so does that mean that even if you have a query plan with an order_by node at the end before consuming as eg a table, you won't get the data as ordered by default, because for the ordering to be respected, you need to explicitly enable it with `sequence_output`?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34137:
URL: https://github.com/apache/arrow/pull/34137#issuecomment-1439282326

   > Correct. Maybe I should change to an optional and the default (nullopt) would sequence when the input to the sink node is ordered. This would mean we only default to false if there is an aggregate or join. Given the cost of this sequencing should generally be pretty reasonable I think it would be an ok default (and users could still disable it if they wanted).
   
   I've done this.  The default (nullopt) means "sequence if there is any ordering".  It can be set to true to get "fail validation if there is no meaningful ordering" or false to get "never sequence and maximize performance even if there is a meaningful ordering".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34137:
URL: https://github.com/apache/arrow/pull/34137#discussion_r1113934248


##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -156,7 +156,10 @@ class ARROW_EXPORT ExecNode {
   ///
   /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
   /// know that all values of x in a batch with index N will be less than
-  /// or equal to all values of x in a batch with index N+k (assuming k > 0)
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0).
+  /// Furthermore, we also know that values will be sorted within a batch.
+  /// Any row N will have a value of x that is less than the value for

Review Comment:
   ```suggestion
     /// Any row N will have a value of x that is less than or equal to the value for
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34137:
URL: https://github.com/apache/arrow/pull/34137#issuecomment-1448325631

   > This is ready to be merged?
   
   Yes, thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34137:
URL: https://github.com/apache/arrow/pull/34137#issuecomment-1449978961

   Benchmark runs are scheduled for baseline = ef21008d4e253188092f3c54ad4f7e83aac8d79e and contender = 762329b73bac05edfaf0a1e229ef21d801b57048. 762329b73bac05edfaf0a1e229ef21d801b57048 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/6ef25ffbbe674f37a2801e215dcbf0cf...05dbc3fb25094099b230ffc9c21052e3/)
   [Failed :arrow_down:0.52% :arrow_up:0.03%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/4b2892c9affb4abb9262725844f0e528...b046ccc6f58d4a508013a6b42039611a/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/c94752ab23aa47a79383fe4578c4e581...d1ab2ecccaf04ec28b2715d877c44fbb/)
   [Finished :arrow_down:0.22% :arrow_up:0.16%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/49919f11f2924e33a03c4d0d682f7b88...d937ba0a9c154b1aa4206c71033c1fe9/)
   Buildkite builds:
   [Finished] [`762329b7` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2451)
   [Failed] [`762329b7` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2481)
   [Finished] [`762329b7` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2449)
   [Finished] [`762329b7` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2472)
   [Finished] [`ef21008d` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2450)
   [Finished] [`ef21008d` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2480)
   [Finished] [`ef21008d` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2448)
   [Finished] [`ef21008d` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2471)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org