You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/11 11:01:24 UTC

[GitHub] [arrow] vibhatha opened a new pull request, #13567: ARROW-16911: [C++] Add Equals method to Partitioning

vibhatha opened a new pull request, #13567:
URL: https://github.com/apache/arrow/pull/13567

   Adding `Equals` method to `Partitioning` class and extended classes. Also include a few test cases. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
vibhatha commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r918147690


##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -115,6 +117,21 @@ static Result<RecordBatchVector> ApplyGroupings(
   return out;
 }
 
+bool KeyValuePartitioning::Equals(const Partitioning& other) const {
+  if (this == &other) {
+    return true;
+  }
+  const auto& kv_partion = checked_cast<const KeyValuePartitioning&>(other);
+  int64_t idx = 0;
+  for (const auto& array : dictionaries_) {
+    if (!array->Equals(kv_partion.dictionaries()[idx++])) {

Review Comment:
   @pitrou thanks for noting this one. I am not very much familiary with this piece of the code please advice.
   
   But I guess in general, whatever the system thinks, I should first check the size and if size differs this should return false right? I totally missed this part. Partially being there cannot be taken as a fulfilment of equality right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r918111292


##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -115,6 +117,21 @@ static Result<RecordBatchVector> ApplyGroupings(
   return out;
 }
 
+bool KeyValuePartitioning::Equals(const Partitioning& other) const {
+  if (this == &other) {
+    return true;
+  }
+  const auto& kv_partion = checked_cast<const KeyValuePartitioning&>(other);
+  int64_t idx = 0;
+  for (const auto& array : dictionaries_) {
+    if (!array->Equals(kv_partion.dictionaries()[idx++])) {

Review Comment:
   Hmm, what if `kv_partion.dictionaries()` does not have the same size as `this->dictionaries_`. Is it possible?



##########
cpp/src/arrow/dataset/partition.h:
##########
@@ -63,13 +64,18 @@ struct ARROW_DS_EXPORT PartitionPathFormat {
 /// Paths are consumed from left to right. Paths must be relative to
 /// the root of a partition; path prefixes must be removed before passing
 /// the path to a partitioning for parsing.
-class ARROW_DS_EXPORT Partitioning {
+class ARROW_DS_EXPORT Partitioning : public util::EqualityComparable<Partitioning> {
  public:
   virtual ~Partitioning() = default;
 
   /// \brief The name identifying the kind of partitioning
   virtual std::string type_name() const = 0;
 
+  /// \brief determines if partiioning is exactly equal

Review Comment:
   ```suggestion
     /// \brief Return whether the partitionings are equal
   ```



##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -206,6 +206,15 @@ TEST_F(TestPartitioning, DirectoryPartitioning) {
                                           equal(field_ref("beta"), literal("foo"))));
 }
 
+TEST_F(TestPartitioning, DirectoryPartitioningEquals) {
+  auto part = std::make_shared<DirectoryPartitioning>(
+      schema({field("alpha", int32()), field("beta", utf8())}));
+  auto other = std::make_shared<DirectoryPartitioning>(
+      schema({field("alpha", int32()), field("beta", utf8())}));
+  EXPECT_TRUE(part->Equals(*part));
+  EXPECT_FALSE(part->Equals(*other));

Review Comment:
   (same remark for the below tests)



##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -206,6 +206,15 @@ TEST_F(TestPartitioning, DirectoryPartitioning) {
                                           equal(field_ref("beta"), literal("foo"))));
 }
 
+TEST_F(TestPartitioning, DirectoryPartitioningEquals) {
+  auto part = std::make_shared<DirectoryPartitioning>(
+      schema({field("alpha", int32()), field("beta", utf8())}));
+  auto other = std::make_shared<DirectoryPartitioning>(
+      schema({field("alpha", int32()), field("beta", utf8())}));
+  EXPECT_TRUE(part->Equals(*part));
+  EXPECT_FALSE(part->Equals(*other));

Review Comment:
   Given that `Equals` generally starts by comparing pointer equality, these tests are not sufficient. You should also test a third `DirectoryPartitioning` that has exactly the same attributes as `part`.



##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -82,6 +82,8 @@ std::shared_ptr<Partitioning> Partitioning::Default() {
 
     std::string type_name() const override { return "default"; }
 
+    bool Equals(const Partitioning& other) const override { return false; }

Review Comment:
   Are you planning to actually implement this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r918888503


##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -115,6 +119,28 @@ static Result<RecordBatchVector> ApplyGroupings(
   return out;
 }
 
+bool KeyValuePartitioning::Equals(const Partitioning& other) const {
+  if (this == &other) {
+    return true;
+  }
+  const auto& kv_partitioning = checked_cast<const KeyValuePartitioning&>(other);
+  const auto& other_dictionaries = kv_partitioning.dictionaries();
+  if (dictionaries_.size() != other_dictionaries.size()) {
+    return false;
+  }
+  int64_t idx = 0;
+  for (const auto& array : dictionaries_) {
+    const auto& other_array = other_dictionaries[idx++];
+    bool match = (array == NULLPTR && other_array == NULLPTR) ||

Review Comment:
   just a reminder: NULLPTR is only required in headers, regular code can use `nullptr` like usual.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
vibhatha commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r918148748


##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -82,6 +82,8 @@ std::shared_ptr<Partitioning> Partitioning::Default() {
 
     std::string type_name() const override { return "default"; }
 
+    bool Equals(const Partitioning& other) const override { return false; }

Review Comment:
   I am not sure if I misunderstood the review comment when changing the earlier code into this. I thought this must be always `false`. cc @westonpace 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r918172392


##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -115,6 +117,21 @@ static Result<RecordBatchVector> ApplyGroupings(
   return out;
 }
 
+bool KeyValuePartitioning::Equals(const Partitioning& other) const {
+  if (this == &other) {
+    return true;
+  }
+  const auto& kv_partion = checked_cast<const KeyValuePartitioning&>(other);
+  int64_t idx = 0;
+  for (const auto& array : dictionaries_) {
+    if (!array->Equals(kv_partion.dictionaries()[idx++])) {

Review Comment:
   Yes, you should check the size.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13567:
URL: https://github.com/apache/arrow/pull/13567#issuecomment-1187220817

   @vibhatha This CI failure looks unrelated, can you take a look? https://github.com/apache/arrow/runs/7351561711?check_suite_focus=true#step:6:2042


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13567:
URL: https://github.com/apache/arrow/pull/13567#issuecomment-1189832075

   Benchmark runs are scheduled for baseline = 0fda96c460ef88243de6963e1380cff8aa951e02 and contender = d07dc75e27f016ca05c4ca22bc2d19c79fa2cd4a. d07dc75e27f016ca05c4ca22bc2d19c79fa2cd4a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/97c99bb4f7914c2e93cae919e152df53...4267a9a9b92941728811f0a10ae91237/)
   [Failed :arrow_down:0.97% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/2f4ed674a29a4421a746fe75c651296a...cb796ee1152f4bb7b1e4f1ff5f6ff2ad/)
   [Failed :arrow_down:0.32% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/cd8333a2b1284305ad82edf3085f0fdc...9bdf56f27a844f1bbbc0101d9b3989bd/)
   [Finished :arrow_down:0.04% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/0fee5dfda495469e974416a916c5937f...3603e96f061c4716a8be197ef112049e/)
   Buildkite builds:
   [Failed] [`d07dc75e` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1141)
   [Failed] [`d07dc75e` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1146)
   [Failed] [`d07dc75e` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1133)
   [Finished] [`d07dc75e` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1150)
   [Failed] [`0fda96c4` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1140)
   [Failed] [`0fda96c4` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1145)
   [Failed] [`0fda96c4` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1132)
   [Finished] [`0fda96c4` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1149)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
lidavidm commented on PR #13567:
URL: https://github.com/apache/arrow/pull/13567#issuecomment-1187603307

   Antoine is fixing the Protobuf issue in #13634


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r918002885


##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -82,6 +82,14 @@ std::shared_ptr<Partitioning> Partitioning::Default() {
 
     std::string type_name() const override { return "default"; }
 
+    bool Equals(const Partitioning& other, bool check_metadata) const override {
+      if (this == &other) {
+        return true;
+      }
+      return checked_cast<const DefaultPartitioning&>(other).type_name() == type_name() &&

Review Comment:
   I agree with @lidavidm that a checked_cast should be avoided.  If you received a pointer type `Partitioning*` you could do a `dynamic_cast` instead of the `type_name` comparison.  However, a failed `dynamic_cast` is an exception when you have a reference type and it's probably better to just compare `type_name`.
   
   The comparison to `type_name` is basically doing the same thing as a cast check anyways.
   
   The reason a `checked_cast` is a bad idea is that it should be ok for a user to do:
   
   ```
   DefaultPartitioning one;
   KeyValuePartitioning two(schema);
   if (one.Equals(two)) {
     std::cout << "equal" << std::endl;
   }
   ```
   
   This `Equals` method should always return false so this toy example is somewhat meaningless.  However, it could be useful if the actual partitions were based on file metadata or something dynamically calculated.  If there is a `checked_cast` then, instead of false, we will get an abort in debug mode and potentially a segmentation fault in release mode.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r924253359


##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -426,6 +456,41 @@ TEST_F(TestPartitioning, HivePartitioning) {
   AssertParseError("/alpha=0.0/beta=3.25/");  // conversion of "0.0" to int32 fails
 }
 
+TEST_F(TestPartitioning, HivePartitioningEquals) {
+  const auto& array_vector = ArrayVector();
+  std::vector<std::shared_ptr<arrow::Array>> other_vector(2);
+  other_vector.push_back(ArrayFromJSON(utf8(), R"(["foo", "bar", "baz"])"));
+  other_vector.push_back(ArrayFromJSON(utf8(), R"(["bar", "foo", "baz"])"));

Review Comment:
   Er, this is creating a vector of size 4 with the first two values zero-initialized, AFAIU.
   ```suggestion
     ArrayVector other_vector(2);
     other_vector[0] = ArrayFromJSON(utf8(), R"(["foo", "bar", "baz"])");
     other_vector[1] = ArrayFromJSON(utf8(), R"(["bar", "foo", "baz"])");
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
vibhatha commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r918666275


##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -959,6 +1025,16 @@ TEST_F(TestPartitioning, Range) {
                          less_equal(field_ref("z"), literal(3.0)))}));
 }
 
+TEST_F(TestPartitioning, RangePartitoningEquals) {

Review Comment:
   yeah that's true. I just added it. I will remove this one. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
vibhatha commented on PR #13567:
URL: https://github.com/apache/arrow/pull/13567#issuecomment-1185133720

   cc @lidavidm @pitrou addressed the reviews


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
vibhatha commented on PR #13567:
URL: https://github.com/apache/arrow/pull/13567#issuecomment-1189892404

   @pitrou Is this something to worry about? I cannot trace the failure? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r918125799


##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -115,6 +117,21 @@ static Result<RecordBatchVector> ApplyGroupings(
   return out;
 }
 
+bool KeyValuePartitioning::Equals(const Partitioning& other) const {
+  if (this == &other) {
+    return true;
+  }
+  const auto& kv_partion = checked_cast<const KeyValuePartitioning&>(other);
+  int64_t idx = 0;
+  for (const auto& array : dictionaries_) {
+    if (!array->Equals(kv_partion.dictionaries()[idx++])) {

Review Comment:
   Also a typo here: `kv_partitioning`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r923472704


##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -458,8 +458,10 @@ TEST_F(TestPartitioning, HivePartitioning) {
 
 TEST_F(TestPartitioning, HivePartitioningEquals) {
   const auto& array_vector = ArrayVector();
-  const auto& other_vector = {ArrayFromJSON(utf8(), R"(["foo", "bar", "baz"])"),
-                              ArrayFromJSON(utf8(), R"(["bar", "foo", "baz"])")};
+  std::vector<std::shared_ptr<arrow::Array>> other_vector;
+  other_vector.reserve(2);

Review Comment:
   nit, but for consistency with the surrounding code, this could just be `ArrayVector other_vector(2)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
vibhatha commented on PR #13567:
URL: https://github.com/apache/arrow/pull/13567#issuecomment-1187601918

   @pitrou I think now it is fixed, but some CIs are failing because of a protobuf issue. 
   
   cc @lidavidm It is visible here too. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
vibhatha commented on PR #13567:
URL: https://github.com/apache/arrow/pull/13567#issuecomment-1187229933

   I will look into this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13567:
URL: https://github.com/apache/arrow/pull/13567#issuecomment-1180255978

   https://issues.apache.org/jira/browse/ARROW-16911


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r917904533


##########
cpp/src/arrow/dataset/partition.h:
##########
@@ -310,6 +322,11 @@ class ARROW_DS_EXPORT FunctionPartitioning : public Partitioning {
 
   std::string type_name() const override { return name_; }
 
+  bool Equals(const Partitioning& other, bool check_metadata) const override {
+    return ::arrow::internal::checked_cast<const FunctionPartitioning&>(other)
+               .type_name() == type_name();

Review Comment:
   TBH, I'm not sure it makes sense to implement Equals for FunctionPartitioning. Or else at best we should check `this == &other`. Also why is this inline in the header unlike the others?



##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -381,6 +404,15 @@ Result<std::vector<KeyValuePartitioning::Key>> DirectoryPartitioning::ParseKeys(
   return ParsePartitionSegments(segments);
 }
 
+bool DirectoryPartitioning::Equals(const Partitioning& other, bool check_metadata) const {
+  if (this == &other) {
+    return true;
+  }
+  const auto& direct_part =
+      ::arrow::internal::checked_cast<const DirectoryPartitioning&>(other);
+  return type_name() == direct_part.type_name() && KeyValuePartitioning::Equals(other);

Review Comment:
   ```suggestion
     return type_name() == direct_part.type_name() && KeyValuePartitioning::Equals(other);
   ```



##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -82,6 +82,14 @@ std::shared_ptr<Partitioning> Partitioning::Default() {
 
     std::string type_name() const override { return "default"; }
 
+    bool Equals(const Partitioning& other, bool check_metadata) const override {
+      if (this == &other) {
+        return true;
+      }
+      return checked_cast<const DefaultPartitioning&>(other).type_name() == type_name() &&

Review Comment:
   Don't do the checked cast here, it'll be wrong if the other partitioning isn't the same type. You don't use any methods from the derived type anyways.



##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -381,6 +404,15 @@ Result<std::vector<KeyValuePartitioning::Key>> DirectoryPartitioning::ParseKeys(
   return ParsePartitionSegments(segments);
 }
 
+bool DirectoryPartitioning::Equals(const Partitioning& other, bool check_metadata) const {
+  if (this == &other) {
+    return true;
+  }
+  const auto& direct_part =
+      ::arrow::internal::checked_cast<const DirectoryPartitioning&>(other);
+  return type_name() == direct_part.type_name() && KeyValuePartitioning::Equals(other);

Review Comment:
   and ditto below (since KeyValuePartitioning::Equals does the `this == &other` check already



##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -206,6 +206,15 @@ TEST_F(TestPartitioning, DirectoryPartitioning) {
                                           equal(field_ref("beta"), literal("foo"))));
 }
 

Review Comment:
   I'd like to see tests that partitionings of different types don't compare equal.



##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -754,6 +794,17 @@ Result<PartitionPathFormat> HivePartitioning::FormatValues(
   return PartitionPathFormat{fs::internal::JoinAbstractPath(std::move(segments)), ""};
 }
 
+bool HivePartitioning::Equals(const Partitioning& other, bool check_metadata) const {
+  if (this == &other) {
+    return true;
+  }
+  const auto& hive_part = ::arrow::internal::checked_cast<const HivePartitioning&>(other);
+  return type_name() == hive_part.type_name() &&

Review Comment:
   Do the type name check first, then cast and do the rest of the comparisons.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r917974623


##########
cpp/src/arrow/dataset/partition.h:
##########
@@ -63,13 +64,18 @@ struct ARROW_DS_EXPORT PartitionPathFormat {
 /// Paths are consumed from left to right. Paths must be relative to
 /// the root of a partition; path prefixes must be removed before passing
 /// the path to a partitioning for parsing.
-class ARROW_DS_EXPORT Partitioning {
+class ARROW_DS_EXPORT Partitioning : public util::EqualityComparable<Partitioning> {
  public:
   virtual ~Partitioning() = default;
 
   /// \brief The name identifying the kind of partitioning
   virtual std::string type_name() const = 0;
 
+  /// \brief determines if partiioning is exactly equal
+  virtual bool Equals(const Partitioning& other, bool check_metadata = false) const {

Review Comment:
   We shouldn't need a `check_metadata` flag here.  This method here should pass `check_metadata=false` in always.
   
   If, by some chance, a future partitioning decides to depend on the schema metadata (which should be possible) then it should override this method and pass `check_metadata=true`.



##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -82,6 +82,14 @@ std::shared_ptr<Partitioning> Partitioning::Default() {
 
     std::string type_name() const override { return "default"; }
 
+    bool Equals(const Partitioning& other, bool check_metadata) const override {
+      if (this == &other) {
+        return true;
+      }
+      return checked_cast<const DefaultPartitioning&>(other).type_name() == type_name() &&

Review Comment:
   I agree.  If you received a pointer type `Partitioning*` you could do a `dynamic_cast` instead of the `type_name` comparison.  However, a failed `dynamic_cast` is an exception when you have a reference type and it's probably better to just compare `type_name`.
   
   The comparison to `type_name` is basically doing the same thing as a cast check anyways.
   
   The reason a `checked_cast` is a bad idea is that it should be ok for a user to do:
   
   ```
   DefaultPartitioning one;
   KeyValuePartitioning two(schema);
   if (one.Equals(two)) {
     std::cout << "equal" << std::endl;
   }
   ```
   
   This `Equals` method should always return false so this toy example is somewhat meaningless.  However, it could be useful if the actual partitions were based on file metadata or something dynamically calculated.  If there is a `checked_cast` then, instead of false, we will get an abort in debug mode and potentially a segmentation fault in release mode.



##########
cpp/src/arrow/dataset/partition.h:
##########
@@ -310,6 +322,11 @@ class ARROW_DS_EXPORT FunctionPartitioning : public Partitioning {
 
   std::string type_name() const override { return name_; }
 
+  bool Equals(const Partitioning& other, bool check_metadata) const override {
+    return ::arrow::internal::checked_cast<const FunctionPartitioning&>(other)
+               .type_name() == type_name();

Review Comment:
   I was going to suggest we at least compare `parse_impl_` and `format_impl_` but it appears [that is somewhat meaningless](https://en.cppreference.com/w/cpp/utility/functional/function/operator_cmp).
   
   I think we should always return false here and, if a user wants to allow comparisons, they can override this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13567:
URL: https://github.com/apache/arrow/pull/13567#issuecomment-1180256021

   :warning: Ticket **has no components in JIRA**, make sure you assign one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r918162355


##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -82,6 +82,8 @@ std::shared_ptr<Partitioning> Partitioning::Default() {
 
     std::string type_name() const override { return "default"; }
 
+    bool Equals(const Partitioning& other) const override { return false; }

Review Comment:
   I was thinking to always return false for `FunctionPartitioning`.  The default partitioning can simply return `type_name() == other.type_name()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
vibhatha commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r918041699


##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -82,6 +82,14 @@ std::shared_ptr<Partitioning> Partitioning::Default() {
 
     std::string type_name() const override { return "default"; }
 
+    bool Equals(const Partitioning& other, bool check_metadata) const override {
+      if (this == &other) {
+        return true;
+      }
+      return checked_cast<const DefaultPartitioning&>(other).type_name() == type_name() &&

Review Comment:
   Yes your absolutely right, sorry I missed this. Thanks for pointing this out. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
vibhatha commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r918503221


##########
cpp/src/arrow/dataset/partition.cc:
##########
@@ -82,6 +82,8 @@ std::shared_ptr<Partitioning> Partitioning::Default() {
 
     std::string type_name() const override { return "default"; }
 
+    bool Equals(const Partitioning& other) const override { return false; }

Review Comment:
   Oh!. Fixing it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13567:
URL: https://github.com/apache/arrow/pull/13567#issuecomment-1187524939

   Yes, I meant related, sorry :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou merged pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
pitrou merged PR #13567:
URL: https://github.com/apache/arrow/pull/13567


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
vibhatha commented on PR #13567:
URL: https://github.com/apache/arrow/pull/13567#issuecomment-1187228782

   > @vibhatha This CI failure looks unrelated, can you take a look? https://github.com/apache/arrow/runs/7351561711?check_suite_focus=true#step:6:2042
   
   @pitrou seems like related 
   
   ```bash
   /arrow/cpp/src/arrow/dataset/partition_test.cc
   [2024](https://github.com/apache/arrow/runs/7351561711?check_suite_focus=true#step:6:2025)
   In file included from /usr/include/x86_64-linux-gnu/c++/6/bits/c++allocator.h:33:0,
   [2025](https://github.com/apache/arrow/runs/7351561711?check_suite_focus=true#step:6:2026)
                    from /usr/include/c++/6/bits/allocator.h:46,
   [2026](https://github.com/apache/arrow/runs/7351561711?check_suite_focus=true#step:6:2027)
                    from /usr/include/c++/6/string:41,
   [2027](https://github.com/apache/arrow/runs/7351561711?check_suite_focus=true#step:6:2028)
                    from /usr/include/c++/6/stdexcept:39,
   [2028](https://github.com/apache/arrow/runs/7351561711?check_suite_focus=true#step:6:2029)
                    from /usr/include/c++/6/array:39,
   [2029](https://github.com/apache/arrow/runs/7351561711?check_suite_focus=true#step:6:2030)
                    from /usr/include/c++/6/tuple:39,
   [2030](https://github.com/apache/arrow/runs/7351561711?check_suite_focus=true#step:6:2031)
                    from /usr/include/c++/6/functional:55,
   [2031](https://github.com/apache/arrow/runs/7351561711?check_suite_focus=true#step:6:2032)
                    from /arrow/cpp/src/arrow/dataset/partition.h:22,
   [2032](https://github.com/apache/arrow/runs/7351561711?check_suite_focus=true#step:6:2033)
                    from /arrow/cpp/src/arrow/dataset/partition_test.cc:18:
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
vibhatha commented on PR #13567:
URL: https://github.com/apache/arrow/pull/13567#issuecomment-1188799948

   @pitrou Thanks for catching the wrong `push_back`. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13567:
URL: https://github.com/apache/arrow/pull/13567#discussion_r918650561


##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -426,6 +453,37 @@ TEST_F(TestPartitioning, HivePartitioning) {
   AssertParseError("/alpha=0.0/beta=3.25/");  // conversion of "0.0" to int32 fails
 }
 
+TEST_F(TestPartitioning, HivePartitioningEquals) {
+  const auto& array_vector = ArrayVector();
+  const auto& other_vector = {ArrayFromJSON(utf8(), R"(["foo", "bar", "baz"])"),
+                              ArrayFromJSON(utf8(), R"(["bar", "foo", "baz"])")};
+  auto part = std::make_shared<HivePartitioning>(
+      schema({field("alpha", int32()), field("beta", float32())}), array_vector, "xyz");
+  auto other_part = std::make_shared<HivePartitioning>(
+      schema({field("sigma", int32()), field("beta", float32())}), array_vector, "xyz");
+  auto another_part = std::make_shared<HivePartitioning>(
+      schema({field("alpha", int32()), field("beta", float32())}), other_vector, "xyz");
+  auto some_part = std::make_shared<HivePartitioning>(
+      schema({field("alpha", int32()), field("beta", float32())}), array_vector, "abc");

Review Comment:
   Same here.



##########
cpp/src/arrow/dataset/partition.h:
##########
@@ -63,13 +64,18 @@ struct ARROW_DS_EXPORT PartitionPathFormat {
 /// Paths are consumed from left to right. Paths must be relative to
 /// the root of a partition; path prefixes must be removed before passing
 /// the path to a partitioning for parsing.
-class ARROW_DS_EXPORT Partitioning {
+class ARROW_DS_EXPORT Partitioning : public util::EqualityComparable<Partitioning> {
  public:
   virtual ~Partitioning() = default;
 
   /// \brief The name identifying the kind of partitioning
   virtual std::string type_name() const = 0;
 
+  //// \brief Return whether the partitionings are equal
+  virtual bool Equals(const Partitioning& other) const {
+    return schema_->Equals(other.schema_, false);

Review Comment:
   Nit for code clarity
   ```suggestion
       return schema_->Equals(other.schema_, /*check_metadata=*/ false);
   ```



##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -959,6 +1025,16 @@ TEST_F(TestPartitioning, Range) {
                          less_equal(field_ref("z"), literal(3.0)))}));
 }
 
+TEST_F(TestPartitioning, RangePartitoningEquals) {

Review Comment:
   Not sure this check is actually useful? `RangePartitioning` is just a test class...



##########
cpp/src/arrow/dataset/partition_test.cc:
##########
@@ -222,6 +237,18 @@ TEST_F(TestPartitioning, FilenamePartitioning) {
                                          equal(field_ref("beta"), literal("foo"))));
 }
 
+TEST_F(TestPartitioning, FilenamePartitioningEquals) {
+  auto part = std::make_shared<FilenamePartitioning>(
+      schema({field("alpha", int32()), field("beta", utf8())}));
+  auto other_part = std::make_shared<FilenamePartitioning>(
+      schema({field("sigma", int32()), field("beta", utf8())}));
+  auto another_part = std::make_shared<FilenamePartitioning>(
+      schema({field("sigma", int64()), field("beta", utf8())}));
+  EXPECT_TRUE(part->Equals(*part));
+  EXPECT_FALSE(part->Equals(*other_part));
+  EXPECT_FALSE(other_part->Equals(*another_part));

Review Comment:
   Like above, introduce a distinct object that's equal to one of the others?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on pull request #13567: ARROW-16911: [C++] Add Equals method to Partitioning

Posted by GitBox <gi...@apache.org>.
lidavidm commented on PR #13567:
URL: https://github.com/apache/arrow/pull/13567#issuecomment-1181679011

   LGTM after Antoine's comments are addressed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org