Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/01 15:08:41 UTC

[GitHub] [arrow] jorisvandenbossche opened a new pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartioning

jorisvandenbossche opened a new pull request #7608:
URL: https://github.com/apache/arrow/pull/7608


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm closed pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartioning

Posted by GitBox <gi...@apache.org>.
wesm closed pull request #7608:
URL: https://github.com/apache/arrow/pull/7608


   





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartioning

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7608:
URL: https://github.com/apache/arrow/pull/7608#discussion_r448438477



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -646,15 +657,26 @@ class HivePartitioningFactory : public PartitioningFactory {
       }
     }
 
-    return impl.Finish(&dictionaries_);
+    auto schema_result = impl.Finish(&dictionaries_);
+    field_names_ = impl.FieldNames();
+    return schema_result;
   }
 
   Result<std::shared_ptr<Partitioning>> Finish(
       const std::shared_ptr<Schema>& schema) const override {
-    return std::shared_ptr<Partitioning>(new HivePartitioning(schema, dictionaries_));
+    for (FieldRef ref : field_names_) {

Review comment:
       I should probably add a guard here for the case where `field_names_` has not been populated yet (that is, `Finish` is called without `Inspect` having been called), so that the vector is still empty?







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartioning

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7608:
URL: https://github.com/apache/arrow/pull/7608#discussion_r450721280



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -646,15 +657,26 @@ class HivePartitioningFactory : public PartitioningFactory {
       }
     }
 
-    return impl.Finish(&dictionaries_);
+    auto schema_result = impl.Finish(&dictionaries_);
+    field_names_ = impl.FieldNames();
+    return schema_result;
   }
 
   Result<std::shared_ptr<Partitioning>> Finish(
       const std::shared_ptr<Schema>& schema) const override {
-    return std::shared_ptr<Partitioning>(new HivePartitioning(schema, dictionaries_));
+    for (FieldRef ref : field_names_) {

Review comment:
       There is no `FieldNames()` method on the PartitioningFactory (only the `impl` has one, but that is not accessible here; that's the reason I added the `field_names_` private member to store those)







[GitHub] [arrow] github-actions[bot] commented on pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartioning

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7608:
URL: https://github.com/apache/arrow/pull/7608#issuecomment-652480774


   https://issues.apache.org/jira/browse/ARROW-9288





[GitHub] [arrow] bkietz commented on a change in pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartioning

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7608:
URL: https://github.com/apache/arrow/pull/7608#discussion_r451690654



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -646,15 +657,26 @@ class HivePartitioningFactory : public PartitioningFactory {
       }
     }
 
-    return impl.Finish(&dictionaries_);
+    auto schema_result = impl.Finish(&dictionaries_);
+    field_names_ = impl.FieldNames();
+    return schema_result;

Review comment:
       `Finish` doesn't mutate `name_to_index_`, which is the only data member accessed by `FieldNames()`, so I don't see why `Finish` needs to be called first.
   ```suggestion
       field_names_ = impl.FieldNames();
       return impl.Finish(&dictionaries_);
   ```

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -646,15 +657,26 @@ class HivePartitioningFactory : public PartitioningFactory {
       }
     }
 
-    return impl.Finish(&dictionaries_);
+    auto schema_result = impl.Finish(&dictionaries_);
+    field_names_ = impl.FieldNames();
+    return schema_result;
   }
 
   Result<std::shared_ptr<Partitioning>> Finish(
       const std::shared_ptr<Schema>& schema) const override {
-    return std::shared_ptr<Partitioning>(new HivePartitioning(schema, dictionaries_));
+    for (FieldRef ref : field_names_) {
+      // ensure all of field_names_ are present in schema
+      RETURN_NOT_OK(ref.FindOne(*schema).status());
+    }
+
+    // drop fields which aren't in field_names_
+    auto out_schema = SchemaFromColumnNames(schema, field_names_);
+
+    return std::make_shared<HivePartitioning>(std::move(out_schema), dictionaries_);

Review comment:
       The check against `field_names_` is only relevant if `dictionaries_` is non-empty, which can only occur if `Inspect` has been called (and `field_names_` has therefore been initialized).
   ```suggestion
       if (dictionaries_.empty()) {
         return std::make_shared<HivePartitioning>(schema, dictionaries_);
       } else {
         for (FieldRef ref : field_names_) {
           // ensure all of field_names_ are present in schema
           RETURN_NOT_OK(ref.FindOne(*schema).status());
         }
   
         // drop fields which aren't in field_names_
         auto out_schema = SchemaFromColumnNames(schema, field_names_);
   
         return std::make_shared<HivePartitioning>(std::move(out_schema), dictionaries_);
       }
   ```
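
For readers following the thread, the guard suggested above can be illustrated outside of Arrow with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in and not the Arrow API: `ToySchema` is just a list of column names, `ToyDictionaries` maps a key to the values seen for it, and `ToyHiveFactory` only mimics the `Inspect`/`Finish` split discussed here. It assumes C++17 for `std::optional`.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Toy stand-ins, NOT the Arrow API: a "schema" is an ordered list of column
// names, and a "dictionary" is the list of values seen for one partition key.
using ToySchema = std::vector<std::string>;
using ToyDictionaries = std::map<std::string, std::vector<std::string>>;

class ToyHiveFactory {
 public:
  // Collect key names and values from segments such as "year=2020".
  void Inspect(const std::vector<std::string>& segments) {
    for (const std::string& segment : segments) {
      const auto eq = segment.find('=');
      if (eq == std::string::npos) continue;  // not a hive-style segment
      const std::string key = segment.substr(0, eq);
      if (dictionaries_.find(key) == dictionaries_.end()) {
        field_names_.push_back(key);  // remember first-seen order of keys
      }
      dictionaries_[key].push_back(segment.substr(eq + 1));
    }
  }

  // Returns the partition schema, or std::nullopt if an inspected key is
  // missing from the provided schema.
  std::optional<ToySchema> Finish(const ToySchema& schema) const {
    if (dictionaries_.empty()) {
      // Inspect never ran (or found nothing), so field_names_ is empty and
      // filtering by it would drop every field. Pass the schema through.
      return schema;
    }
    ToySchema out;
    for (const std::string& name : field_names_) {
      // ensure all of field_names_ are present in schema
      if (std::find(schema.begin(), schema.end(), name) == schema.end()) {
        return std::nullopt;
      }
      out.push_back(name);  // keep only the inspected fields
    }
    return out;
  }

 private:
  ToyDictionaries dictionaries_;
  std::vector<std::string> field_names_;
};
```

In this sketch, calling `Finish` on a factory that never ran `Inspect` returns the schema unchanged, while an inspected factory returns only the inspected keys or reports a missing one. Checking `dictionaries_` rather than `field_names_` mirrors the reasoning in the comment above: the two are populated together, so a non-empty `dictionaries_` implies `field_names_` is set.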







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartioning

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7608:
URL: https://github.com/apache/arrow/pull/7608#discussion_r451801680



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -646,15 +657,26 @@ class HivePartitioningFactory : public PartitioningFactory {
       }
     }
 
-    return impl.Finish(&dictionaries_);
+    auto schema_result = impl.Finish(&dictionaries_);
+    field_names_ = impl.FieldNames();
+    return schema_result;
   }
 
   Result<std::shared_ptr<Partitioning>> Finish(
       const std::shared_ptr<Schema>& schema) const override {
-    return std::shared_ptr<Partitioning>(new HivePartitioning(schema, dictionaries_));
+    for (FieldRef ref : field_names_) {
+      // ensure all of field_names_ are present in schema
+      RETURN_NOT_OK(ref.FindOne(*schema).status());
+    }
+
+    // drop fields which aren't in field_names_
+    auto out_schema = SchemaFromColumnNames(schema, field_names_);
+
+    return std::make_shared<HivePartitioning>(std::move(out_schema), dictionaries_);

Review comment:
       Thanks, checking `dictionaries_` is a nice way to determine whether `field_names_` has been set.







[GitHub] [arrow] fsaintjacques commented on a change in pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartioning

Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on a change in pull request #7608:
URL: https://github.com/apache/arrow/pull/7608#discussion_r449148761



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -646,15 +657,26 @@ class HivePartitioningFactory : public PartitioningFactory {
       }
     }
 
-    return impl.Finish(&dictionaries_);
+    auto schema_result = impl.Finish(&dictionaries_);
+    field_names_ = impl.FieldNames();
+    return schema_result;
   }
 
   Result<std::shared_ptr<Partitioning>> Finish(
       const std::shared_ptr<Schema>& schema) const override {
-    return std::shared_ptr<Partitioning>(new HivePartitioning(schema, dictionaries_));
+    for (FieldRef ref : field_names_) {

Review comment:
       Absolutely, the first line of this method should just call
   
   ```
   auto field_names = FieldNames();
   ```
   
   and replace occurrences of the private member.



