Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/08 16:55:53 UTC

[GitHub] [arrow] bkietz commented on a change in pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartitioning

bkietz commented on a change in pull request #7608:
URL: https://github.com/apache/arrow/pull/7608#discussion_r451690654



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -646,15 +657,26 @@ class HivePartitioningFactory : public PartitioningFactory {
       }
     }
 
-    return impl.Finish(&dictionaries_);
+    auto schema_result = impl.Finish(&dictionaries_);
+    field_names_ = impl.FieldNames();
+    return schema_result;

Review comment:
       `Finish` doesn't mutate `name_to_index_`, which is the only data member accessed by `FieldNames()`, so I don't see why `Finish` needs to be called first:
   ```suggestion
       field_names_ = impl.FieldNames();
       return impl.Finish(&dictionaries_);
   ```
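
The ordering argument can be illustrated with a small stand-in. This is a minimal sketch, not Arrow's actual implementation: `InspectionSketch`, `GetOrInsertField`, and the string-valued dictionaries are hypothetical names introduced for illustration. It assumes, as the comment above states, that `FieldNames()` reads only `name_to_index_` and that `Finish` never writes to it, so the two calls can be made in either order.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the inspection helper; not Arrow's actual class.
class InspectionSketch {
 public:
  // Record a field name the first time it is seen and return its index.
  int GetOrInsertField(const std::string& name) {
    const int next_index = static_cast<int>(name_to_index_.size());
    // emplace leaves the map unchanged if the name is already present.
    return name_to_index_.emplace(name, next_index).first->second;
  }

  // Reads only name_to_index_; returns names ordered by insertion index.
  std::vector<std::string> FieldNames() const {
    std::vector<std::string> names(name_to_index_.size());
    for (const auto& entry : name_to_index_) {
      names[entry.second] = entry.first;
    }
    return names;
  }

  // Stand-in for Finish: fills the output argument and returns the field
  // count. Declared const, so the compiler rejects any write to
  // name_to_index_.
  int Finish(std::vector<std::string>* dictionaries) const {
    assert(dictionaries != nullptr);
    dictionaries->assign(name_to_index_.size(), "placeholder dictionary");
    return static_cast<int>(name_to_index_.size());
  }

 private:
  std::map<std::string, int> name_to_index_;
};

// Returns true when FieldNames() gives the same result before and after
// Finish(), i.e. when the order of the two calls does not matter.
bool FieldNamesUnaffectedByFinish(const InspectionSketch& impl) {
  const std::vector<std::string> before = impl.FieldNames();
  std::vector<std::string> dictionaries;
  impl.Finish(&dictionaries);
  return before == impl.FieldNames();
}
```

Marking the stand-in `Finish` as `const` is one way to make the "doesn't mutate" claim checkable by the compiler; whether that is practical for the real class depends on details not shown in this diff.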

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -646,15 +657,26 @@ class HivePartitioningFactory : public PartitioningFactory {
       }
     }
 
-    return impl.Finish(&dictionaries_);
+    auto schema_result = impl.Finish(&dictionaries_);
+    field_names_ = impl.FieldNames();
+    return schema_result;
   }
 
   Result<std::shared_ptr<Partitioning>> Finish(
       const std::shared_ptr<Schema>& schema) const override {
-    return std::shared_ptr<Partitioning>(new HivePartitioning(schema, dictionaries_));
+    for (FieldRef ref : field_names_) {
+      // ensure all of field_names_ are present in schema
+      RETURN_NOT_OK(ref.FindOne(*schema).status());
+    }
+
+    // drop fields which aren't in field_names_
+    auto out_schema = SchemaFromColumnNames(schema, field_names_);
+
+    return std::make_shared<HivePartitioning>(std::move(out_schema), dictionaries_);

Review comment:
       The check against `field_names_` is only relevant if `dictionaries_` is non-empty, which can only occur if `Inspect` has been called (in which case `field_names_` has been initialized):
   ```suggestion
       if (dictionaries_.empty()) {
         return std::make_shared<HivePartitioning>(schema, dictionaries_);
       } else {
         for (FieldRef ref : field_names_) {
           // ensure all of field_names_ are present in schema
           RETURN_NOT_OK(ref.FindOne(*schema).status());
         }
   
         // drop fields which aren't in field_names_
         auto out_schema = SchemaFromColumnNames(schema, field_names_);
   
         return std::make_shared<HivePartitioning>(std::move(out_schema), dictionaries_);
       }
   ```
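
The control flow of the suggestion can be tried out in isolation. This is a minimal sketch under stated assumptions: a schema is modeled as a plain list of column names, dictionaries as opaque strings, and `ProjectSchemaForPartitioning` is a hypothetical helper standing in for the `FindOne` check plus `SchemaFromColumnNames`. It is not Arrow's API.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in: a "schema" here is just an ordered list of column names.
using SchemaSketch = std::vector<std::string>;

// Writes the schema to hand to the partitioning into *out and returns true,
// or returns false (leaving *out untouched) if an inspected field is missing.
bool ProjectSchemaForPartitioning(const SchemaSketch& schema,
                                  const std::vector<std::string>& field_names,
                                  const std::vector<std::string>& dictionaries,
                                  SchemaSketch* out) {
  assert(out != nullptr);

  if (dictionaries.empty()) {
    // Nothing was inferred during inspection, so field_names may be
    // uninitialized; pass the schema through unchanged.
    *out = schema;
    return true;
  }

  SchemaSketch projected;
  for (const std::string& name : field_names) {
    // Ensure every inspected field name is present in the schema.
    if (std::find(schema.begin(), schema.end(), name) == schema.end()) {
      return false;
    }
    // Keep only fields that appear in field_names.
    projected.push_back(name);
  }
  *out = projected;
  return true;
}
```

With empty dictionaries the schema passes through as-is, including columns that are not in `field_names`; with non-empty dictionaries, extra columns are dropped and a missing inspected field is reported as a failure.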



