You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/08 12:30:12 UTC

[GitHub] [arrow] thisisnic commented on a diff in pull request #14444: ARROW-17784: [C++] Opening a dataset where partitioning variable is already in the dataset should error differently

thisisnic commented on code in PR #14444:
URL: https://github.com/apache/arrow/pull/14444#discussion_r1016569924


##########
cpp/src/arrow/dataset/discovery.cc:
##########
@@ -236,25 +236,35 @@ Result<std::vector<std::shared_ptr<Schema>>> FileSystemDatasetFactory::InspectSc
     InspectOptions options) {
   std::vector<std::shared_ptr<Schema>> schemas;
 
+  ARROW_ASSIGN_OR_RAISE(auto partition_schema,
+                        options_.partitioning.GetOrInferSchema(
+                            StripPrefixAndFilename(files_, options_.partition_base_dir)));
+
   const bool has_fragments_limit = options.fragments >= 0;
   int fragments = options.fragments;
   for (const auto& info : files_) {
     if (has_fragments_limit && fragments-- == 0) break;
     auto result = format_->Inspect({info, fs_});
+
     if (ARROW_PREDICT_FALSE(!result.ok())) {
       return result.status().WithMessage(
           "Error creating dataset. Could not read schema from '", info.path(),
           "': ", result.status().message(), ". Is this a '", format_->type_name(),
           "' file?");
     }
+
+    if (partition_schema->num_fields()) {
+      auto field_check =
+          result->get()->CanReferenceFieldsByNames(partition_schema->field_names());
+      if (ARROW_PREDICT_FALSE(field_check.ok())) {
+        return Status::Invalid(
+            "Error creating dataset. Partitioning field(s) present in fragment.");
+      }
+    }

Review Comment:
   My preference would be either silently ignoring it or making it configurable (with one of the configurable options being to silently ignore it).
   
   Users might inherit poorly-designed datasets that they have no control over how they've been written, and I'd hate for it to be impossible for them to be able to work with them in Arrow and just get an error.
   
   I'm not sure how/if having a dataset with 2 columns with the same name would work in R as I think it'd trigger an error somewhere (if it didn't, I think we'd maybe want to make it do so?), but if it'd be useful to have that as a feature, we can just catch it in R and add our own custom error, so I don't see a problem in that.
   
   Silently ignoring would be fine too, as we can always add our own error in R if we want to warn users that there's duplication, and it at least allows execution.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org