You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/06/13 14:22:57 UTC

[GitHub] [arrow] westonpace commented on a diff in pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

westonpace commented on code in PR #35798:
URL: https://github.com/apache/arrow/pull/35798#discussion_r1228209673


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -248,7 +283,12 @@ Result<std::vector<int>> InferColumnProjection(const parquet::arrow::FileReader&
   }
 
   std::vector<int> columns_selection;
-  for (const auto& ref : field_refs) {
+  for (auto& ref : field_refs) {
+    // In the (unlikely) absence of a known dataset schema, we require that all
+    // materialized refs are named.
+    if (options.dataset_schema) {

Review Comment:
   Heh, at this point, if we don't have a dataset schema, I think something has gone very wrong.



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -224,6 +224,41 @@ Status ResolveOneFieldRef(
   return Status::OK();
 }
 
+bool IsNamedFieldRef(const FieldRef& ref) {
+  if (ref.IsName()) return true;
+  if (const auto* nested_refs = ref.nested_refs()) {
+    for (const auto& nested_ref : *nested_refs) {
+      if (!nested_ref.IsName()) return false;
+    }
+    return true;
+  }
+  return false;
+}

Review Comment:
   A minor thing but I wonder if we might want to add this directly to `FieldRef`?



##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -694,5 +694,55 @@ TEST(TestParquetStatistics, NullMax) {
   EXPECT_EQ(stat_expression->ToString(), "(x >= 1)");
 }
 
+// Tests round-trip projection with nested/indexed FieldRefs
+// https://github.com/apache/arrow/issues/35579
+TEST(TestRoundTrip, ProjectedFieldRefs) {

Review Comment:
   We have some base classes / base tests that do some of this boilerplate already.  However, I don't recall off the top of my head how extensible they are (e.g. can you use them to test with this custom schema).  Did you take a look at the existing tests to see if you could modify one instead of the entire round trip?



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -224,6 +224,41 @@ Status ResolveOneFieldRef(
   return Status::OK();
 }
 
+bool IsNamedFieldRef(const FieldRef& ref) {
+  if (ref.IsName()) return true;
+  if (const auto* nested_refs = ref.nested_refs()) {
+    for (const auto& nested_ref : *nested_refs) {
+      if (!nested_ref.IsName()) return false;
+    }
+    return true;
+  }
+  return false;
+}
+
+// Converts a field ref into a position-independent ref (containing only a sequence of
+// names) based on the dataset schema. Returns `false` if no conversion was needed.
+Status MaybeConvertFieldRef(const FieldRef& ref, const Schema& dataset_schema,

Review Comment:
   Minor nit: Can we return `Result<FieldRef>` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org