You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "benibus (via GitHub)" <gi...@apache.org> on 2023/05/26 21:47:15 UTC

[GitHub] [arrow] benibus opened a new pull request, #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

benibus opened a new pull request, #35798:
URL: https://github.com/apache/arrow/pull/35798

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   When setting projections/filters for the file system scanner, the Parquet implementation requires that all materialized `FieldRef`s be position-independent (containing only names). However, it may be useful to support index-based field lookups as well - assuming the dataset schema is known.
   
   ### What changes are included in this PR?
   
   Adds a translation step for field refs prior to looking them up in the fragment schema. A known dataset schema is required to do this reliably, however (since the fragment schema may be a sub/superset of the dataset schema) - so in the absence of one, we fall back to the existing behavior.
   
   ### Are these changes tested?
   
   Yes (tests are included)
   
   ### Are there any user-facing changes?
   
   Yes
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a diff in pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on code in PR #35798:
URL: https://github.com/apache/arrow/pull/35798#discussion_r1235608494


##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -694,5 +694,55 @@ TEST(TestParquetStatistics, NullMax) {
   EXPECT_EQ(stat_expression->ToString(), "(x >= 1)");
 }
 
+// Tests round-trip projection with nested/indexed FieldRefs
+// https://github.com/apache/arrow/issues/35579
+TEST(TestRoundTrip, ProjectedFieldRefs) {

Review Comment:
   Ok.  I don't feel too strongly here.  So feel free to let me know which you prefer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.

conbench-apache-arrow[bot] commented on PR #35798:
URL: https://github.com/apache/arrow/pull/35798#issuecomment-1610421912

   Conbench analyzed the 6 benchmark runs on commit `10eedbe6`.
   
   There were 5 benchmark results indicating a performance regression:
   
   - Commit Run on `ursa-thinkcentre-m75q` at [2023-06-26 23:32:20Z](http://conbench.ursa.dev/compare/runs/737e90ba9e0e404894eaa88b16b43837...6de8c1cf0ab045839562c49a24701a2f/)
     - [params=<Repetition::REQUIRED, Compression::LZ4>/65536/1024, source=cpp-micro, suite=parquet-column-io-benchmark](http://conbench.ursa.dev/compare/benchmarks/06499f9d7b0d795a8000ea055f180236...0649a20aacd876f580001708750361b4)
   
   - Commit Run on `ursa-i9-9960x` at [2023-06-27 21:57:10Z](http://conbench.ursa.dev/compare/runs/d1fcca6e86bd430fa751225e9ed389ed...b4d54a2afbf54c87887f6fce2e2ccada/)
     - [engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-04, scale_factor=1](http://conbench.ursa.dev/compare/benchmarks/0649b459f7067a2580008dd660c0ac4f...0649b74ede3f728b8000cce730aa8a5a)
   - and 3 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/14608447323) has more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on PR #35798:
URL: https://github.com/apache/arrow/pull/35798#issuecomment-1605165429

   CI issues seem unrelated


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.

lidavidm commented on PR #35798:
URL: https://github.com/apache/arrow/pull/35798#issuecomment-1581196771

   There's build failures, but I think they're unrelated?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.

lidavidm commented on PR #35798:
URL: https://github.com/apache/arrow/pull/35798#issuecomment-1572210883

   I'm not sure what the problem is @mapleFU? We still support named refs even with this PR. This just allows the user to also provide indices, and we resolve them into names against the overall dataset schema, so that should actually allow for schema evolution if the positions of those fields changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on a diff in pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.

lidavidm commented on code in PR #35798:
URL: https://github.com/apache/arrow/pull/35798#discussion_r1213283140


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -224,6 +224,41 @@ Status ResolveOneFieldRef(
   return Status::OK();
 }
 
+bool IsValidFieldRef(const FieldRef& ref) {
+  if (ref.IsName()) return true;
+  if (const auto* nested_refs = ref.nested_refs()) {
+    for (const auto& nested_ref : *nested_refs) {
+      if (!nested_ref.IsName()) return false;
+    }
+    return true;
+  }
+  return false;
+}
+
+// Converts a field ref into a position-independent ref (containing only a sequence of
+// names) based on the dataset schema. Returns `false` if no conversion was needed.
+Result<bool> ValidateFieldRef(const FieldRef& ref, const Schema& dataset_schema,

Review Comment:
   Additionally the return value isn't ever used so maybe just return Status?



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -224,6 +224,41 @@ Status ResolveOneFieldRef(
   return Status::OK();
 }
 
+bool IsValidFieldRef(const FieldRef& ref) {
+  if (ref.IsName()) return true;
+  if (const auto* nested_refs = ref.nested_refs()) {
+    for (const auto& nested_ref : *nested_refs) {
+      if (!nested_ref.IsName()) return false;
+    }
+    return true;
+  }
+  return false;
+}
+
+// Converts a field ref into a position-independent ref (containing only a sequence of
+// names) based on the dataset schema. Returns `false` if no conversion was needed.
+Result<bool> ValidateFieldRef(const FieldRef& ref, const Schema& dataset_schema,

Review Comment:
   nit: the naming is a little misleading; for instance ValidateFieldRef might be more something like MaybeConvertFieldRef?



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -224,6 +224,41 @@ Status ResolveOneFieldRef(
   return Status::OK();
 }
 
+bool IsValidFieldRef(const FieldRef& ref) {

Review Comment:
   maybe IsNamedFieldRef?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.

lidavidm commented on PR #35798:
URL: https://github.com/apache/arrow/pull/35798#issuecomment-1572212022

   I think right now, of course, we can't yet unify files of different schemas into a consistent schema. But this doesn't affect that either way.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] benibus commented on a diff in pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "benibus (via GitHub)" <gi...@apache.org>.

benibus commented on code in PR #35798:
URL: https://github.com/apache/arrow/pull/35798#discussion_r1210568744


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -224,6 +224,41 @@ Status ResolveOneFieldRef(
   return Status::OK();
 }
 
+bool IsValidFieldRef(const FieldRef& ref) {
+  if (ref.IsName()) return true;
+  if (const auto* nested_refs = ref.nested_refs()) {
+    for (const auto& nested_ref : *nested_refs) {
+      if (!nested_ref.IsName()) return false;
+    }
+    return true;
+  }
+  return false;
+}
+
+// Converts a field ref into a position-independent ref (containing only a sequence of
+// names) based on the dataset schema. Returns `false` if no conversion was needed.
+Result<bool> ValidateFieldRef(const FieldRef& ref, const Schema& dataset_schema,
+                              FieldRef* out) {
+  if (ARROW_PREDICT_TRUE(IsValidFieldRef(ref))) {

Review Comment:
   Mostly because all existing user code uses named lookups. Plus, I'd imagine indexed lookups would be fairly rare in practice since it requires knowing the exact structure (i.e. field order) of the dataset schema upfront, which may not be predictable if it was inferred from multiple files.
   
   (It's probably not very consequential though)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] mapleFU commented on pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU commented on PR #35798:
URL: https://github.com/apache/arrow/pull/35798#issuecomment-1569685796

   Yeah, I know what you mean, however, Parquet files in a dataset might has different schema, the most typical case is at: https://iceberg.apache.org/docs/latest/evolution/
   
   Assume user insert a column, name might be better than FieldIndex, because it can maintain some consistency.
   
   If we're sure we don't need to support that case, or user can make sure that file has same schema, then I'm +1 on this patch


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] benibus commented on a diff in pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "benibus (via GitHub)" <gi...@apache.org>.

benibus commented on code in PR #35798:
URL: https://github.com/apache/arrow/pull/35798#discussion_r1240156764


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -224,6 +224,41 @@ Status ResolveOneFieldRef(
   return Status::OK();
 }
 
+bool IsNamedFieldRef(const FieldRef& ref) {
+  if (ref.IsName()) return true;
+  if (const auto* nested_refs = ref.nested_refs()) {
+    for (const auto& nested_ref : *nested_refs) {
+      if (!nested_ref.IsName()) return false;
+    }
+    return true;
+  }
+  return false;
+}

Review Comment:
   Moved it into `FieldRef` in the update. Looking at it now though, I suspect it may be too niche to justify its place there - at least on its own.
   
   Methods that transform a ref into a flat vector of names or indices might be more useful in general (but less trivial, of course).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] mapleFU commented on pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU commented on PR #35798:
URL: https://github.com/apache/arrow/pull/35798#issuecomment-1576598477

   Okay, I think currently `parquet::SchemaManifest` can build the bridge from arrow to parquet and parquet to arrow, but for file with different schema, it need to follow some rules, maybe by `"PARQUET:field_id"` or others. I'm ok on this patch now!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] benibus commented on a diff in pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "benibus (via GitHub)" <gi...@apache.org>.

benibus commented on code in PR #35798:
URL: https://github.com/apache/arrow/pull/35798#discussion_r1228415712


##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -694,5 +694,55 @@ TEST(TestParquetStatistics, NullMax) {
   EXPECT_EQ(stat_expression->ToString(), "(x >= 1)");
 }
 
+// Tests round-trip projection with nested/indexed FieldRefs
+// https://github.com/apache/arrow/issues/35579
+TEST(TestRoundTrip, ProjectedFieldRefs) {

Review Comment:
   Yeah, I only did it this way to more or less match the example in the original issue. Using the existing fixtures would certainly be more convenient. AFAIK the only difference is that it would test the intermediate batch rather than doing the full round trip on the projection - but that's consistent with the other tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.

lidavidm commented on PR #35798:
URL: https://github.com/apache/arrow/pull/35798#issuecomment-1581222872

   Probably what happened in https://github.com/conda-forge/cpp-opentelemetry-sdk-feedstock/issues/29
   
   We need to set WITH_STL=ON for the bundled OpenTelemetry build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] mapleFU commented on pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU commented on PR #35798:
URL: https://github.com/apache/arrow/pull/35798#issuecomment-1574120874

   OK, thanks for your explanition @lidavidm .
   I think when schema evolution happens, seek by index might provide unconsistent result, and seek by name doesn't has that problem. But if we we can't yet unify files of different schemas into a consistent schema, I'm +1 on this patch


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] mapleFU commented on a diff in pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU commented on code in PR #35798:
URL: https://github.com/apache/arrow/pull/35798#discussion_r1209431926


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -224,6 +224,41 @@ Status ResolveOneFieldRef(
   return Status::OK();
 }
 
+bool IsValidFieldRef(const FieldRef& ref) {
+  if (ref.IsName()) return true;
+  if (const auto* nested_refs = ref.nested_refs()) {
+    for (const auto& nested_ref : *nested_refs) {
+      if (!nested_ref.IsName()) return false;
+    }
+    return true;
+  }
+  return false;
+}
+
+// Converts a field ref into a position-independent ref (containing only a sequence of
+// names) based on the dataset schema. Returns `false` if no conversion was needed.
+Result<bool> ValidateFieldRef(const FieldRef& ref, const Schema& dataset_schema,
+                              FieldRef* out) {
+  if (ARROW_PREDICT_TRUE(IsValidFieldRef(ref))) {

Review Comment:
   Since you support LookUp by index, why `PREDICT_TRUE` is used here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a diff in pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on code in PR #35798:
URL: https://github.com/apache/arrow/pull/35798#discussion_r1240528021


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -224,6 +224,41 @@ Status ResolveOneFieldRef(
   return Status::OK();
 }
 
+bool IsNamedFieldRef(const FieldRef& ref) {
+  if (ref.IsName()) return true;
+  if (const auto* nested_refs = ref.nested_refs()) {
+    for (const auto& nested_ref : *nested_refs) {
+      if (!nested_ref.IsName()) return false;
+    }
+    return true;
+  }
+  return false;
+}

Review Comment:
   I don't think it's too niche.  I feel like I have run into situations a few times now in the scanner where I've needed to know if a ref is all-names, all-indices, or mixed (a lot of the new scanner stuff normalizes to all-indices).  We do have FieldPath already which is a flat vector of indices.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.

lidavidm commented on PR #35798:
URL: https://github.com/apache/arrow/pull/35798#issuecomment-1574188447

   What (would) happen is we resolve the file schemas into an overall dataset schema, _then_ resolve any indices against that unified schema _back into names_, so that issue shouldn't come up


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] benibus commented on pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "benibus (via GitHub)" <gi...@apache.org>.

benibus commented on PR #35798:
URL: https://github.com/apache/arrow/pull/35798#issuecomment-1568782245

   > What would happen if manifest not matches input in nested schema? Would extra checking is required here?
   Assume we have schema evolution in some of the dataset file, how to handle these file with mismatched Field?
   
   We shouldn't need any additional checks there, I don't think. The routine for resolving discrepancies between the dataset schema and file manifest (`ResolveOneFieldRef`) is unchanged - and the logic is still in terms of field names exclusively. All this PR really does is convert indexed refs into named refs (using the dataset schema) before those checks occur.
   
   That case should probably be reflected in the tests though... (anecdotally, it _did_ work in my ad hoc testing)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] benibus commented on pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "benibus (via GitHub)" <gi...@apache.org>.

benibus commented on PR #35798:
URL: https://github.com/apache/arrow/pull/35798#issuecomment-1581208146

   > There's build failures, but I think they're unrelated?
   
   I think so... I'm seeing similar macos failures elsewhere after a fresh rebase


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace merged pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace merged PR #35798:
URL: https://github.com/apache/arrow/pull/35798


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #35798:
URL: https://github.com/apache/arrow/pull/35798#issuecomment-1564994215

   * Closes: #35579


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] benibus commented on a diff in pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "benibus (via GitHub)" <gi...@apache.org>.

benibus commented on code in PR #35798:
URL: https://github.com/apache/arrow/pull/35798#discussion_r1228396242


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -248,7 +283,12 @@ Result<std::vector<int>> InferColumnProjection(const parquet::arrow::FileReader&
   }
 
   std::vector<int> columns_selection;
-  for (const auto& ref : field_refs) {
+  for (auto& ref : field_refs) {
+    // In the (unlikely) absence of a known dataset schema, we require that all
+    // materialized refs are named.
+    if (options.dataset_schema) {

Review Comment:
   Haha I figured as much... That being said, some of the existing tests do leave it null.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a diff in pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on code in PR #35798:
URL: https://github.com/apache/arrow/pull/35798#discussion_r1228209673


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -248,7 +283,12 @@ Result<std::vector<int>> InferColumnProjection(const parquet::arrow::FileReader&
   }
 
   std::vector<int> columns_selection;
-  for (const auto& ref : field_refs) {
+  for (auto& ref : field_refs) {
+    // In the (unlikely) absence of a known dataset schema, we require that all
+    // materialized refs are named.
+    if (options.dataset_schema) {

Review Comment:
   Heh, at this point, if we don't have a dataset schema, I think something has gone very wrong.



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -224,6 +224,41 @@ Status ResolveOneFieldRef(
   return Status::OK();
 }
 
+bool IsNamedFieldRef(const FieldRef& ref) {
+  if (ref.IsName()) return true;
+  if (const auto* nested_refs = ref.nested_refs()) {
+    for (const auto& nested_ref : *nested_refs) {
+      if (!nested_ref.IsName()) return false;
+    }
+    return true;
+  }
+  return false;
+}

Review Comment:
   A minor thing but I wonder if we might want to add this directly to `FieldRef`?



##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -694,5 +694,55 @@ TEST(TestParquetStatistics, NullMax) {
   EXPECT_EQ(stat_expression->ToString(), "(x >= 1)");
 }
 
+// Tests round-trip projection with nested/indexed FieldRefs
+// https://github.com/apache/arrow/issues/35579
+TEST(TestRoundTrip, ProjectedFieldRefs) {

Review Comment:
   We have some base classes / base tests that do some of this boilerplate already.  However, I don't recall off the top of my head how extensible they are (e.g. can you use them to test with this custom schema).  Did you take a look at the existing tests to see if you could modify one instead of the entire round trip?



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -224,6 +224,41 @@ Status ResolveOneFieldRef(
   return Status::OK();
 }
 
+bool IsNamedFieldRef(const FieldRef& ref) {
+  if (ref.IsName()) return true;
+  if (const auto* nested_refs = ref.nested_refs()) {
+    for (const auto& nested_ref : *nested_refs) {
+      if (!nested_ref.IsName()) return false;
+    }
+    return true;
+  }
+  return false;
+}
+
+// Converts a field ref into a position-independent ref (containing only a sequence of
+// names) based on the dataset schema. Returns `false` if no conversion was needed.
+Status MaybeConvertFieldRef(const FieldRef& ref, const Schema& dataset_schema,

Review Comment:
   Minor nit: Can we return `Result<FieldRef>` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org