You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/03/31 16:44:18 UTC

[GitHub] [arrow] westonpace opened a new pull request, #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

westonpace opened a new pull request, #33634:
URL: https://github.com/apache/arrow/pull/33634

   When the parquet reader is performing a partial read it will try and maintain field structure.  So, for example, given a schema of `points: struct<x: int32, y:int32>` and a load of `points.x` it will return `points: struct<x: int32>`.  However, if there is an extension type `points: Point` where `Point` has a storage type `struct<x: int32, y: int32>` then suddenly the reference `points.x` no longer makes sense.
   * Closes: #20385


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #33634:
URL: https://github.com/apache/arrow/pull/33634#issuecomment-1403251774

   Sorry, I don't understand your last comment. Why does it need a new method? (you already can get the storage type / array from an extension type / array) And why would that be meaningless for the Point example? (I think it can makes sense to want to read just one of x or y)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amol- closed pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- closed pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader
URL: https://github.com/apache/arrow/pull/33634


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #33634:
URL: https://github.com/apache/arrow/pull/33634#issuecomment-1385290928

   I think ideally we would still allow reading a field of an extension type, but in that case just return the plain field (not wrapped in an extension type, since that is no longer possible)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #33634:
URL: https://github.com/apache/arrow/pull/33634#issuecomment-1403733507

   Sorry, I will clarify.  If the column `my_points` has the type `Point` and you ask for `my_points.x` then your only option is to return `x: int32`.  However, if the column `my_points` has the type `AnnotatedStruct<x: int32, y: int32>` then it may very well make sense to return `AnnotatedStruct<x: int32>`.
   
   `PointType::WithType(struct<x: int32>)` would return `nullptr`.  `AnnotatedStructType::WithType(struct<x: int32>)` would return something valid (a new data type with the given storage type but reusing the metadata / annotations on the original type).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #33634:
URL: https://github.com/apache/arrow/pull/33634#issuecomment-1402788652

   > I think ideally we would still allow reading a field of an extension type, but in that case just return the plain field (not wrapped in an extension type, since that is no longer possible)
   
   It could be possible but would require another abstract method in `ExtensionType`.  Something like `std::shared_ptr<ExtensionType> ExtensionType::WithType(std::shared_ptr<Type> storage_type)`.  However, in cases like `Point` that would be meaningless.  However, in some cases like `MyCustomStruct` it might make sense.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #33634:
URL: https://github.com/apache/arrow/pull/33634#issuecomment-1380733518

   :warning: GitHub issue #20385 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amol- commented on pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- commented on PR #33634:
URL: https://github.com/apache/arrow/pull/33634#issuecomment-1507186492

   @AlenkaF @wjones127 can you take a look at this one as soon as you have a chance? It has been pending review for a while and it would be great to lead it to completion as it's an helpful improvement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #33634:
URL: https://github.com/apache/arrow/pull/33634#issuecomment-1549862560

   @westonpace Do you want to update this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #33634:
URL: https://github.com/apache/arrow/pull/33634#issuecomment-1380733464

   * Closes: #20385


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amol- commented on pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- commented on PR #33634:
URL: https://github.com/apache/arrow/pull/33634#issuecomment-1490644527

   Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on a diff in pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #33634:
URL: https://github.com/apache/arrow/pull/33634#discussion_r1165957392


##########
cpp/src/parquet/arrow/reader.cc:
##########
@@ -842,7 +842,16 @@ Status GetReader(const SchemaField& field, const std::shared_ptr<Field>& arrow_f
     auto storage_field = arrow_field->WithType(
         checked_cast<const ExtensionType&>(*arrow_field->type()).storage_type());
     RETURN_NOT_OK(GetReader(field, storage_field, ctx, out));
-    *out = std::make_unique<ExtensionReader>(arrow_field, std::move(*out));
+    if (*out) {
+      auto storage_type = (*out)->field()->type();
+      if (!storage_type->Equals(
+              checked_cast<const ExtensionType&>(*arrow_field->type()).storage_type())) {

Review Comment:
   I think Antoine is correct. This can be simplified.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #33634:
URL: https://github.com/apache/arrow/pull/33634#discussion_r1072056477


##########
cpp/src/parquet/arrow/reader.cc:
##########
@@ -842,7 +842,16 @@ Status GetReader(const SchemaField& field, const std::shared_ptr<Field>& arrow_f
     auto storage_field = arrow_field->WithType(
         checked_cast<const ExtensionType&>(*arrow_field->type()).storage_type());
     RETURN_NOT_OK(GetReader(field, storage_field, ctx, out));
-    *out = std::make_unique<ExtensionReader>(arrow_field, std::move(*out));
+    if (*out) {
+      auto storage_type = (*out)->field()->type();
+      if (!storage_type->Equals(
+              checked_cast<const ExtensionType&>(*arrow_field->type()).storage_type())) {

Review Comment:
   Isn't this simply
   ```suggestion
         if (!storage_type->Equals(storage_field->type()) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #33634:
URL: https://github.com/apache/arrow/pull/33634#issuecomment-1403776508

   Actually, never mind, after thinking this through, a nested reference always discards the parents.  The above behavior would only with Substrait's mask references (though it is still a valid concern, that is for later).
   
   It appears that the C++ parquet reader has the ability to support this kind of "masked projection" but the Arrow parquet reader does not expose an API for it.  I'll have to read through in more detail.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #33634:
URL: https://github.com/apache/arrow/pull/33634#issuecomment-1380734525

   This should fix the issue I believe but it might not be the best fix.  I wonder if there is some place higher up (e.g. where we are converting the path strings into some kind of parquet column references) that we might reject the request.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou merged pull request #33634: GH-20385: [C++][Parquet] Reject partial load of an extension type

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou merged PR #33634:
URL: https://github.com/apache/arrow/pull/33634


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #33634: GH-20385: [C++][Parquet] Reject partial load of an extension type

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #33634:
URL: https://github.com/apache/arrow/pull/33634#issuecomment-1555593531

   Benchmark runs are scheduled for baseline = 37975d52e7bf829f9e5a23a65d584523d7616a4d and contender = f59e37f8a676a13f0bd82a09c211ddb45c6b48ac. f59e37f8a676a13f0bd82a09c211ddb45c6b48ac is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/3a8dfd4d8b6a484b84771fb05199326d...ec7a2dae518f4ab3a2b212495bc79cbe/)
   [Finished :arrow_down:0.15% :arrow_up:0.03%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/e846476bb10c4fd1a97872c0c3f186d3...62ecc1cdb9364a97aae8c62c770c58ad/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/45470e88d00e4ddc81e765f986ae2601...204e64dc3b3744179b66a1aa2b112f57/)
   [Finished :arrow_down:0.6% :arrow_up:0.39%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/6aab720dc3214cf488d2cf3d09c2ec4f...471b2e5093984364b11dd939931764c7/)
   Buildkite builds:
   [Finished] [`f59e37f8` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2890)
   [Finished] [`f59e37f8` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2926)
   [Finished] [`f59e37f8` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2891)
   [Finished] [`f59e37f8` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2916)
   [Finished] [`37975d52` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2889)
   [Finished] [`37975d52` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2925)
   [Finished] [`37975d52` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2890)
   [Finished] [`37975d52` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2915)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #33634: GH-20385: [C++] reject partial loads of an extension type in the parquet reader

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #33634:
URL: https://github.com/apache/arrow/pull/33634#issuecomment-1536868761

   > this LGTM though I think antoine's suggestion should be applied first.
   
   Agreed.  I've let this slide for a while.  I'll try and refresh this next week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org