
Posted to github@arrow.apache.org by "sundy-li (via GitHub)" <gi...@apache.org> on 2023/05/28 04:40:05 UTC

[GitHub] [arrow-rs] sundy-li opened a new pull request, #4299: chore: export fn parquet_to_array_schema_and_fields

sundy-li opened a new pull request, #4299:
URL: https://github.com/apache/arrow-rs/pull/4299

   # Which issue does this PR close?
   
   
   Closes #4298
   
   # Rationale for this change
    
   
   # What changes are included in this PR?
   
   
   Export the function `parquet_to_array_schema_and_fields`.
   
   # Are there any user-facing changes?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] sundy-li commented on pull request #4299: chore: export fn parquet_to_array_schema_and_fields

Posted by "sundy-li (via GitHub)" <gi...@apache.org>.

sundy-li commented on PR #4299:
URL: https://github.com/apache/arrow-rs/pull/4299#issuecomment-1567110449

   Thanks. We'd like arrow-rs to expose an API that covers what this method does.
   
   https://github.com/datafuselabs/databend/blob/529d184808eecbf1cd09ce31f3629553478f8ef2/src/query/storages/fuse/src/io/read/block/block_reader_parquet_deserialize.rs#L86-L172



[GitHub] [arrow-rs] sundy-li commented on pull request #4299: chore: export fn parquet_to_array_schema_and_fields

Posted by "sundy-li (via GitHub)" <gi...@apache.org>.

sundy-li commented on PR #4299:
URL: https://github.com/apache/arrow-rs/pull/4299#issuecomment-1566325254

   Because we don't need the high-level API to read the metadata. The metadata is stored outside the Parquet files in our database system. Once the metadata and row-group index are read, row groups are pruned, and we only need to read the data one row group at a time, using the Bytes of each column.
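
The pruning step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the comment does not say how pruning is done, so the `RowGroupStats` struct, its min/max fields, and the `prune_row_groups` function are hypothetical names, not part of arrow-rs or Databend.

```rust
/// Hypothetical per-row-group statistics kept outside the Parquet file.
/// The field layout is an assumption made for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct RowGroupStats {
    pub index: usize,
    pub min: i64,
    pub max: i64,
}

/// Return the indices of row groups whose [min, max] range overlaps the
/// query range [lo, hi]. Row groups that cannot contain a match are skipped,
/// so their bytes never need to be fetched.
pub fn prune_row_groups(stats: &[RowGroupStats], lo: i64, hi: i64) -> Vec<usize> {
    stats
        .iter()
        .filter(|s| s.max >= lo && s.min <= hi)
        .map(|s| s.index)
        .collect()
}
```

Only the surviving indices would then drive the per-column byte reads mentioned in the comment.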



[GitHub] [arrow-rs] tustvold commented on pull request #4299: chore: export fn parquet_to_array_schema_and_fields

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on PR #4299:
URL: https://github.com/apache/arrow-rs/pull/4299#issuecomment-1566271541

   I'm a little apprehensive about exposing these APIs. They are fairly low-level and not really designed to be part of the public API surface. The issue states:
   
   > We want to use some low-level APIs to read parquet files by rowgroup 
   
   The higher-level APIs support [filtering by row group](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_groups), among other things. Is there a particular piece of functionality that is missing from this?



[GitHub] [arrow-rs] tustvold commented on pull request #4299: chore: export fn parquet_to_array_schema_and_fields

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on PR #4299:
URL: https://github.com/apache/arrow-rs/pull/4299#issuecomment-1566695201

   You can override how metadata is loaded by providing a custom `AsyncFileReader`. Is this insufficient?



[GitHub] [arrow-rs] sundy-li commented on pull request #4299: chore: export fn parquet_to_array_schema_and_fields

Posted by "sundy-li (via GitHub)" <gi...@apache.org>.

sundy-li commented on PR #4299:
URL: https://github.com/apache/arrow-rs/pull/4299#issuecomment-1567035679

   Yes, `AsyncFileReader`'s `get_metadata` can work. 
   
   But:
   1. We don't store the whole metadata of the Parquet files. We store only the `Vec<ColumnChunkMetaData>` of each leaf column; because we write only one row group, the metadata is much simpler and smaller.
   2. The `ParquetRecordBatchStream` runs its IO tasks in a dedicated async runtime and decodes in blocking threads. But we have completely separated the two processes: we first fetch the Bytes in a dedicated async runtime, then send the results to a thread pool that decodes them into arrays.
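
The separation described in point 2 might look roughly like the sketch below. It is a standard-library-only illustration, not Databend's or arrow-rs's implementation: a plain thread stands in for the dedicated async runtime, `ColumnChunkBytes`, `decode_chunk`, and `fetch_then_decode` are hypothetical names, and the "decoder" simply reads little-endian `i32` values rather than Parquet pages.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Stand-in for the raw bytes of one column chunk (hypothetical type).
pub struct ColumnChunkBytes {
    pub column: usize,
    pub bytes: Vec<u8>,
}

/// Stand-in decoder: interprets the bytes as little-endian i32 values.
/// A real reader would decode Parquet pages into Arrow arrays here.
pub fn decode_chunk(chunk: &ColumnChunkBytes) -> Vec<i32> {
    chunk
        .bytes
        .chunks_exact(4)
        .map(|b| i32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

/// Run the fetch stage on one thread and the decode stage on `workers`
/// threads, connected by channels. Returns results sorted by column index.
pub fn fetch_then_decode(
    chunks: Vec<ColumnChunkBytes>,
    workers: usize,
) -> Vec<(usize, Vec<i32>)> {
    let (bytes_tx, bytes_rx) = mpsc::channel::<ColumnChunkBytes>();
    let (out_tx, out_rx) = mpsc::channel::<(usize, Vec<i32>)>();
    let bytes_rx = Arc::new(Mutex::new(bytes_rx));

    // Fetch stage: in the described system this would be async IO.
    let fetcher = thread::spawn(move || {
        for chunk in chunks {
            if bytes_tx.send(chunk).is_err() {
                break;
            }
        }
        // Dropping bytes_tx here closes the channel and stops the workers.
    });

    // Decode stage: a small pool of threads sharing one receiver.
    let mut handles = Vec::new();
    for _ in 0..workers.max(1) {
        let rx = Arc::clone(&bytes_rx);
        let tx = out_tx.clone();
        handles.push(thread::spawn(move || loop {
            // The lock is released at the end of this statement.
            let next = rx.lock().unwrap().recv();
            match next {
                Ok(chunk) => {
                    let _ = tx.send((chunk.column, decode_chunk(&chunk)));
                }
                Err(_) => break, // channel closed: no more chunks
            }
        }));
    }
    drop(out_tx); // only the workers hold senders now

    fetcher.join().unwrap();
    for handle in handles {
        handle.join().unwrap();
    }
    let mut decoded: Vec<_> = out_rx.iter().collect();
    decoded.sort_by_key(|(column, _)| *column);
    decoded
}
```

The point of the two channels is that the fetch stage never waits on decoding, which is the property the comment says the combined stream does not give them.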
   
    



[GitHub] [arrow-rs] sundy-li closed pull request #4299: chore: export fn parquet_to_array_schema_and_fields

Posted by "sundy-li (via GitHub)" <gi...@apache.org>.

sundy-li closed pull request #4299: chore: export fn parquet_to_array_schema_and_fields
URL: https://github.com/apache/arrow-rs/pull/4299

