
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/11 21:05:28 UTC

[GitHub] [arrow-rs] tustvold opened a new issue #1032: Reduce Public Parquet API

tustvold opened a new issue #1032:
URL: https://github.com/apache/arrow-rs/issues/1032

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

The current API for the parquet crate is rather large, and exposes quite a lot of implementation detail.

This has a couple of implications:

* It makes it hard to iterate on the crate without making breaking changes to public APIs
* It adds to users' cognitive load, as they have to work out which APIs to use

Some examples of this:

* The `util` module contains an assortment of unrelated things: a hash implementation, maths functions, memory tracking, etc.
* The `compression` module
* `data_type::AsBytes`, `data_type::SliceAsBytes`, `data_type::SliceAsBytesDataType`
* `data_type::DataType`, `ColumnReaderImpl`, `RecordReader`
* `schema::types::to_thrift`

**Describe the solution you'd like**

I'm not familiar enough with the design of the crate to authoritatively weigh in on what should or shouldn't be public, however, it is my observation that a number of the APIs don't appear to be optimised for external consumption.

My **personal** preference would be to make everything below the file-level APIs (i.e. `SerializedFileReader`, `ParquetFileArrowReader`, `RowIter`) crate-local. This would have the benefit of being unambiguous and easy to communicate and maintain.

This would obviously need to be made in a major arrow release, the next of which I believe is in January 2022 (@alamb could maybe confirm). I don't know if there are people making use of the lower-level APIs operating on columns, row groups, column chunks, pages, etc. However, any APIs made private could be made public again in a point release based on user feedback.

I think this touches on the objectives for the crate: is the intent to provide APIs for manipulating parquet files, or APIs for implementing parquet readers and writers for your own custom in-memory format? If the latter, this change would be at odds with it, but I'm not sure this is the case.
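
As a rough illustration of the proposal above, here is a minimal sketch of what "file-level public, everything else crate-local" could look like using Rust's `pub(crate)` visibility. All names here (`file`, `page`, `FileReader`, `decode`) are hypothetical and are not the parquet crate's real items or layout; the "decoding" is a placeholder.

```rust
// Hypothetical layout: these modules and types are illustrative only
// and do not correspond to real items in the parquet crate.

mod file {
    /// File-level entry point: the only type meant for external users.
    pub struct FileReader {
        pages: Vec<Vec<i32>>,
    }

    impl FileReader {
        pub fn new(pages: Vec<Vec<i32>>) -> Self {
            Self { pages }
        }

        /// Public API: returns every value in the file, in page order.
        pub fn values(&self) -> Vec<i32> {
            self.pages
                .iter()
                .flat_map(|p| crate::page::decode(p))
                .collect()
        }
    }
}

mod page {
    /// Page-level helper: `pub(crate)` keeps it usable anywhere inside
    /// the crate but invisible to downstream crates, so it can change
    /// without a breaking release.
    pub(crate) fn decode(raw: &[i32]) -> Vec<i32> {
        // Placeholder for real decoding logic: just copies the values.
        raw.to_vec()
    }
}

// Re-export only the file-level type at the crate root.
pub use file::FileReader;

fn main() {
    let reader = FileReader::new(vec![vec![1, 2], vec![3]]);
    println!("{:?}", reader.values()); // prints [1, 2, 3]
}
```

If users later turn out to need a lower-level item, changing `pub(crate)` to `pub` is an additive change, which is why it could plausibly happen outside a major release.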

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on issue #1032: Reduce Public Parquet API

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1032:
URL: https://github.com/apache/arrow-rs/issues/1032#issuecomment-1031511221


   There will be some natural tension between exposing all internals of the parquet reader for advanced use cases and the ability to make non-breaking or minimally breaking changes to its implementation. As long as we are deliberate as we find the balance, I think we'll be good.



[GitHub] [arrow-rs] alamb commented on issue #1032: Reduce Public Parquet API

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1032:
URL: https://github.com/apache/arrow-rs/issues/1032#issuecomment-1024116811


   IOx did as well: https://github.com/influxdata/influxdb_iox/blob/main/parquet_file/Cargo.toml#L20



[GitHub] [arrow-rs] asayers commented on issue #1032: Reduce Public Parquet API

Posted by GitBox <gi...@apache.org>.

asayers commented on issue #1032:
URL: https://github.com/apache/arrow-rs/issues/1032#issuecomment-1023952706


   Crowd here. Just letting you know that [sqlite2parquet](https://docs.rs/sqlite2parquet/latest/sqlite2parquet/) had to enable the "experimental" feature flag when upgrading to 8.0.0, in order to get `DataType`, `ByteArray`, `FixedLenByteArray`, and `Int96`.



[GitHub] [arrow-rs] tustvold commented on issue #1032: Reduce Public Parquet API

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #1032:
URL: https://github.com/apache/arrow-rs/issues/1032#issuecomment-1023995119


   @asayers thank you for reporting, I'll take a look later today and see what we can do 👍



[GitHub] [arrow-rs] alamb closed issue #1032: Reduce Public Parquet API

Posted by GitBox <gi...@apache.org>.

alamb closed issue #1032:
URL: https://github.com/apache/arrow-rs/issues/1032


   



[GitHub] [arrow-rs] zeevm commented on issue #1032: Reduce Public Parquet API

Posted by GitBox <gi...@apache.org>.

zeevm commented on issue #1032:
URL: https://github.com/apache/arrow-rs/issues/1032#issuecomment-1030837765


   FWIW, I make use of all levels of the API, through `FileReader`, `ColumnReader` and even `PageReader`.



[GitHub] [arrow-rs] alamb commented on issue #1032: Reduce Public Parquet API

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1032:
URL: https://github.com/apache/arrow-rs/issues/1032#issuecomment-992952108


   > I don't know if there are people making use of the lower-level APIs operating on columns, row groups, column chunks, pages, etc... However, any APIs made private could be made public again in a point-release based on user feedback.
   
   I don't know either. The idea of marking everything lower-level as private and then making it `pub` again as use cases arise seems reasonable to me, though it is a bit of an annoying crowd-sourcing exercise (annoying to the crowd, that is).
   
   > I think this sort of touches on the objectives for the crate, is the intent to provide APIs for manipulating parquet files, or APIs for implementing parquet readers and writers for your own custom in-memory format. If the latter, this change would be at odds with it, but I'm not sure this is the case?
   
   I think @sunchao or @nevi-me are probably the most knowledgeable about the goals of the crate, so perhaps they would want to comment.
   

