You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/05 14:54:48 UTC

[GitHub] [arrow-rs] nevi-me opened a new pull request #261: re-export parquet-format

nevi-me opened a new pull request #261:
URL: https://github.com/apache/arrow-rs/pull/261


   # Which issue does this PR close?
   
   Closes ##237 .
   
    # Rationale for this change
   
   We might need to expose more `parquet-format` internals in future, as we already expose the `FileMetadata` struct. Users who wish to use these structs sometimes need to also use the `parquet-format` crate.
   There is a risk that users might end up importing incompatible versions.
   
   Re-exporting this crate would make things simpler for users. 
   
   # What changes are included in this PR?
   
   * Exporting `parquet_format` as `parquet::format`
   * Reusing `parquet::format` wherever we internally use `parquet_format`
   
   # Are there any user-facing changes?
   
   We are exposing a new submodule, `parquet::format`. This is not a breaking change.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jorgecarleitao commented on pull request #261: re-export parquet-format

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #261:
URL: https://github.com/apache/arrow-rs/pull/261#issuecomment-834096310


   Thanks a lot for the investigation, @nevi-me . I agree that it is not trivial :)
   
   Would it be an option to export everything on the list that you presented? E.g. create a `parquet/logical.rs` or something that contains both `LogicalType` and `pub use parquet_format::LogicalType`, together with all the logical types in `parquet_format` (via `pub use parquet_format::X`), and an equivalent approach to `parquet/schema.rs` with the SchemaElement`, `FileMetaData`, etc?
   
   This way we only need to export what is needed, we do not require users to pin the version to the exact one we use, and we do not need to create conversion methods.
   
   If this is too much of a complexity, let's then merge this as is :) I just think that we are making it harder for us long run, and exposing public stuff is usually a one-way street, so I was trying to reduce how much we go down that road.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me commented on pull request #261: re-export parquet-format

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #261:
URL: https://github.com/apache/arrow-rs/pull/261#issuecomment-834079037


   > Does this make sense?
   
   It makes sense, thanks.
   
   In this case, I want to expose `parquet_format::FileMetadata`, but in order to have it public, one also potentially ends up seeing the below, because `FileMetadata` is a root type that then uses all the other types in one way or another.
   
   * SchemaElement
   * > Type
   * > FieldRepetitionType
   * > ConvertedType
   * > LogicalType (this carries all the types)
   * RowGroup (this is also useful to expose)
   * > ColumnChunk
   * >> ColumnMetadata
   * >>> Encoding
   * >>> CompressionCodec
   * >>> Statistics (this is one of the structs that we wanted to expose by returning `FileMetadata`)
   * > SortingColumn
   * KeyValue
   * ColumnOrder
   * > TypeDefinedOrder
   
   I paused when I got to something like this:
   
   ```rust
   // lib.rs
   
   pub mod format;
   
   // format.rs
   
   pub(crate) use parquet_format::*;
   
   /// Re-export parquet_format as `parquet::format`.
   ///
   /// Users are encouraged to use this, to avoid format mismatches.
   pub use parquet_format::{
       BsonType, ColumnOrder, DateType, DecimalType, EnumType, FileMetaData, IntType,
       JsonType, ListType, MapType, NullType, RowGroup, SchemaElement, StringType, TimeType,
       TimeUnit, TimestampType, UUIDType, TypeDefinedOrder, 
   };
   
   ```
   
   I was trying to avoid forcing a user to do this (https://github.com/delta-io/kafka-delta-ingest/blob/587a7ca5429f985876d3f6c4492519341f141e97/Cargo.toml#L19) in order to get the `parquet_format` structs (e.g. if using them in function signatures). In this crate, the user needs the column stats so they can write them to a metadata file.
   
   We could return `parquet::file::metadata::ParquetMetaData` (https://docs.rs/parquet/4.0.0/parquet/file/metadata/struct.ParquetMetaData.html#method.file_metadata), in which case we wouldn't need to expose `parquet_format::FileMetaData`. 
   The downside is that we need to then convert from `parquet_format::FileMetaData` to `parquet::file::metadata::ParquetMetaData`.
   
   Maybe a solution there is to `impl From<parquet_format::MetaData> for parquet::file::metadata::FileMetaData`, but that requires a bunch of other conversions for almost all the structs in that laundry list above.
   
   What options can you suggest going forward? I see:
   
   1. We do nothing. The onus is on users to check that they're using the same `parquet-format` version as our crate (if they want to use `parquet_format::FileMetaData` or its descendants.
   2. We implement `From` for a bunch of structs, and convert to `parquet::file::metadata::{ParquetMetaData|FileMetaData}`, then expose the latter.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jorgecarleitao commented on pull request #261: re-export parquet-format

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #261:
URL: https://github.com/apache/arrow-rs/pull/261#issuecomment-834058292


   Let me try to give an example:
   
   Say we expose all interfaces (`A`, `B`, `C`) from `parquet-format`, and that the `parquet`  crate only has public interfaces that consume `A` and `B` (e.g. the `FileMetadata` bit and something else).
   
   Some time later, `parquet-format` bumps major and breaks backward compatibility over `C`. Because we expose `C`, someone may be using `parquet::format::C` and thus if we bump `parquet-format` in our library, anyone using `C` will suffer, even if we do not care about `C`.
   
   If we had only exposed what our users do require to interact with our crate `parquet` e.g., `A` and `B`, we could have bumped the patch version to catch up with `parquet-format`, because there was no API changes from our users' perspective.
   
   This is what I was trying to say with the "public surface": my opinion is that if we need ABI compatibility over some dependency, because we consume that ABI, then we should definitely export that, like this PR is proposing, so that the consumers do not have depend on the same version of `parquet-format`. However, if we do not depend on a given `ABI` (the `C` above), then imo we should not expose it in our crate.
   
   Does this make sense?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jorgecarleitao edited a comment on pull request #261: re-export parquet-format

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #261:
URL: https://github.com/apache/arrow-rs/pull/261#issuecomment-834058292


   Let me try to give an example:
   
   Say we expose all interfaces (`A`, `B`, `C`) from `parquet-format`, and that the `parquet`  crate only has public interfaces that consume `A` and `B` (e.g. the `FileMetadata` bit and something else).
   
   Some time later, `parquet-format` bumps major and breaks backward compatibility over `C`. Because we expose `C`, someone may be using `parquet::format::C` and thus if we bump `parquet-format` in our library, anyone using `C` will suffer, even if we do not care about `C`, i.e. we have to bump major.
   
   If we had only exposed what our users do require to interact with our crate `parquet` e.g., `A` and `B`, we could have bumped the patch version to catch up with `parquet-format`, because there was no API changes from our users' perspective.
   
   This is what I was trying to say with the "public surface": my opinion is that if we need ABI compatibility over some dependency, because we consume that ABI, then we should definitely export that, like this PR is proposing, so that the consumers do not have depend on the same version of `parquet-format`. However, if we do not depend on a given `ABI` (the `C` above), then imo we should not expose it in our crate.
   
   Does this make sense?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me closed pull request #261: re-export parquet-format

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #261:
URL: https://github.com/apache/arrow-rs/pull/261


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me commented on pull request #261: re-export parquet-format

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #261:
URL: https://github.com/apache/arrow-rs/pull/261#issuecomment-834044510


   > Wouldn't it be better to only expose the bits that dependencies may require? Exposing everything significantly increases the "public" surface of this crate, thereby increasing the risk of backward incompatibility if/when parquet-format changes.
   
   I don't see it as a risk, because we actually want users to use the exact parquet-format version that is used on the crate. We lag on the format implementation, so there could also be some argument to make that advanced users might still need some of the structs that are exposed from the format.
   
   We're effectively making the compiled `parquet.thrift` output fully public.
   
   @jorgecarleitao is your preference still to expose only what we need as `parquet::format`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me commented on pull request #261: re-export parquet-format

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #261:
URL: https://github.com/apache/arrow-rs/pull/261#issuecomment-832755322


   CC @houqp @xianwill


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me commented on pull request #261: re-export parquet-format

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #261:
URL: https://github.com/apache/arrow-rs/pull/261#issuecomment-834098232


   > If this is too much of a complexity, let's then merge this as is :) I just think that we are making it harder for us long run, and exposing public stuff is usually a one-way street, so I was trying to reduce how much we go down that road.
   
   Yea, I understand this, and agree with you.
   
   May we please keep this on hold for now, I'll think of a solution based on our discussion and your suggestions.
   
   If we weren't constrained by not being able to implement traits for external types without a newtype approach, I would have preferred that we replace the Parquet equivalent structs with what's in the Parquet format crate.
   
   Defining our own structs has previously made it a bit difficult for a less experienced impelementer like me, to see what new functionality is enabled in a new parquet version. I prefer the "this broke because of a new field" kind of thing.
   
   A good example is `TIMESTAMP_NANO`, it took me very loooong to figure out that we needed to switch to `LogicalType` because had a mismatch in names :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org