You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2021/01/23 10:56:34 UTC

[GitHub] [parquet-format] nevi-me opened a new pull request #165: PARQUET-675: Specify Interval LogicalType

nevi-me opened a new pull request #165:
URL: https://github.com/apache/parquet-format/pull/165


   I am working on the Parquet Rust implementation, specifically conversion with Arrow.
   One of the outstanding items in the Parquet types is how to deal with interval types.
   
   This PR proposes adding `LogicalType::Interval(IntervalType)`, which is compatible with Arrow. I have only made changes to the thrift file, as I'd like to get feedback on viability, before documenting the behaviour in the LogicalTypes.md.
   Much of the detail is however below in this message.
   
   The legacy `ConvertedType` has `INTERVAL`, but this interval is ambiguous for Arrow, because Arrow defines:
   1. Interval::YearMonth: i32 representing the number of elapsed whole months
   2. Interval::DayTime: i64 stored as 2 contiguous 32-bit integers, representing the number of elapsed days and milliseconds, respectively
   
   @julienledem initially suggested deprecating `INTERVAL` by replacing it with 2 converted types in #43, but given that Parquet got logical types since then, we could offer a better alternative by using the `LogicalType::Interval(IntervalUnit)`alternative.
   
   This would either be 32-bit or 64-bit based on the interval unit.
   On the 64-bit representation, I'm not opinionated on whether we should use an INT64 or FiXED_LEN_BYTE_ARRAY(8). I suspect though that we initially used FIXED_LEN_BYTE_ARRAY(12) because there's no 96-bit primitive.
   
   # Backward Compatibility with ConvertedType
   
   We do not deprecate the `INTERVAL` converted type, as one could always convert the LogicalType value to either the first 4 bytes, or the last 8, depending on the IntervalUnit; and so write only those bytes.
   
   We would mark converting from ConvertedType to LogicalType as undefined behaviour, because without any additional information on which of the 12 bytes are populated, readers could lose information (what Rust is currently doing).
   
   implementations that rely on the old behaviour still have the option of populating both converted type and logical type, so they should not populate the logical type in this instance.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] houqp commented on pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #165:
URL: https://github.com/apache/parquet-format/pull/165#issuecomment-767105058


   I do agree that if this is something specific to Arrow, then we shouldn't impose it onto parquet. However, given that #43 was created to non-arrow use-case, it looks like the application of a more flexible interval type would have a broader impact?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] nevi-me commented on pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #165:
URL: https://github.com/apache/parquet-format/pull/165#issuecomment-767132481


   Thanks for the responses, I'll look at the interval vs duration support in more detail in the coming weeks, and rather put together a Google doc for discussion in the Parquet mailing list.
   
   > The existing interval type seems reasonable in parquet.
   
   It's reasonable for converted types, but has no equivalent logical type. While this is still fine as we have to write converted types in the metadata for backwards-compatibility, it still means that one has to write a 4-byte value into a 12-byte FLA in order for the converted type to make sense when being read by another application.
   
   The Arrow schema is binary, so in the long-run I wouldn't expect a non-Arrow reader to care about it. So, I was trying to approach adding INTERVAL to logical types, so that one could drop the INTERVAL converted type (as it wouldn't be correct to map Logical::INTERVAL to Converted::INTERVAL.
   
   I'll research this subject, read and write files with v2 of the writer, and come back with observations.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] emkornfield commented on pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #165:
URL: https://github.com/apache/parquet-format/pull/165#issuecomment-767097727


   @nevi-me I don't think we should be imposing arrow's modeling of interval type on parquet.  The existing interval type seems reasonable in parquet.  I think there are three cases to consider:
   1.  Arrow writing to parquet (this is well defined without data loss).
   2. Arrow Reading from a parquet file that was written with arrow.  In this case, the additional written arrow schema should be sufficient for decoding.
   3. Reading interval written from another implementation.  In this case, I think we probably want a user configurable options in arrow, as there can be a few options:
      1.  Read as Duration
      2.  Read as struct of both interval types.
      3. Read assuming as of the two interval types and truncate/error out data that doesn't fit in the type.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] nevi-me edited a comment on pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
nevi-me edited a comment on pull request #165:
URL: https://github.com/apache/parquet-format/pull/165#issuecomment-767132481


   Thanks for the responses, I'll look at the interval vs duration support in more detail in the coming weeks, and rather put together a Google doc for discussion in the Parquet mailing list.
   
   > The existing interval type seems reasonable in parquet.
   
   It's reasonable for converted types, but has no equivalent logical type. While this is still fine as we have to write converted types in the metadata for backwards-compatibility, it still means that one has to write a 4-byte value into a 12-byte FLA in order for the converted type to make sense when being read by another application.
   
   The Arrow schema is binary, so in the long-run I wouldn't expect a non-Arrow reader to care about it. So, I was trying to approach adding INTERVAL to logical types, so that one could drop the INTERVAL converted type, as it wouldn't be correct to map Logical::INTERVAL to Converted::INTERVAL.
   
   I'll research this subject, read and write files with v2 of the writer, and come back with observations.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] nevi-me commented on pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #165:
URL: https://github.com/apache/parquet-format/pull/165#issuecomment-767132481


   Thanks for the responses, I'll look at the interval vs duration support in more detail in the coming weeks, and rather put together a Google doc for discussion in the Parquet mailing list.
   
   > The existing interval type seems reasonable in parquet.
   
   It's reasonable for converted types, but has no equivalent logical type. While this is still fine as we have to write converted types in the metadata for backwards-compatibility, it still means that one has to write a 4-byte value into a 12-byte FLA in order for the converted type to make sense when being read by another application.
   
   The Arrow schema is binary, so in the long-run I wouldn't expect a non-Arrow reader to care about it. So, I was trying to approach adding INTERVAL to logical types, so that one could drop the INTERVAL converted type (as it wouldn't be correct to map Logical::INTERVAL to Converted::INTERVAL.
   
   I'll research this subject, read and write files with v2 of the writer, and come back with observations.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] nevi-me closed pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #165:
URL: https://github.com/apache/parquet-format/pull/165


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] nevi-me closed pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #165:
URL: https://github.com/apache/parquet-format/pull/165


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] emkornfield commented on pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #165:
URL: https://github.com/apache/parquet-format/pull/165#issuecomment-767114642


   If I read the conclusion from #43 it was to potentially store these as separate columns for day_time.  So if we really want to introduce these types I think not merging values together would be the way to go.  Also, at least from the Arrow C++ perspective, we don't have great support for the interval types (duration is the preferred type in C++ see discussion on https://github.com/apache/arrow/pull/3644)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] gszadovszky commented on pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on pull request #165:
URL: https://github.com/apache/parquet-format/pull/165#issuecomment-766730709


   @nevi-me, based on the previous PR it will require some time to agree on it. Starting a discussion in the dev list might also help as a heads up. (I do not have the required experience on the SQL standard and the different engines to have an opinion.)
   
   About this PR. We will certainly need an update in LogicalTypes.md as well. `parquet-format` has the java code generation included with some additional unit tests so it should be covered automatically. The java code for production is generation inside `parquet-mr` after the `parquet-format` version reference is upgraded. Because of that we will need a `parquet-format` release first.
   Currently, I am not sure why the checks are failing. It seems to be a travis issue but I am not sure. You may try a close/re-open on this PR to re-trigger the travis build.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] gszadovszky commented on pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on pull request #165:
URL: https://github.com/apache/parquet-format/pull/165#issuecomment-766730709


   @nevi-me, based on the previous PR it will require some time to agree on it. Starting a discussion in the dev list might also help as a heads up. (I do not have the required experience on the SQL standard and the different engines to have an opinion.)
   
   About this PR. We will certainly need an update in LogicalTypes.md as well. `parquet-format` has the java code generation included with some additional unit tests so it should be covered automatically. The java code for production is generation inside `parquet-mr` after the `parquet-format` version reference is upgraded. Because of that we will need a `parquet-format` release first.
   Currently, I am not sure why the checks are failing. It seems to be a travis issue but I am not sure. You may try a close/re-open on this PR to re-trigger the travis build.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] emkornfield commented on pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #165:
URL: https://github.com/apache/parquet-format/pull/165#issuecomment-862023416


   Maybe we should resurrect this to accomodate the new COMPLEX arrow type we are working on?  I think we also need backwards compatibility with the existing parquet interval type?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] nevi-me edited a comment on pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
nevi-me edited a comment on pull request #165:
URL: https://github.com/apache/parquet-format/pull/165#issuecomment-767132481


   Thanks for the responses, I'll look at the interval vs duration support in more detail in the coming weeks, and rather put together a Google doc for discussion in the Parquet mailing list.
   
   > The existing interval type seems reasonable in parquet.
   
   It's reasonable for converted types, but has no equivalent logical type. While this is still fine as we have to write converted types in the metadata for backwards-compatibility, it still means that one has to write a 4-byte value into a 12-byte FLA in order for the converted type to make sense when being read by another application.
   
   The Arrow schema is binary, so in the long-run I wouldn't expect a non-Arrow reader to care about it. So, I was trying to approach adding INTERVAL to logical types, so that one could drop the INTERVAL converted type, as it wouldn't be correct to map Logical::INTERVAL to Converted::INTERVAL.
   
   I'll research this subject, read and write files with v2 of the writer, and come back with observations.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] emkornfield commented on pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #165:
URL: https://github.com/apache/parquet-format/pull/165#issuecomment-767097727






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] houqp commented on pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #165:
URL: https://github.com/apache/parquet-format/pull/165#issuecomment-767105058


   I do agree that if this is something specific to Arrow, then we shouldn't impose it onto parquet. However, given that #43 was created to non-arrow use-case, it looks like the application of a more flexible interval type would have a broader impact?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] nevi-me commented on pull request #165: PARQUET-675: Specify Interval LogicalType

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #165:
URL: https://github.com/apache/parquet-format/pull/165#issuecomment-765905568


   I apologise if this is spammy, but I'd like to tag the following to bring it to your attention:
   
   @julienledem from working on the initial PR, and well ...
   @pitrou @pitrou @emkornfield from the C++ Arrow & Parquet side
   @sunchao from the Rust Parquet side
   
   This is my first time contributing to parquet-format, so I don't know if I need to generate the Java thrift files, but I can try to do that based on the guidance that I receive.
   
   Thanks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org