You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/14 16:35:55 UTC

[GitHub] [arrow] bkietz opened a new pull request #8462: ARROW-10145: [C++][Dataset] Assert integer overflow in partitioning falls back to string

bkietz opened a new pull request #8462:
URL: https://github.com/apache/arrow/pull/8462


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #8462: ARROW-10145: [C++][Dataset] Assert integer overflow in partitioning falls back to string

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #8462:
URL: https://github.com/apache/arrow/pull/8462#issuecomment-708528484


   https://issues.apache.org/jira/browse/ARROW-10145


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on pull request #8462: ARROW-10145: [C++][Dataset] Assert integer overflow in partitioning falls back to string

Posted by GitBox <gi...@apache.org>.

bkietz commented on pull request #8462:
URL: https://github.com/apache/arrow/pull/8462#issuecomment-709310353


   I think it's not worthwhile to fallback to `int64`. The only use cases I can think of for long integers in a partition column are
   - A low cardinality* set of externally derived identifiers which happen to be numeric. Arithmetic operations on such identifiers would not be meaningful, so there is no benefit to making that column integral
   - Year stored as nanoseconds since the epoch (or similar int-stored-as-multiple, int-stored-with-offset situation). In this situation I'd advise the user to derive a new column which stores the year more meaningfully
   
   * If the column has high cardinality then it's not really a candidate for partitioning in the first place.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz edited a comment on pull request #8462: ARROW-10145: [C++][Dataset] Assert integer overflow in partitioning falls back to string

Posted by GitBox <gi...@apache.org>.

bkietz edited a comment on pull request #8462:
URL: https://github.com/apache/arrow/pull/8462#issuecomment-709310353


   I think it's not worthwhile to fallback to `int64`. The only use cases I can think of for long integers in a partition column are
   - A low cardinality`*` set of externally derived identifiers which happen to be numeric. Arithmetic operations on such identifiers would not be meaningful, so there is no benefit to making that column integral
   - Year stored as nanoseconds since the epoch (or similar int-stored-as-multiple, int-stored-with-offset situation). In this situation I'd advise the user to derive a new column which stores the year more meaningfully
   
   `*` If the column has high cardinality then it's not really a candidate for partitioning in the first place.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche closed pull request #8462: ARROW-10145: [C++][Dataset] Assert integer overflow in partitioning falls back to string

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche closed pull request #8462:
URL: https://github.com/apache/arrow/pull/8462


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #8462: ARROW-10145: [C++][Dataset] Assert integer overflow in partitioning falls back to string

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on pull request #8462:
URL: https://github.com/apache/arrow/pull/8462#issuecomment-709065744


   Thanks! So this was already fixed in partition type simplification PR (https://github.com/apache/arrow/pull/8367). The added tests look good.
   
   Now, falling back to string is certainly better as raising an error. But I am wondering: do we want to do more inference than just int32 and string? Could we infer int64 here? That might quickly get complicated / a can of worms ?
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #8462: ARROW-10145: [C++][Dataset] Assert integer overflow in partitioning falls back to string

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on pull request #8462:
URL: https://github.com/apache/arrow/pull/8462#issuecomment-709390751


   >  to derive a new column which stores the year more meaningfully
   
   But to derive that, you might need to cast the numbers/strings first to ints, to be able to do the calculation to timestamps? 
   
   But anyway, your arguments are convincing that this is probably not a typical use case. And, the user can still provide a manual schema is they really want to). So this is fine for me!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org