Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/09/30 16:00:00 UTC

[jira] [Commented] (ARROW-10145) [C++][Dataset] Integer-like partition field values outside int32 range error on reading

    [ https://issues.apache.org/jira/browse/ARROW-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204836#comment-17204836 ] 

Joris Van den Bossche commented on ARROW-10145:
-----------------------------------------------

I think we should at least fall back to string instead of raising an error like this (as long as we only support reading partition fields as string or int32). In the long term, we should expand the supported types.
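
To make the proposed fallback concrete, here is a minimal pure-Python sketch of the idea. It is not Arrow's C++ implementation; the function name `infer_partition_type` and the string type labels are hypothetical, chosen only for illustration. A partition field is inferred as int32 only when every value parses as an integer within the int32 range, and anything else falls back to string rather than raising.

```python
# Hypothetical illustration of the fallback; not the Arrow dataset code.
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def infer_partition_type(values):
    """Return "int32" if every value is an in-range integer, else "string"."""
    for value in values:
        try:
            number = int(value)
        except ValueError:
            # Not integer-like at all: read the field as string.
            return "string"
        if not INT32_MIN <= number <= INT32_MAX:
            # Integer-like but outside int32: fall back instead of raising.
            return "string"
    return "int32"


print(infer_partition_type(["1", "2"]))      # int32
print(infer_partition_type(["3760212050"]))  # string
```

With this rule the value from the reproducer below, 3760212050, would be read as a string partition field instead of producing an ArrowInvalid error.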

> [C++][Dataset] Integer-like partition field values outside int32 range error on reading
> ---------------------------------------------------------------------------------------
>
>                 Key: ARROW-10145
>                 URL: https://issues.apache.org/jira/browse/ARROW-10145
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>
> From https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset
> Small reproducer:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'part': [3760212050]*10, 'col': range(10)})
> pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part'])
> In [35]: pq.read_table("test_int64_partition/")
> ...
> ArrowInvalid: error parsing '3760212050' as scalar of type int32
> In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this)
> In ../src/arrow/dataset/partition.cc, line 218, code: (_error_or_value26).status()
> In ../src/arrow/dataset/partition.cc, line 229, code: (_error_or_value27).status()
> In ../src/arrow/dataset/discovery.cc, line 256, code: (_error_or_value17).status()
> In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True)
> Out[36]: 
> pyarrow.Table
> col: int64
> part: dictionary<values=int64, indices=int32, ordered=0>
> {code}


