Posted to jira@arrow.apache.org by "Neville Dipale (Jira)" <ji...@apache.org> on 2020/12/19 15:53:00 UTC

[jira] [Commented] (ARROW-6780) [C++][Parquet] Support DurationType in writing/reading parquet

    [ https://issues.apache.org/jira/browse/ARROW-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17252207#comment-17252207 ] 

Neville Dipale commented on ARROW-6780:
---------------------------------------

Hi [~jorisvandenbossche], I'm having a similar issue/dilemma on the Rust side.

Given that we serialize the Arrow schema and store it in the Parquet metadata, it becomes easier to write intervals as FIXED_LEN_BYTE_ARRAY. On the read side, the Arrow schema tells us which IntervalUnit to use.

The problem arises when we read an interval from a file that has no Arrow schema, because nothing then tells us which IntervalUnit was written. I think it'd be the same with the Duration type.
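
To make the ambiguity concrete, here is a minimal Python sketch. It is not the Rust implementation. It assumes the 12-byte Parquet INTERVAL layout (three little-endian unsigned 32-bit values: months, days, milliseconds). The helper names `encode_interval` and `decode_interval`, and the way each unit is mapped onto the three fields, are hypothetical and only for illustration.

```python
import struct

# Parquet INTERVAL layout: fixed_len_byte_array(12) holding three
# little-endian unsigned 32-bit values (months, days, milliseconds).
_LAYOUT = struct.Struct("<3I")


def encode_interval(unit, value):
    """Pack an Arrow-style interval into 12 bytes (non-negative values only).

    Hypothetical mapping: YEAR_MONTH fills the months field,
    DAY_TIME fills the days and milliseconds fields.
    """
    if unit == "YEAR_MONTH":
        return _LAYOUT.pack(value, 0, 0)
    if unit == "DAY_TIME":
        days, millis = value
        return _LAYOUT.pack(0, days, millis)
    raise ValueError(f"unsupported interval unit: {unit}")


def decode_interval(raw, unit=None):
    """Unpack 12 bytes, using the unit from the Arrow schema if available."""
    months, days, millis = _LAYOUT.unpack(raw)
    if unit == "YEAR_MONTH":
        return months
    if unit == "DAY_TIME":
        return (days, millis)
    # No Arrow schema: the unit is unknown, so all three fields are
    # returned and the caller has to guess the intended interval type.
    return (months, days, millis)
```

With a unit hint the value round-trips; without one, the reader only gets the raw triple and cannot tell which IntervalUnit the writer intended.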

I've looked at various JIRAs here, and saw that Pandas stores Intervals as an extension array with nested storage (https://issues.apache.org/jira/browse/ARROW-9078).

Given that the Duration type is not composite, how about we store it as an INT32 or INT64 depending on the resolution, then rely on `ARROW:schema` to roundtrip it correctly? CC [~emkornfield] as you've recently worked on this part of the C++ impl.
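
As a rough illustration of that idea (a toy model, not the Rust or C++ writer), the Python sketch below stores the duration values as plain 64-bit integers and keeps the unit in file-level metadata under the `ARROW:schema` key. The key name is real; everything else is an assumption for illustration: the functions `write_duration_column` and `read_column` are hypothetical, and the JSON payload is a stand-in for the real value, which is a serialized Arrow schema rather than JSON.

```python
import json

SCHEMA_KEY = "ARROW:schema"  # real key name; the JSON payload below is a stand-in


def write_duration_column(name, values, unit):
    """Return (physical_column, file_metadata) for a duration column.

    The values are stored as bare integers; the logical type lives only
    in the metadata hint.
    """
    column = {"physical_type": "INT64", "values": list(values)}
    hint = {name: {"type": "duration", "unit": unit}}
    return column, {SCHEMA_KEY: json.dumps(hint)}


def read_column(name, column, metadata):
    """Restore the logical type from the hint, or fall back to int64."""
    hint = json.loads(metadata.get(SCHEMA_KEY, "{}")).get(name)
    if hint and hint["type"] == "duration":
        return ("duration[%s]" % hint["unit"], column["values"])
    # No stored Arrow schema: the reader only sees integers.
    return ("int64", column["values"])


column, metadata = write_duration_column("a", [1, 2], "s")
with_schema = read_column("a", column, metadata)
without_schema = read_column("a", column, {})
```

Here `with_schema` comes back as a duration in seconds, while `without_schema` degrades to plain integers. That fallback is the trade-off of this approach: readers that ignore the stored schema still get usable values, just without the unit.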

> [C++][Parquet] Support DurationType in writing/reading parquet
> --------------------------------------------------------------
>
>                 Key: ARROW-6780
>                 URL: https://issues.apache.org/jira/browse/ARROW-6780
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: parquet
>
> Currently this is not supported:
> {code}
> In [37]: table = pa.table({'a': pa.array([1, 2], pa.duration('s'))}) 
> In [39]: table
> Out[39]: 
> pyarrow.Table
> a: duration[s]
> In [41]: pq.write_table(table, 'test_duration.parquet')
> ...
> ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: duration[s]
> {code}
> There is no direct mapping to Parquet logical types. There is an INTERVAL type, but it more closely matches Arrow's (YEAR_MONTH or DAY_TIME) interval type.
> But, those duration values could be stored as just integers, and based on the serialized arrow schema, it could be restored when reading back in.


