Posted to dev@arrow.apache.org by Colin Nichols <co...@narrativ.com> on 2018/04/18 03:11:58 UTC

[Py] writing 2- or 4-byte decimal columns to Parquet

Hi there,

I know (py)arrow has the decimal128() type, and using this type it's easy
to take an array of Python Decimals, convert to a pa.array, and write out
to Parquet.

In the absence (as far as I can tell) of decimal32 and decimal64 types, is
it possible to take an array of Decimals (with compatible precision/scale)
and write them to a Parquet column of 32- or 64-bit width?

Relevant parquet spec --
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

I'm looking to add this functionality to the Spectrify project, as AWS
Redshift Spectrum will not query unnecessarily wide DECIMAL columns --
https://github.com/hellonarrativ/spectrify/issues/14

Thanks,
Colin

Re: [Py] writing 2- or 4-byte decimal columns to Parquet

Posted by Colin Nichols <co...@narrativ.com>.
Wes & Phillip, thank you both for investigating.  Really interesting --
as far as I can tell I should already be benefiting from the column-width
shrinking, but the error messages I'm seeing from Redshift Spectrum suggest
otherwise; more info at https://github.com/hellonarrativ/spectrify/issues/14

I'm probably doing something silly; hopefully I can help improve the docs
at least. :) Probably best to continue the conversation in the Spectrify
issue until there's more info.

Best,
Colin


Re: [Py] writing 2- or 4-byte decimal columns to Parquet

Posted by Phillip Cloud <cp...@gmail.com>.
That's right. Shrinking happens here:
https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L808-L809


Re: [Py] writing 2- or 4-byte decimal columns to Parquet

Posted by Wes McKinney <we...@gmail.com>.
We do "shrink" the input 128-bit decimals to the smallest number of
bytes that fits, though; is that right?

https://github.com/apache/parquet-cpp/blob/c405bf36506ec584e8009a6d53349277e600467d/src/parquet/arrow/schema.cc#L635


Re: [Py] writing 2- or 4-byte decimal columns to Parquet

Posted by Phillip Cloud <cp...@gmail.com>.
Hi Colin,

Only 128-bit decimal writing is supported right now. Feel free to open a
JIRA about this.


Re: [Py] writing 2- or 4-byte decimal columns to Parquet

Posted by Wes McKinney <we...@gmail.com>.
hi Colin,

Phillip Cloud is the expert on this topic, but I believe we only
support writing decimals to the FIXED_LEN_BYTE_ARRAY physical type in
Parquet right now:

https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L798

The size of the type depends on the decimal precision, so if the values
fit in 4 or 8 bytes, we write a FIXED_LEN_BYTE_ARRAY of that width.
Writing to the INT32 or INT64 physical types would be more complicated
and would require some work in parquet-cpp.

- Wes
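The precision-to-width rule Wes describes can be sketched as follows (an editorial sketch of the arithmetic, not the actual parquet-cpp code): pick the smallest byte width n whose signed two's-complement range covers 10^precision - 1.

```python
# Sketch (not the parquet-cpp implementation): smallest FIXED_LEN_BYTE_ARRAY
# width whose signed two's-complement range covers a given decimal precision.
def min_decimal_bytes(precision: int) -> int:
    for n in range(1, 17):  # decimal128 needs at most 16 bytes
        if (1 << (8 * n - 1)) - 1 >= 10 ** precision - 1:
            return n
    raise ValueError("precision too large for a 128-bit decimal")

# e.g. precision 9 -> 4 bytes, precision 18 -> 8 bytes, precision 38 -> 16 bytes
```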


Re: [Py] writing 2- or 4-byte decimal columns to Parquet

Posted by Colin Nichols <co...@narrativ.com>.
Hi all,

Any thoughts on the below?  I did a little more code browsing, and I'm not
sure this is supported right now; should I open a JIRA ticket?

- Colin
