Posted to dev@arrow.apache.org by ro...@gmail.com on 2019/10/28 10:55:59 UTC

State of decimal support in Arrow (from/to Parquet Decimal LogicalType)

Hi everyone,

 

I have a question about the state of decimal support in Arrow when reading
from/writing to Parquet.

*	Is writing decimals to Parquet supposed to work? Are there any
examples of how to do this in C++?
*	When reading decimals from a Parquet file with pyarrow and converting
the resulting table to a pandas dataframe, the datatype of the cells is
"object". As a consequence, performance when doing analysis on this table is
suboptimal. Can I somehow get the decimals from the Parquet file directly
into floats/doubles in a pandas dataframe?

 

Thanks in advance,

Roman

 


RE: State of decimal support in Arrow (from/to Parquet Decimal LogicalType)

Posted by ro...@gmail.com.
Hi Wes,

the data is indeed not originating from Arrow, so I was looking for how to call the low-level WriteBatch API. I figured it out now; it's actually straightforward in the Arrow API. I just got a little confused by the spec at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#DECIMAL

So for future reference: I multiply each value in a floating-point array by pow(10, scale) and pass the resulting array (in my case: int32_t) directly to WriteBatch().
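
To make that concrete, here is a minimal sketch of what I ended up with (column name, file name, and precision/scale are made up, and it assumes a recent parquet-cpp API, so details may differ between versions):

#include <cmath>
#include <memory>
#include <vector>

#include <arrow/io/file.h>
#include <parquet/api/writer.h>
#include <parquet/exception.h>

int main() {
  const int precision = 9;  // DECIMAL stored as INT32 requires precision <= 9
  const int scale = 2;

  // Schema with a single required DECIMAL(9, 2) column, physical type INT32.
  parquet::schema::NodeVector fields;
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "price", parquet::Repetition::REQUIRED,
      parquet::LogicalType::Decimal(precision, scale), parquet::Type::INT32));
  auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
      parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED,
                                       fields));

  std::shared_ptr<arrow::io::FileOutputStream> out;
  PARQUET_ASSIGN_OR_THROW(
      out, arrow::io::FileOutputStream::Open("decimals.parquet"));
  auto file_writer = parquet::ParquetFileWriter::Open(out, schema);
  auto* rg_writer = file_writer->AppendRowGroup();
  auto* col_writer =
      static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());

  // Scale the floating-point values to unscaled integers: 12.34 -> 1234.
  std::vector<double> values = {12.34, 0.99, 100.0};
  std::vector<int32_t> unscaled;
  for (double v : values) {
    unscaled.push_back(
        static_cast<int32_t>(std::lround(v * std::pow(10.0, scale))));
  }
  col_writer->WriteBatch(static_cast<int64_t>(unscaled.size()),
                         nullptr, nullptr, unscaled.data());

  file_writer->Close();
  return 0;
}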

One thing that could make the API a little easier to use: a function that directly takes an array of floats or doubles and does the conversion internally. But it's not strictly needed, so it's probably not worth adding.

Thanks for your help and sorry for the annoyance,
Roman




Re: State of decimal support in Arrow (from/to Parquet Decimal LogicalType)

Posted by Wes McKinney <we...@gmail.com>.
It depends on the origin of your data.

If your data is not originating from Arrow, then it may be better to
produce an array of FixedLenByteArray values and pass that to the
low-level WriteBatch API. If you would like some other API, please feel
free to propose something.
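
For illustration, a rough sketch of that encoding (EncodeBigEndian is a hypothetical helper, not a parquet-cpp API; the Parquet DECIMAL spec stores the unscaled value as big-endian two's complement):

#include <cstdint>
#include <vector>

#include <parquet/types.h>  // parquet::FixedLenByteArray

// Hypothetical helper: encode a 64-bit unscaled decimal value as a
// big-endian two's-complement byte string of the column's fixed length.
std::vector<uint8_t> EncodeBigEndian(int64_t unscaled, int byte_width) {
  std::vector<uint8_t> out(byte_width);
  for (int i = byte_width - 1; i >= 0; --i) {
    out[i] = static_cast<uint8_t>(unscaled & 0xFF);
    unscaled >>= 8;  // arithmetic shift sign-extends negative values
  }
  return out;
}

// Usage sketch: the FLBA values only hold pointers, so the encoded
// buffers must stay alive until WriteBatch() returns.
//
//   std::vector<std::vector<uint8_t>> buffers;
//   std::vector<parquet::FixedLenByteArray> flba_values;
//   for (int64_t v : unscaled_values) {
//     buffers.push_back(EncodeBigEndian(v, byte_width));
//     flba_values.emplace_back(buffers.back().data());
//   }
//   flba_writer->WriteBatch(flba_values.size(), nullptr, nullptr,
//                           flba_values.data());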


RE: State of decimal support in Arrow (from/to Parquet Decimal LogicalType)

Posted by ro...@gmail.com.
Hi Wes,

that was a bit unclear, sorry for that. With "an array", I'm referring to a plain C++ array, i.e. an array of float, uint32_t, ...
This means that I do not use the arrow::Array-based write API; I use the TypedColumnWriter::WriteBatch() function directly and do not have any Arrow arrays. Are there any advantages to using arrow::Arrays instead of calling WriteBatch() directly?

Thanks,
Roman



Re: State of decimal support in Arrow (from/to Parquet Decimal LogicalType)

Posted by Wes McKinney <we...@gmail.com>.
On Tue, Oct 29, 2019 at 3:11 AM <ro...@gmail.com> wrote:
>
> Hi Wes,
>
> thanks for the response. There's one thing that is still a little unclear to me:
> I had a look at the code for function WriteArrowSerialize<FLBAType, arrow::Decimal128Type> in the reference you provided. I don't have arrow data in the first place, but as I understand it, I need to have an array of FixedLenByteArrays objects which then point to the actual decimal values in the big_endian_values buffer. Is this the only way to write decimal types or is it also possible to directly provide an array with values to writeBatch()?
>

Could you clarify what you mean by "an array"? If you use the
arrow::Array-based write API then it will invoke this serializer
specialization

https://github.com/apache/arrow/blob/46cdf557eb710f17f71a10609e5f497ca585ae1c/cpp/src/parquet/column_writer.cc#L1569

That's what we're calling (if I'm not mistaken, since I just worked on
this code recently) when writing arrow::Decimal128Array. If you set a
breakpoint there with gdb, you can see the call stack.
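
For reference, here is a minimal sketch of that Arrow-based path (column name, values, and file name are made up; assumes a recent Arrow C++ API):

#include <memory>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

arrow::Status WriteDecimalTable() {
  auto type = arrow::decimal128(/*precision=*/10, /*scale=*/2);

  // Build a Decimal128Array; the string constructor parses "123.45"
  // into the unscaled value 12345.
  arrow::Decimal128Builder builder(type);
  ARROW_RETURN_NOT_OK(builder.Append(arrow::Decimal128("123.45")));
  ARROW_RETURN_NOT_OK(builder.Append(arrow::Decimal128("-0.01")));
  std::shared_ptr<arrow::Array> array;
  ARROW_RETURN_NOT_OK(builder.Finish(&array));

  auto table = arrow::Table::Make(
      arrow::schema({arrow::field("amount", type)}), {array});

  ARROW_ASSIGN_OR_RAISE(auto out,
                        arrow::io::FileOutputStream::Open("decimals.parquet"));
  // This goes through the serializer specialization linked above.
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), out,
                                    /*chunk_size=*/1024);
}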


RE: State of decimal support in Arrow (from/to Parquet Decimal LogicalType)

Posted by ro...@gmail.com.
Hi Wes,

thanks for the response. There's one thing that is still a little unclear to me:
I had a look at the code for the function WriteArrowSerialize<FLBAType, arrow::Decimal128Type> in the reference you provided. I don't have Arrow data in the first place, but as I understand it, I need to have an array of FixedLenByteArray objects which then point to the actual decimal values in the big_endian_values buffer. Is this the only way to write decimal types, or is it also possible to directly provide an array with values to WriteBatch()?

For the issues, I also found https://issues.apache.org/jira/browse/ARROW-6990, but I'm not sure whether it's related to the issues you created.

Thanks,
Roman



Re: State of decimal support in Arrow (from/to Parquet Decimal LogicalType)

Posted by Wes McKinney <we...@gmail.com>.
hi Roman,

On Mon, Oct 28, 2019 at 5:56 AM <ro...@gmail.com> wrote:
>
> Hi everyone,
>
>
>
> I have a question about the state of decimal support in Arrow when reading
> from/writing to Parquet.
>
> *       Is writing decimals to parquet supposed to work? Are there any
> examples on how to do this in C++?

Yes, it's supported; the details are here:

https://github.com/apache/arrow/blob/46cdf557eb710f17f71a10609e5f497ca585ae1c/cpp/src/parquet/column_writer.cc#L1511

> *       When reading decimals in a parquet file with pyarrow and converting
> the resulting table to a pandas dataframe, datatype in the cells is
> "object". As a consequence, performance when doing analysis on this table is
> suboptimal. Can I somehow directly get the decimals from the parquet file
> into floats/doubles in a pandas dataframe?

Some work will be required. The cleanest way would be to cast
decimal128 columns to float32/float64 prior to converting to pandas.

I didn't see an issue for this right away so I opened

https://issues.apache.org/jira/browse/ARROW-7010

I also opened

https://issues.apache.org/jira/browse/ARROW-7011

about going the other way. This would be a useful thing to contribute
to the project.

Thanks
Wes
