Posted to dev@spark.apache.org by Michael Allman <mi...@videoamp.com> on 2014/10/12 22:51:21 UTC

reading/writing parquet decimal type

Hello,

I'm interested in reading and writing SchemaRDDs to/from Parquet with support for the Parquet DECIMAL converted type. The first thing I did was update Spark's Parquet dependency to version 1.5.0, since that version introduced support for decimals in Parquet. However, conversion between the Catalyst decimal type and the Parquet decimal type is complicated by the fact that the Catalyst type does not specify a precision and scale, while the Parquet type requires both.

Could we add an optional precision and scale to the Catalyst decimal type? By default the precision and scale would remain unspecified, for backwards compatibility, but users who want to serialize a SchemaRDD containing decimals to Parquet would have to narrow their decimal types by specifying a precision and scale.
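To make the idea concrete, here is a rough sketch of the kind of change I have in mind (the names below are hypothetical, not the existing Catalyst classes, and this is an illustration rather than a patch):

    // Hypothetical sketch only -- not the actual Catalyst DecimalType.
    // Precision and scale are optional, so existing code that never
    // specifies them keeps working unchanged.
    case class DecimalTypeSketch(precisionInfo: Option[(Int, Int)] = None) {
      def isUnspecified: Boolean = precisionInfo.isEmpty

      // Only a "narrowed" decimal can map onto Parquet's DECIMAL converted
      // type, which requires both a precision and a scale.
      def parquetAnnotation: String = precisionInfo match {
        case Some((precision, scale)) => s"DECIMAL($precision, $scale)"
        case None => throw new UnsupportedOperationException(
          "specify a precision and scale before writing decimals to Parquet")
      }
    }

    object DecimalTypeSketch {
      // Convenience constructor, e.g. DecimalTypeSketch(10, 2).
      def apply(precision: Int, scale: Int): DecimalTypeSketch =
        DecimalTypeSketch(Some((precision, scale)))
    }

The point is just that precision and scale stay optional for existing code paths, while the Parquet writer can insist on a narrowed type.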

Thoughts?

Michael


Re: reading/writing parquet decimal type

Posted by Michael Allman <mi...@videoamp.com>.
Hi Matei,

Another thing occurred to me. Will the binary format you're writing sort the data in numeric order? Or would the decimals have to be decoded for comparison?
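To illustrate the concern (a sketch of my own, assuming the big-endian two's-complement layout the Parquet format spec describes for DECIMAL, which may or may not match your encoding): comparing the raw bytes as unsigned values does not give numeric order once negative values are involved.

    // Encode an unscaled value as fixed-width big-endian two's-complement
    // bytes, then compare the raw bytes as unsigned values.
    def encode(unscaled: Long, width: Int): Array[Byte] =
      Array.tabulate(width)(i => (unscaled >> (8 * (width - 1 - i))).toByte)

    def unsignedCompare(a: Array[Byte], b: Array[Byte]): Int =
      a.zip(b).map { case (x, y) => (x & 0xff) - (y & 0xff) }
        .find(_ != 0).getOrElse(0)

    // -1 encodes as 0xFF..FF, which compares *greater* than +1 (0x00..01),
    // so raw byte order is not numeric order unless the sign is handled.
    val cmp = unsignedCompare(encode(-1L, 8), encode(1L, 8))  // > 0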

Cheers,

Michael


> On Oct 12, 2014, at 10:48 PM, Matei Zaharia <ma...@gmail.com> wrote:
> 
> The fixed-length binary type can hold fewer bytes than an int64, though many encodings of int64 can probably do the right thing. We can look into supporting multiple ways to do this -- the spec does say that you should at least be able to read int32s and int64s.
> 
> Matei




Re: reading/writing parquet decimal type

Posted by Matei Zaharia <ma...@gmail.com>.
The fixed-length binary type can hold fewer bytes than an int64, though many encodings of int64 can probably do the right thing. We can look into supporting multiple ways to do this -- the spec does say that you should at least be able to read int32s and int64s.
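For a sense of the sizes involved, here is a rough sketch (not code from the branch) of the minimal number of two's-complement bytes needed for a given decimal precision:

    // Minimal number of two's-complement bytes needed to hold any unscaled
    // value of the given precision, i.e. magnitudes up to 10^precision - 1.
    def minBytesForPrecision(precision: Int): Int = {
      val maxUnscaled = BigInt(10).pow(precision) - 1
      (maxUnscaled.bitLength + 1 + 7) / 8   // +1 for the sign bit
    }

    // precision 1-2 -> 1 byte, 9 -> 4 bytes, 18 -> 8 bytes, so small
    // precisions fit in fewer bytes than an int64 would take.
    val sizes = (1 to 18).map(p => p -> minBytesForPrecision(p))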

Matei

On Oct 12, 2014, at 8:20 PM, Michael Allman <mi...@videoamp.com> wrote:

> Hi Matei,
> 
> Thanks, I can see you've been hard at work on this! I examined your patch and do have a question. It appears you're limiting the precision of decimals written to parquet to those that will fit in a long, yet you're writing the values as a parquet binary type. Why not write them using the int64 parquet type instead?
> 
> Cheers,
> 
> Michael
> 




Re: reading/writing parquet decimal type

Posted by Michael Allman <mi...@videoamp.com>.
Hi Matei,

Thanks, I can see you've been hard at work on this! I examined your patch and do have a question. It appears you're limiting the precision of decimals written to parquet to those that will fit in a long, yet you're writing the values as a parquet binary type. Why not write them using the int64 parquet type instead?
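To spell out what I mean (a rough sketch of my own, not your writer code), the same Long-sized unscaled value could go out either way:

    // A decimal whose unscaled value fits in a Long.
    val d = new java.math.BigDecimal("12345.67")    // precision 7, scale 2
    val unscaled: Long = d.unscaledValue.longValueExact

    // Option A: write the unscaled value directly as a Parquet int64.
    val asInt64: Long = unscaled

    // Option B: pack it into a fixed-length binary field, big-endian
    // two's-complement, sized from the precision (4 bytes for precision 7).
    def toFixedBytes(v: Long, width: Int): Array[Byte] =
      Array.tabulate(width)(i => (v >> (8 * (width - 1 - i))).toByte)
    val asBinary: Array[Byte] = toFixedBytes(unscaled, 4)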

Cheers,

Michael

On Oct 12, 2014, at 3:32 PM, Matei Zaharia <ma...@gmail.com> wrote:

> Hi Michael,
> 
> I've been working on this in my repo: https://github.com/mateiz/spark/tree/decimal. I'll make some pull requests with these features soon, but meanwhile you can try this branch. See https://github.com/mateiz/spark/compare/decimal for the individual commits that went into it. It has exactly the precision stuff you need, plus some optimizations for working on decimals.
> 
> Matei




Re: reading/writing parquet decimal type

Posted by Matei Zaharia <ma...@gmail.com>.
Hi Michael,

I've been working on this in my repo: https://github.com/mateiz/spark/tree/decimal. I'll make some pull requests with these features soon, but meanwhile you can try this branch. See https://github.com/mateiz/spark/compare/decimal for the individual commits that went into it. It has exactly the precision stuff you need, plus some optimizations for working on decimals.

Matei

On Oct 12, 2014, at 1:51 PM, Michael Allman <mi...@videoamp.com> wrote:

> Hello,
> 
> I'm interested in reading and writing SchemaRDDs to/from Parquet with support for the Parquet DECIMAL converted type. The first thing I did was update Spark's Parquet dependency to version 1.5.0, since that version introduced support for decimals in Parquet. However, conversion between the Catalyst decimal type and the Parquet decimal type is complicated by the fact that the Catalyst type does not specify a precision and scale, while the Parquet type requires both.
> 
> Could we add an optional precision and scale to the Catalyst decimal type? By default the precision and scale would remain unspecified, for backwards compatibility, but users who want to serialize a SchemaRDD containing decimals to Parquet would have to narrow their decimal types by specifying a precision and scale.
> 
> Thoughts?
> 
> Michael


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org