Posted to dev@parquet.apache.org by Cheng Lian <li...@gmail.com> on 2015/06/24 21:34:49 UTC

Is INT96 just an alias of FIXED_LENGTH_BYTE_ARRAY(12)?

Hey Parquet devs,

It seems that in parquet-mr, INT96 is always treated as 
FIXED_LENGTH_BYTE_ARRAY(12). Is it reasonable to say that INT96 is just 
a convenient alias of FIXED_LENGTH_BYTE_ARRAY(12)? Are there any 
semantic or performance differences? So far, the only case where I have 
found INT96 useful is representing timestamps with nanosecond precision 
in Impala. Did I miss something here?

Best,
Cheng
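
For readers unfamiliar with the layout in question: the Impala-style INT96 
timestamp is commonly described as 12 bytes holding an 8-byte little-endian 
count of nanoseconds within the day, followed by a 4-byte little-endian 
Julian day number. The sketch below decodes and encodes a value under that 
assumption. The function names are illustrative only and are not part of 
parquet-mr or any other library.

```python
import struct

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588
NANOS_PER_DAY = 86_400 * 1_000_000_000


def encode_int96_timestamp(epoch_nanos: int) -> bytes:
    """Pack nanoseconds since the Unix epoch into the assumed 12-byte layout."""
    # divmod floors, so nanos_of_day stays in [0, NANOS_PER_DAY) even for
    # instants before 1970.
    days, nanos_of_day = divmod(epoch_nanos, NANOS_PER_DAY)
    # "<qi" = little-endian, no padding: int64 nanos of day, then int32 Julian day.
    return struct.pack("<qi", nanos_of_day, days + JULIAN_DAY_OF_UNIX_EPOCH)


def decode_int96_timestamp(raw: bytes) -> int:
    """Return nanoseconds since the Unix epoch for a 12-byte INT96 value."""
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day
```

Nothing in the bytes themselves marks the value as a timestamp, which is 
why a reader has to know by convention that an INT96 column means this.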

Re: Is INT96 just an alias of FIXED_LENGTH_BYTE_ARRAY(12)?

Posted by Cheng Lian <li...@gmail.com>.

Yeah, the initial nanosecond timestamp support in Spark SQL follows 
Impala and uses INT96 for interoperability with it. In Spark 
1.5.0-SNAPSHOT (the current master branch), we still write timestamps 
as INT96, but internally Spark SQL uses only a LONG to represent 
timestamps for better performance. The cost is that the precision is 
lowered to 100ns.
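
To make the trade-off concrete, here is a rough sketch of what storing a 
timestamp as a signed 64-bit count of 100ns units implies. The helper names 
are hypothetical, and the use of floor division (truncation) is an 
assumption; this is not Spark's actual conversion code.

```python
NANOS_PER_TICK = 100  # one LONG unit = 100ns, per the message above


def nanos_to_ticks(epoch_nanos: int) -> int:
    """Drop the last two decimal digits of nanosecond precision (assumed truncation)."""
    return epoch_nanos // NANOS_PER_TICK


def ticks_to_nanos(ticks: int) -> int:
    """Convert back; the discarded sub-100ns part cannot be recovered."""
    return ticks * NANOS_PER_TICK


def int64_range_years(nanos_per_unit: int) -> float:
    """Approximate span, in years on each side of the epoch, of a signed 64-bit counter."""
    seconds = (2**63 - 1) * nanos_per_unit / 1e9
    return seconds / (365.2425 * 86_400)


original = 123_456_789
restored = ticks_to_nanos(nanos_to_ticks(original))  # 123_456_700
```

A 64-bit count of single nanoseconds covers only about 292 years on either 
side of the epoch, while a count of 100ns units covers roughly 29,000 years, 
at the price of the lost digits shown above.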

Since INT96 is being deprecated, what is the suggested or planned way to 
read and write nanosecond-precision timestamps? Spark SQL, Hive, and 
Impala all have a nanosecond timestamp type, while the Parquet format 
spec doesn't include one (only TIMESTAMP_MILLIS and TIMESTAMP_MICROS are 
available for now). Should we add a TIMESTAMP_NANOS annotation over 
FIXED_LENGTH_BYTE_ARRAY(12), along with corresponding 
backwards-compatibility rules?

Cheng

On 6/24/15 1:21 PM, Nathan Howell wrote:
> On 6/24/15, 1:17 PM, "Ryan Blue" <bl...@cloudera.com> wrote:
>
>> :(
>>
>> We'll want to deprecate those and move away from them. We're trying to
>> get support for real timestamps, along with backward-compatibility for
>> existing data, as soon as possible. I'm trying to get a commitment for
>> the next point release of CDH to fix it.
> Actually it seems to have been added in 1.3.0, not 1.4.0:
>
> https://issues.apache.org/jira/browse/SPARK-4987
>
>
> -n

Re: Is INT96 just an alias of FIXED_LENGTH_BYTE_ARRAY(12)?

Posted by Nathan Howell <nh...@godaddy.com>.

On 6/24/15, 1:17 PM, "Ryan Blue" <bl...@cloudera.com> wrote:

>:(
>
>We'll want to deprecate those and move away from them. We're trying to 
>get support for real timestamps, along with backward-compatibility for 
>existing data, as soon as possible. I'm trying to get a commitment for 
>the next point release of CDH to fix it.

Actually it seems to have been added in 1.3.0, not 1.4.0:

https://issues.apache.org/jira/browse/SPARK-4987


-n

Re: Is INT96 just an alias of FIXED_LENGTH_BYTE_ARRAY(12)?

Posted by Ryan Blue <bl...@cloudera.com>.

:(

We'll want to deprecate those and move away from them. We're trying to 
get support for real timestamps, along with backward-compatibility for 
existing data, as soon as possible. I'm trying to get a commitment for 
the next point release of CDH to fix it.

rb

On 06/24/2015 01:02 PM, Nathan Howell wrote:
>
> On 6/24/15, 12:39 PM, "Ryan Blue" <bl...@cloudera.com> wrote:
>> The only place it is used is for the Impala INT96 timestamp type. That
>> happened because we (Cloudera) didn't discuss how to properly store
>> timestamps with the upstream community. The implementers needed a way to
>> write the type and know it was the timestamp, and using INT96 for that
>> purpose seemed like a good idea at the time, I guess.
>
> Spark 1.4.x (recently released) is using INT96 for timestamps as well.
>
>
> -n
>


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: Is INT96 just an alias of FIXED_LENGTH_BYTE_ARRAY(12)?

Posted by Nathan Howell <nh...@godaddy.com>.

On 6/24/15, 12:39 PM, "Ryan Blue" <bl...@cloudera.com> wrote:
>The only place it is used is for the Impala INT96 timestamp type. That 
>happened because we (Cloudera) didn't discuss how to properly store 
>timestamps with the upstream community. The implementers needed a way to 
>write the type and know it was the timestamp, and using INT96 for that 
>purpose seemed like a good idea at the time, I guess.

Spark 1.4.x (recently released) is using INT96 for timestamps as well.


-n

Re: Is INT96 just an alias of FIXED_LENGTH_BYTE_ARRAY(12)?

Posted by Ryan Blue <bl...@cloudera.com>.

INT96 *should* be reserved for actual integers and not a fixed(12). The 
type implies that it is a single big number.

The only place it is used is for the Impala INT96 timestamp type. That 
happened because we (Cloudera) didn't discuss how to properly store 
timestamps with the upstream community. The implementers needed a way to 
write the type and know it was the timestamp, and using INT96 for that 
purpose seemed like a good idea at the time, I guess.

The right way to add a type would have been to discuss the type with the 
upstream community and add an annotation, along with rules for where 
that annotation can be used. That allows us to use the right storage 
(e.g., a 12-byte fixed) and tell the type apart from other data with the 
same physical type. That's what we're doing with all new types these days.
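
As a toy illustration of the annotation approach described above, the sketch 
below models a column as a physical type plus an optional annotation, with a 
rule table saying where each annotation may appear. Everything here is 
hypothetical: the classes are not parquet-mr API, and TIMESTAMP_NANOS over a 
12-byte fixed is only the idea floated in this thread, not something in the 
format spec. The physical type is written FIXED_LEN_BYTE_ARRAY, as the 
format spec spells it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    physical_type: str                # e.g. "FIXED_LEN_BYTE_ARRAY"
    type_length: Optional[int] = None
    annotation: Optional[str] = None  # e.g. "TIMESTAMP_NANOS" (hypothetical)


# Hypothetical rule table: annotation -> (required physical type, required length).
ANNOTATION_RULES = {
    "TIMESTAMP_NANOS": ("FIXED_LEN_BYTE_ARRAY", 12),
}


def validate(col: Column) -> None:
    """Reject an annotation that is used on the wrong physical type."""
    if col.annotation is None:
        return
    expected = ANNOTATION_RULES.get(col.annotation)
    if expected is None:
        raise ValueError(f"unknown annotation {col.annotation}")
    if (col.physical_type, col.type_length) != expected:
        raise ValueError(f"{col.annotation} not allowed on {col.physical_type}")


def interpret(col: Column) -> str:
    """Decide how a reader should treat the column's values."""
    validate(col)
    return "timestamp" if col.annotation == "TIMESTAMP_NANOS" else "bytes"


ts = Column("event_time", "FIXED_LEN_BYTE_ARRAY", 12, "TIMESTAMP_NANOS")
blob = Column("digest", "FIXED_LEN_BYTE_ARRAY", 12)
```

Both columns share the same 12-byte storage, yet `interpret(ts)` and 
`interpret(blob)` differ, which is the distinction that a bare INT96 column 
cannot express.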

rb

On 06/24/2015 12:34 PM, Cheng Lian wrote:
> Hey Parquet devs,
>
> It seems that in parquet-mr, INT96 is always treated as
> FIXED_LENGTH_BYTE_ARRAY(12). I wonder is it reasonable to say that INT96
> is just a convenient alias of FIXED_LENGTH_BYTE_ARRAY(12)? Are there any
> semantics/performance differences? Currently, the only case where I
> found INT96 is useful is for representing timestamp type with nanosec
> precision in Impala. Did I miss something here?
>
> Best,
> Cheng

-- 
Ryan Blue
Software Engineer
Cloudera, Inc.