Posted to dev@parquet.apache.org by Cheng Lian <li...@gmail.com> on 2015/06/24 21:34:49 UTC
Is INT96 just an alias of FIXED_LENGTH_BYTE_ARRAY(12)?
Hey Parquet devs,
It seems that in parquet-mr, INT96 is always treated as
FIXED_LENGTH_BYTE_ARRAY(12). I wonder whether it is reasonable to say
that INT96 is just a convenient alias of FIXED_LENGTH_BYTE_ARRAY(12).
Are there any semantic or performance differences? Currently, the only
case where I have found INT96 useful is representing timestamps with
nanosecond precision in Impala. Did I miss something here?
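
For readers unfamiliar with the timestamp case, here is a minimal sketch of
how the 12 bytes are commonly interpreted. It assumes the layout generally
attributed to Impala-written files: 8 bytes of little-endian nanoseconds
within the day, followed by 4 bytes of little-endian Julian day number. The
function names are hypothetical and are not parquet-mr or Impala APIs.

```python
import struct

# Julian day number of 1970-01-01 (the Unix epoch).
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
NANOS_PER_DAY = 86_400 * 10**9


def encode_int96_timestamp(epoch_nanos: int) -> bytes:
    """Pack nanoseconds since the Unix epoch into the assumed 12-byte layout."""
    days, nanos_of_day = divmod(epoch_nanos, NANOS_PER_DAY)
    # "<qi": little-endian, no padding, int64 nanos-of-day then int32 Julian day.
    return struct.pack("<qi", nanos_of_day, days + JULIAN_DAY_OF_UNIX_EPOCH)


def decode_int96_timestamp(raw: bytes) -> int:
    """Unpack the assumed 12-byte layout into nanoseconds since the Unix epoch."""
    if len(raw) != 12:
        raise ValueError("expected exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day
```

Under this reading the value is two packed fields rather than one 96-bit
integer, which is why treating it as a 12-byte fixed array loses nothing.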
Best,
Cheng
Re: Is INT96 just an alias of FIXED_LENGTH_BYTE_ARRAY(12)?
Posted by Cheng Lian <li...@gmail.com>.
Yeah, initial nanosec timestamp support in Spark SQL follows Impala and
uses INT96 to improve interoperability with Impala. In Spark
1.5.0-SNAPSHOT (the current master branch), although we still write
timestamps as INT96, internally Spark SQL only uses a LONG to represent
timestamps for better performance. The cost is that the precision is
lowered to 100ns.
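
A minimal sketch of that trade-off, assuming timestamps are kept as a count
of 100ns units since the Unix epoch in a signed 64-bit value. The names are
hypothetical and this is not Spark's actual code.

```python
NANOS_PER_UNIT = 100  # one stored unit = 100 nanoseconds


def to_100ns_units(epoch_nanos: int) -> int:
    """Convert nanoseconds to 100ns units, flooring away the last two digits."""
    return epoch_nanos // NANOS_PER_UNIT


def from_100ns_units(units: int) -> int:
    """Convert 100ns units back to nanoseconds; the dropped digits stay lost."""
    return units * NANOS_PER_UNIT


def fits_in_int64(value: int) -> bool:
    """Check that a value is representable as a signed 64-bit LONG."""
    return -(2**63) <= value < 2**63
```

For example, `from_100ns_units(to_100ns_units(1_000_000_123))` gives
`1_000_000_100`: the sub-100ns part is gone, but the value fits in a single
LONG instead of 12 bytes.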
Since INT96 is being deprecated, what's the suggested/planned way to
read/write high-precision nanosecond timestamps then? Spark SQL, Hive,
and Impala all have a nanosecond timestamp type, while the Parquet
format spec doesn't include one (only TIMESTAMP_MILLIS and
TIMESTAMP_MICROS are available for now). Should we add a TIMESTAMP_NANOS
annotation over FIXED_LENGTH_BYTE_ARRAY(12), along with corresponding
backwards-compatibility rules?
Cheng
On 6/24/15 1:21 PM, Nathan Howell wrote:
> On 6/24/15, 1:17 PM, "Ryan Blue" <bl...@cloudera.com> wrote:
>
>> :(
>>
>> We'll want to deprecate those and move away from them. We're trying to
>> get support for real timestamps, along with backward-compatibility for
>> existing data, as soon as possible. I'm trying to get a commitment for
>> the next point release of CDH to fix it.
> Actually it seems to have been added in 1.3.0, not 1.4.0:
>
> https://issues.apache.org/jira/browse/SPARK-4987
>
>
> -n
Re: Is INT96 just an alias of FIXED_LENGTH_BYTE_ARRAY(12)?
Posted by Nathan Howell <nh...@godaddy.com>.
On 6/24/15, 1:17 PM, "Ryan Blue" <bl...@cloudera.com> wrote:
>:(
>
>We'll want to deprecate those and move away from them. We're trying to
>get support for real timestamps, along with backward-compatibility for
>existing data, as soon as possible. I'm trying to get a commitment for
>the next point release of CDH to fix it.
Actually it seems to have been added in 1.3.0, not 1.4.0:
https://issues.apache.org/jira/browse/SPARK-4987
-n
Re: Is INT96 just an alias of FIXED_LENGTH_BYTE_ARRAY(12)?
Posted by Ryan Blue <bl...@cloudera.com>.
:(
We'll want to deprecate those and move away from them. We're trying to
get support for real timestamps, along with backward-compatibility for
existing data, as soon as possible. I'm trying to get a commitment for
the next point release of CDH to fix it.
rb
On 06/24/2015 01:02 PM, Nathan Howell wrote:
>
> On 6/24/15, 12:39 PM, "Ryan Blue" <bl...@cloudera.com> wrote:
>> The only place it is used is for the Impala INT96 timestamp type. That
>> happened because we (Cloudera) didn't discuss how to properly store
>> timestamps with the upstream community. The implementers needed a way to
>> write the type and know it was the timestamp, and using INT96 for that
>> purpose seemed like a good idea at the time, I guess.
>
> Spark 1.4.x (recently released) is using INT96 for timestamps as well.
>
>
> -n
>
--
Ryan Blue
Software Engineer
Cloudera, Inc.
Re: Is INT96 just an alias of FIXED_LENGTH_BYTE_ARRAY(12)?
Posted by Nathan Howell <nh...@godaddy.com>.
On 6/24/15, 12:39 PM, "Ryan Blue" <bl...@cloudera.com> wrote:
>The only place it is used is for the Impala INT96 timestamp type. That
>happened because we (Cloudera) didn't discuss how to properly store
>timestamps with the upstream community. The implementers needed a way to
>write the type and know it was the timestamp, and using INT96 for that
>purpose seemed like a good idea at the time, I guess.
Spark 1.4.x (recently released) is using INT96 for timestamps as well.
-n
Re: Is INT96 just an alias of FIXED_LENGTH_BYTE_ARRAY(12)?
Posted by Ryan Blue <bl...@cloudera.com>.
INT96 *should* be reserved for actual integers and not a fixed(12). The
type implies that it is a single big number.
The only place it is used is for the Impala INT96 timestamp type. That
happened because we (Cloudera) didn't discuss how to properly store
timestamps with the upstream community. The implementers needed a way to
write the type and know it was the timestamp, and using INT96 for that
purpose seemed like a good idea at the time, I guess.
The right way to add a type would have been to discuss the type with the
upstream community and add an annotation, along with rules for where
that annotation can be used. That allows us to use the right storage
(e.g., a 12-byte fixed) and tell the type apart from other data with the
same physical type. That's what we're doing with all new types these days.
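
To illustrate the idea, here is a minimal sketch of a physical type paired
with an optional annotation. The class, the function, and the
TIMESTAMP_NANOS annotation name are all hypothetical and are not part of the
Parquet format spec; FIXED_LEN_BYTE_ARRAY is the spec's name for a
fixed-length byte array.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnType:
    physical: str                      # storage type, e.g. "FIXED_LEN_BYTE_ARRAY"
    length: Optional[int] = None       # byte length for fixed-width storage
    annotation: Optional[str] = None   # logical meaning layered on the storage


def describe(col: ColumnType) -> str:
    """Say how a reader should interpret the column's bytes."""
    if col.annotation is None:
        return f"raw {col.physical}({col.length})"
    return f"{col.annotation} stored as {col.physical}({col.length})"


# Same physical storage, told apart only by the annotation.
opaque = ColumnType("FIXED_LEN_BYTE_ARRAY", 12)
timestamp = ColumnType("FIXED_LEN_BYTE_ARRAY", 12, "TIMESTAMP_NANOS")  # hypothetical annotation
```

Both columns are read with the same 12-byte decoder; only the annotation
tells a reader that one of them is a timestamp.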
rb
On 06/24/2015 12:34 PM, Cheng Lian wrote:
> Hey Parquet devs,
>
> It seems that in parquet-mr, INT96 is always treated as
> FIXED_LENGTH_BYTE_ARRAY(12). I wonder is it reasonable to say that INT96
> is just a convenient alias of FIXED_LENGTH_BYTE_ARRAY(12)? Are there any
> semantics/performance differences? Currently, the only case where I
> found INT96 is useful is for representing timestamp type with nanosec
> precision in Impala. Did I miss something here?
>
> Best,
> Cheng
--
Ryan Blue
Software Engineer
Cloudera, Inc.