Posted to dev@orc.apache.org by Dain Sundstrom <da...@iq80.com> on 2019/03/19 20:30:54 UTC

Type length, scale, and precision?

For the types in the ORC footer, we have the following:

 // the maximum length of the type for varchar or char in UTF-8 characters
 optional uint32 maximumLength = 4;
 // the precision and scale for decimal
 optional uint32 precision = 5;
 optional uint32 scale = 6;

If maximumLength is set to N, can I be confident that no value for that column in the file will contain more than N UTF-8 characters?  Is this still true for concatenated ORC files?
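
For reference, this footer metadata is visible through the ORC Java API. Here is a minimal sketch (the file path and column layout are made up; it assumes the root type is a struct):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.orc.OrcFile;
  import org.apache.orc.Reader;
  import org.apache.orc.TypeDescription;

  public class InspectTypes {
    public static void main(String[] args) throws Exception {
      Reader reader = OrcFile.createReader(
          new Path("/tmp/example.orc"),              // hypothetical path
          OrcFile.readerOptions(new Configuration()));
      TypeDescription schema = reader.getSchema();
      // Walk the top-level columns and print the fields in question.
      for (TypeDescription col : schema.getChildren()) {
        switch (col.getCategory()) {
          case VARCHAR:
          case CHAR:
            System.out.println("maximumLength = " + col.getMaxLength());
            break;
          case DECIMAL:
            System.out.println("precision = " + col.getPrecision()
                + ", scale = " + col.getScale());
            break;
          default:
            break;
        }
      }
    }
  }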

I have a similar question about DECIMAL.  Decimal encoding currently uses the SECONDARY stream to encode the "scale".  Is this scale guaranteed to be the same scale as the type scale in the footer?

Thanks,

-dain


----
Dain Sundstrom
Co-founder @ Presto Software Foundation, Co-creator of Presto (https://prestosql.io)


Re: Type length, scale, and precision?

Posted by Owen O'Malley <ow...@gmail.com>.
Sorry, I managed to miss this message.

On Tue, Mar 19, 2019 at 9:31 PM Dain Sundstrom <da...@iq80.com> wrote:

> For the types in the ORC footer, we have the following:
>
>  // the maximum length of the type for varchar or char in UTF-8 characters
>  optional uint32 maximumLength = 4;
>  // the precision and scale for decimal
>  optional uint32 precision = 5;
>  optional uint32 scale = 6;
>
> If maximumLength is set to N, can I be confident that no value for
> that column in the file will contain more than N UTF-8 characters?  Is this
> still true for concatenated ORC files?
>

Yes. The merger should insist that the schemas are the same for all merged
files. We could consider loosening that restriction, but in all cases the
length of the values must not exceed the declared length in the footer.

Until recently we had a bug that was truncating to N bytes instead of N
UTF-8 characters. That was a mistake.
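
To illustrate the difference the bug made, here is a minimal sketch (the sample string is made up); truncating to N bytes can cut a multi-byte character in half, while truncating to N UTF-8 characters cannot:

  import java.nio.charset.StandardCharsets;

  public class TruncationDemo {
    public static void main(String[] args) {
      String s = "héllo";  // 5 UTF-8 characters, but 6 bytes ('é' is 2 bytes)
      int chars = s.codePointCount(0, s.length());
      int bytes = s.getBytes(StandardCharsets.UTF_8).length;
      System.out.println(chars + " characters, " + bytes + " bytes");
      // For a char(5)/varchar(5) column this value fits; truncating to
      // 5 *bytes* would instead split the 'é' and corrupt the string.
    }
  }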


> I have a similar question about DECIMAL.  Decimal encoding currently uses
> the SECONDARY stream to encode the "scale".  Is this scale guaranteed to be
> the same scale as the type scale in the footer?
>

In Hive 0.11 the decimal values didn't have a declared scale. That is why
the scale is encoded per value. For short decimals (p <= 18) in recent
Hive/ORC versions, you'll have that guarantee. Otherwise, it still uses the
HiveDecimalWritable code, which removes trailing zeros, so the scale for a
value may be less than the declared scale.
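
So a reader that wants every value at the declared scale has to rescale using the per-value scale from the SECONDARY stream. A minimal sketch of that arithmetic (the unscaled value and scales below are made-up examples):

  import java.math.BigDecimal;
  import java.math.BigInteger;

  public class RescaleDemo {
    public static void main(String[] args) {
      int declaredScale = 4;                           // scale from the footer type
      BigInteger unscaled = BigInteger.valueOf(125);   // from the DATA stream
      int storedScale = 1;                             // from the SECONDARY stream
      // 12.5000 had its trailing zeros removed, so it was stored as
      // 125 with scale 1 (i.e. 12.5). Rescale back to the declared scale.
      BigDecimal value = new BigDecimal(unscaled, storedScale)
          .setScale(declaredScale);                    // 12.5000
      System.out.println(value);
    }
  }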


> Thanks,
>
> -dain
>
>
> ----
> Dain Sundstrom
> Co-founder @ Presto Software Foundation, Co-creator of Presto (
> https://prestosql.io)
>
>