Posted to dev@orc.apache.org by Owen O'Malley <om...@apache.org> on 2016/09/07 15:57:23 UTC

Bloom filter hash broken

All,
   Dain Sundstrom pointed out to me in personal email that the ORC bloom
filters are currently using the default character encoding. That makes the
bloom filters non-portable between different computers that use different
default encodings. I've filed ORC-101 to address it, but I want to have a
wider discussion. I'd propose that we:

1. create a new WriterVersion for ORC-101.
2. move the bloom filter code from storage-api into ORC.
3. consistently use UTF-8 when creating new bloom filters
4. for ORC files older than ORC-101, test the default encoding instead of
UTF-8

Thoughts?

.. Owen
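
For concreteness, the bug boils down to calling String.getBytes() with no
explicit charset when hashing values into the filter. A minimal Java sketch
of the difference (the method names here are illustrative, not the actual
ORC code):

    import java.nio.charset.StandardCharsets;

    public class BloomFilterCharset {
      // Broken: with no charset argument, getBytes() uses the JVM default
      // (file.encoding), so the same string can hash to different bloom
      // filter bits on machines with different locale settings.
      static byte[] keyBytesBroken(String value) {
        return value.getBytes();
      }

      // Proposed: always encode with UTF-8, so writers and readers agree
      // on the bytes being hashed regardless of platform defaults.
      static byte[] keyBytesFixed(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
      }
    }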

Re: Bloom filter hash broken

Posted by Owen O'Malley <om...@apache.org>.
One more wrinkle on this that I just discovered is that I inadvertently
fixed the problem for strings (but not decimals) as part of HIVE-12055,
which was released as part of Hive 2.1. So that means we currently have two
versions out there:

WriterVersion    String charset   Decimal charset
< HIVE-12055     jvm default      jvm default
>= HIVE-12055    UTF-8            jvm default

So I'm going ahead with the BLOOM_FILTER_UTF8 stream (and optionally the
old BLOOM_FILTER stream), but I'll make the reader recognize the
WriterVersion >= HIVE-12055 and trust BLOOM_FILTER for string bloom
filters, but not the decimal ones.

.. Owen
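
In code form, the reader-side rule described above might look like this
sketch (the enum constants and helper are illustrative, not the exact ORC
API):

    class BloomFilterTrust {
      // Writer versions in chronological order, per the table above.
      enum WriterVersion { ORIGINAL, HIVE_12055, ORC_101 }

      // May UTF-8 key bytes be checked against an old BLOOM_FILTER stream?
      static boolean oldBloomFilterIsUtf8(WriterVersion version,
                                          boolean isStringColumn) {
        // HIVE-12055 fixed the charset for strings only; decimal bloom
        // filters kept the JVM default charset until ORC-101.
        return isStringColumn
            && version.compareTo(WriterVersion.HIVE_12055) >= 0;
      }
    }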


On Thu, Sep 8, 2016 at 7:25 PM, Dain Sundstrom <da...@iq80.com> wrote:

> Sounds good to me.
>
> Should we add a version field to the BLOOM_FILTER_UTF8 to deal with any
> future problems?
>
> One other thought, in the protobuf definition I think it would be more
> efficient to have the bitset encoded as a byte[] to avoid the boxed long
> array.
>
> -dain
>
> > On Sep 8, 2016, at 3:42 PM, Owen O'Malley <om...@apache.org> wrote:
> >
> > Dain,
> >   That is a great point. I wasn't thinking about having to implement that
> > in C++, where it would really suck. (It was hard enough dealing with the
> > timezones in C++, so I should know better!) I got the approach from this
> > morning working and pushed it as a branch
> > https://github.com/omalley/orc/commit/38752621863bf8dc1f05a6e7d34552969395e5f5
> > .
> >
> >  The heaviest hammer would be to just create a new file version, but that
> > seems like overkill for this. (Although sooner or later we will get
> > there, since we need new encodings for decimal.)
> >
> >  Ok, so how about:
> >
> > 1. We create a new stream kind (BLOOM_FILTER_UTF8) that always has UTF-8
> > based bloom filters. It will be used for strings, chars, varchars, and
> > decimal. (Hopefully all charsets use the same bytes for ASCII characters,
> > but I don't want to find the strange exceptions.)
> >
> > 2. We create a new config knob that lets you write both BLOOM_FILTER and
> > BLOOM_FILTER_UTF8 for users while they are transitioning.
> >
> > 3. The reader prefers BLOOM_FILTER_UTF8, but will fall back to
> > BLOOM_FILTER if it is an old file.
> >
> > Thoughts?
> >
> > .. Owen
> >
> > On Thu, Sep 8, 2016 at 11:02 AM, Dain Sundstrom <da...@iq80.com> wrote:
> >
> >>> On Sep 8, 2016, at 9:59 AM, Owen O'Malley <om...@apache.org> wrote:
> >>>
> >>> Ok, Prasanth found a problem with my proposed approach. In particular,
> >>> the old readers would misinterpret bloom filters from new files.
> >>> Therefore, I'd like to propose a more complicated solution:
> >>> 1. We extend the stripe footer or bloom filter index to record the
> >>> default encoding when we are writing a string or decimal bloom filter.
> >>> 2. When reading a bloom filter, we use the encoding if it is present.
> >>
> >> Does that mean that you always write with the platform encoding?  This
> >> would make using the bloom filters for reads in other programming
> >> languages difficult, because you would need to transcode from UTF_8 to
> >> some arbitrary character encoding.  This will also make using these
> >> bloom filters in performance critical sections (join loops)
> >> computationally expensive, as you have to do a transcode.
> >>
> >> Also, I think the spec needs to be clarified.  The spec does not state
> >> the character encoding of the bloom filters.  I assumed it was UTF_8 to
> >> match the normal string column encoding.  It looks like the spec does
> >> not document the meaning of "the version of the writer" and what
> >> workarounds are necessary (or what operating assumptions have been
> >> made).  Once we have that, we should document that old readers assume
> >> that the platform default charset is consistent for readers and writers.
> >>
> >> As an alternative, for new files we could add a new stream ID, so the
> >> old readers skip them.
> >>
> >> -dain
>
>

Re: Bloom filter hash broken

Posted by Dain Sundstrom <da...@iq80.com>.
Sounds good to me.

Should we add a version field to the BLOOM_FILTER_UTF8 to deal with any future problems?

One other thought, in the protobuf definition I think it would be more efficient to have the bitset encoded as a byte[] to avoid the boxed long array.

-dain
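
To illustrate the byte[] suggestion: a protobuf `bytes` field arrives in
Java as one contiguous buffer, while a `repeated` 64-bit field surfaces as
a boxed List<Long>. A sketch of packing the filter's long[] bitset (the
little-endian layout here is an assumption, just one reasonable choice):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    class BitsetPacking {
      static byte[] packBitset(long[] bitset) {
        ByteBuffer buf = ByteBuffer.allocate(bitset.length * Long.BYTES)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (long word : bitset) {
          buf.putLong(word);  // 8 bytes per word, no boxing
        }
        return buf.array();
      }
    }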
 
> On Sep 8, 2016, at 3:42 PM, Owen O'Malley <om...@apache.org> wrote:
> 
> Dain,
>   That is a great point. I wasn't thinking about having to implement that
> in C++, where it would really suck. (It was hard enough dealing with the
> timezones in C++, so I should know better!) I got the approach from this
> morning working and pushed it as a branch
> https://github.com/omalley/orc/commit/38752621863bf8dc1f05a6e7d34552969395e5f5
> .
> 
>  The heaviest hammer would be to just create a new file version, but that
> seems like overkill for this. (Although sooner or later we will get there,
> since we need new encodings for decimal.)
> 
>  Ok, so how about:
> 
> 1. We create a new stream kind (BLOOM_FILTER_UTF8) that always has UTF-8
> based bloom filters. It will be used for strings, chars, varchars, and
> decimal. (Hopefully all charsets use the same bytes for ASCII characters,
> but I don't want to find the strange exceptions.)
> 
> 2. We create a new config knob that lets you write both BLOOM_FILTER and
> BLOOM_FILTER_UTF8 for users while they are transitioning.
> 
> 3. The reader prefers BLOOM_FILTER_UTF8, but will fall back to BLOOM_FILTER
> if it is an old file.
> 
> Thoughts?
> 
> .. Owen
> 
> On Thu, Sep 8, 2016 at 11:02 AM, Dain Sundstrom <da...@iq80.com> wrote:
> 
>>> On Sep 8, 2016, at 9:59 AM, Owen O'Malley <om...@apache.org> wrote:
>>> 
>>> Ok, Prasanth found a problem with my proposed approach. In particular,
>>> the old readers would misinterpret bloom filters from new files.
>>> Therefore, I'd like to propose a more complicated solution:
>>> 1. We extend the stripe footer or bloom filter index to record the
>>> default encoding when we are writing a string or decimal bloom filter.
>>> 2. When reading a bloom filter, we use the encoding if it is present.
>> 
>> Does that mean that you always write with the platform encoding?  This
>> would make using the bloom filters for reads in other programming
>> languages difficult, because you would need to transcode from UTF_8 to
>> some arbitrary character encoding.  This will also make using these bloom
>> filters in performance critical sections (join loops) computationally
>> expensive, as you have to do a transcode.
>> 
>> Also, I think the spec needs to be clarified.  The spec does not state the
>> character encoding of the bloom filters.  I assumed it was UTF_8 to match
>> the normal string column encoding.  It looks like the spec does not
>> document the meaning of "the version of the writer" and what workarounds
>> are necessary (or what operating assumptions have been made).  Once we
>> have that, we should document that old readers assume that the platform
>> default charset is consistent for readers and writers.
>> 
>> As an alternative, for new files we could add a new stream ID, so the
>> old readers skip them.
>> 
>> -dain


Re: Bloom filter hash broken

Posted by Owen O'Malley <om...@apache.org>.
Dain,
   That is a great point. I wasn't thinking about having to implement that
in C++, where it would really suck. (It was hard enough dealing with the
timezones in C++, so I should know better!) I got the approach from this
morning working and pushed it as a branch
https://github.com/omalley/orc/commit/38752621863bf8dc1f05a6e7d34552969395e5f5
.

  The heaviest hammer would be to just create a new file version, but that
seems like overkill for this. (Although sooner or later we will get there,
since we need new encodings for decimal.)

  Ok, so how about:

1. We create a new stream kind (BLOOM_FILTER_UTF8) that always has UTF-8
based bloom filters. It will be used for strings, chars, varchars, and
decimal. (Hopefully all charsets use the same bytes for ASCII characters,
but I don't want to find the strange exceptions.)

2. We create a new config knob that lets you write both BLOOM_FILTER and
BLOOM_FILTER_UTF8 for users while they are transitioning.

3. The reader prefers BLOOM_FILTER_UTF8, but will fall back to BLOOM_FILTER
if it is an old file.

Thoughts?

.. Owen
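
The reader preference in point 3 could be as simple as this sketch (the
stream-kind names follow the proposal; the Map-based lookup is a stand-in
for however the reader actually indexes streams):

    import java.util.Map;

    class BloomFilterStreams {
      enum Kind { BLOOM_FILTER, BLOOM_FILTER_UTF8 }

      // Prefer the charset-safe stream; fall back to the legacy one so
      // old files keep working.
      static byte[] chooseBloomFilter(Map<Kind, byte[]> streams) {
        byte[] utf8 = streams.get(Kind.BLOOM_FILTER_UTF8);
        return utf8 != null ? utf8 : streams.get(Kind.BLOOM_FILTER);
      }
    }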

On Thu, Sep 8, 2016 at 11:02 AM, Dain Sundstrom <da...@iq80.com> wrote:

> > On Sep 8, 2016, at 9:59 AM, Owen O'Malley <om...@apache.org> wrote:
> >
> > Ok, Prasanth found a problem with my proposed approach. In particular,
> > the old readers would misinterpret bloom filters from new files.
> > Therefore, I'd like to propose a more complicated solution:
> > 1. We extend the stripe footer or bloom filter index to record the
> > default encoding when we are writing a string or decimal bloom filter.
> > 2. When reading a bloom filter, we use the encoding if it is present.
>
> Does that mean that you always write with the platform encoding?  This
> would make using the bloom filters for reads in other programming
> languages difficult, because you would need to transcode from UTF_8 to
> some arbitrary character encoding.  This will also make using these bloom
> filters in performance critical sections (join loops) computationally
> expensive, as you have to do a transcode.
>
> Also, I think the spec needs to be clarified.  The spec does not state the
> character encoding of the bloom filters.  I assumed it was UTF_8 to match
> the normal string column encoding.  It looks like the spec does not
> document the meaning of "the version of the writer" and what workarounds
> are necessary (or what operating assumptions have been made).  Once we
> have that, we should document that old readers assume that the platform
> default charset is consistent for readers and writers.
>
> As an alternative, for new files we could add a new stream ID, so the
> old readers skip them.
>
> -dain

Re: Bloom filter hash broken

Posted by Dain Sundstrom <da...@iq80.com>.
> On Sep 8, 2016, at 9:59 AM, Owen O'Malley <om...@apache.org> wrote:
> 
> Ok, Prasanth found a problem with my proposed approach. In particular, the
> old readers would misinterpret bloom filters from new files. Therefore, I'd
> like to propose a more complicated solution:
> 1. We extend the stripe footer or bloom filter index to record the default
> encoding when we are writing a string or decimal bloom filter.
> 2. When reading a bloom filter, we use the encoding if it is present.

Does that mean that you always write with the platform encoding?  This would make using the bloom filters for reads in other programming languages difficult, because you would need to transcode from UTF_8 to some arbitrary character encoding.  This will also make using these bloom filters in performance critical sections (join loops) computationally expensive, as you have to do a transcode.

Also, I think the spec needs to be clarified.  The spec does not state the character encoding of the bloom filters.  I assumed it was UTF_8 to match the normal string column encoding.  It looks like the spec does not document the meaning of "the version of the writer" and what workarounds are necessary (or what operating assumptions have been made).  Once we have that, we should document that old readers assume that the platform default charset is consistent for readers and writers.

As an alternative, for new files we could add a new stream ID, so the old readers skip them.

-dain
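
The transcode cost described above would look roughly like this on the
read side if platform encodings were kept (a hypothetical sketch; the
probe path is illustrative):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    class BloomFilterTranscode {
      // If old filters keep the writer's platform charset, every probe of
      // a UTF-8 value must decode and then re-encode -- two extra passes
      // over the bytes, inside a join loop.
      static byte[] transcodeForProbe(byte[] utf8Bytes, Charset writerCharset) {
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        return decoded.getBytes(writerCharset);
      }
    }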

Re: Bloom filter hash broken

Posted by Owen O'Malley <om...@apache.org>.
Ok, Prasanth found a problem with my proposed approach. In particular, the
old readers would misinterpret bloom filters from new files. Therefore, I'd
like to propose a more complicated solution:
1. We extend the stripe footer or bloom filter index to record the default
encoding when we are writing a string or decimal bloom filter.
2. When reading a bloom filter, we use the encoding if it is present.
3. I'd still like to bump the WriterVersion for ORC-101.

Thoughts?

.. Owen


On Wed, Sep 7, 2016 at 1:08 PM, Owen O'Malley <om...@apache.org> wrote:

> To expand on Prasanth's answer, in ORC we have both a format version,
> which is the oldest version of the reader that can read the file (e.g.,
> 0.11 and 0.12), and the writer version, which keeps track of which
> version of the software wrote the file, denoted by the JIRAs where there
> were significant changes in the writer (e.g., original, hive-8732,
> hive-4243, hive-12055, hive-13083, and now orc-101). The reader uses the
> writer version to work around issues like this.
>
> .. Owen
>
> On Wed, Sep 7, 2016 at 12:08 PM, Prasanth Jayachandran <
> j.prasanth.j@gmail.com> wrote:
>
>> +1 to bump up the writer version to facilitate correct ppd for older
>> versions.
>> Alan - PPD will have to look at the writer version to detect old files.
>> Newer files will have writer version as ORC-101.
>>
>> Thanks
>> Prasanth
>>
>> On Wed, Sep 7, 2016 at 1:12 PM -0500, "Alan Gates" <al...@gmail.com>
>> wrote:
>>
>> I think using the default encoding for the old files is the best option,
>> as it will be right 99% of the time.  I was wondering how the system would
>> know whether or not this was an old file.
>>
>> Alan.
>>
>> > On Sep 7, 2016, at 10:06, Owen O'Malley  wrote:
>> >
>> > 4 is about when you are using the bloom filter for predicate push down.
>> > I'm saying old files should use the default encoding when checking the
>> > bloom filter. The other option is to always have the predicate push down
>> > say maybe if the file is an old one.
>> >
>> > .. Owen
>> >
>> > On Wed, Sep 7, 2016 at 9:34 AM, Alan Gates  wrote:
>> >
>> >> +1 to 1-3.  On 4, what do you mean by test?  Assume it’s the default
>> >> encoding and use that?  Is there a versioning concept in the bloom
>> >> filters that will make it easy to determine if this is pre or post
>> >> ORC-101?
>> >>
>> >> Alan.
>> >>
>> >>> On Sep 7, 2016, at 08:57, Owen O'Malley  wrote:
>> >>>
>> >>> All,
>> >>> Dain Sundstrom pointed out to me in personal email that the ORC bloom
>> >>> filters are currently using the default character encoding. That
>> >>> makes the bloom filters non-portable between different computers that
>> >>> use different default encodings. I've filed ORC-101 to address it,
>> >>> but I want to have a wider discussion. I'd propose that we:
>> >>>
>> >>> 1. create a new WriterVersion for ORC-101.
>> >>> 2. move the bloom filter code from storage-api into ORC.
>> >>> 3. consistently use UTF-8 when creating new bloom filters
>> >>> 4. for ORC files older than ORC-101, test the default encoding
>> >>> instead of UTF-8
>> >>>
>> >>> Thoughts?
>> >>>
>> >>> .. Owen
>> >>
>> >>
>>
>

Re: Bloom filter hash broken

Posted by Owen O'Malley <om...@apache.org>.
To expand on Prasanth's answer, in ORC we have both a format version, which
is the oldest version of the reader that can read the file (e.g., 0.11 and
0.12), and the writer version, which keeps track of which version of the
software wrote the file, denoted by the JIRAs where there were significant
changes in the writer (e.g., original, hive-8732, hive-4243, hive-12055,
hive-13083, and now orc-101). The reader uses the writer version to work
around issues like this.

.. Owen
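
In Java terms, the two notions look roughly like this (a sketch modeled on
the enums in the ORC reader; treat the constant names as illustrative):

    // Format version: the oldest reader that can decode the file layout.
    enum Version { V_0_11, V_0_12 }

    // Writer version: which software wrote the file, keyed by the JIRA
    // that last changed writer behavior. Ordinal order is chronological,
    // so a reader can test whether a given fix was present.
    enum WriterVersion {
      ORIGINAL, HIVE_8732, HIVE_4243, HIVE_12055, HIVE_13083, ORC_101;

      boolean includes(WriterVersion fix) {
        return compareTo(fix) >= 0;  // e.g. writerVersion.includes(ORC_101)
      }
    }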

On Wed, Sep 7, 2016 at 12:08 PM, Prasanth Jayachandran <
j.prasanth.j@gmail.com> wrote:

> +1 to bump up the writer version to facilitate correct ppd for older
> versions.
> Alan - PPD will have to look at the writer version to detect old files.
> Newer files will have writer version as ORC-101.
>
> Thanks
> Prasanth
>
> On Wed, Sep 7, 2016 at 1:12 PM -0500, "Alan Gates" <al...@gmail.com>
> wrote:
>
> I think using the default encoding for the old files is the best option,
> as it will be right 99% of the time.  I was wondering how the system would
> know whether or not this was an old file.
>
> Alan.
>
> > On Sep 7, 2016, at 10:06, Owen O'Malley  wrote:
> >
> > 4 is about when you are using the bloom filter for predicate push down.
> > I'm saying old files should use the default encoding when checking the
> > bloom filter. The other option is to always have the predicate push down
> > say maybe if the file is an old one.
> >
> > .. Owen
> >
> > On Wed, Sep 7, 2016 at 9:34 AM, Alan Gates  wrote:
> >
> >> +1 to 1-3.  On 4, what do you mean by test?  Assume it’s the default
> >> encoding and use that?  Is there a versioning concept in the bloom
> >> filters that will make it easy to determine if this is pre or post
> >> ORC-101?
> >>
> >> Alan.
> >>
> >>> On Sep 7, 2016, at 08:57, Owen O'Malley  wrote:
> >>>
> >>> All,
> >>>  Dain Sundstrom pointed out to me in personal email that the ORC bloom
> >>> filters are currently using the default character encoding. That makes
> >>> the bloom filters non-portable between different computers that use
> >>> different default encodings. I've filed ORC-101 to address it, but I
> >>> want to have a wider discussion. I'd propose that we:
> >>>
> >>> 1. create a new WriterVersion for ORC-101.
> >>> 2. move the bloom filter code from storage-api into ORC.
> >>> 3. consistently use UTF-8 when creating new bloom filters
> >>> 4. for ORC files older than ORC-101, test the default encoding instead
> >>> of UTF-8
> >>>
> >>> Thoughts?
> >>>
> >>> .. Owen
> >>
> >>
>

Re: Bloom filter hash broken

Posted by Prasanth Jayachandran <j....@gmail.com>.
+1 to bump up the writer version to facilitate correct ppd for older versions. 
Alan - PPD will have to look at the writer version to detect old files. Newer files will have writer version as ORC-101. 

Thanks
Prasanth

On Wed, Sep 7, 2016 at 1:12 PM -0500, "Alan Gates" <al...@gmail.com> wrote:

I think using the default encoding for the old files is the best option, as it will be right 99% of the time.  I was wondering how the system would know whether or not this was an old file.

Alan.

> On Sep 7, 2016, at 10:06, Owen O'Malley  wrote:
> 
> 4 is about when you are using the bloom filter for predicate push down. I'm
> saying old files should use the default encoding when checking the bloom
> filter. The other option is to always have the predicate push down say
> maybe if the file is an old one.
> 
> .. Owen
> 
> On Wed, Sep 7, 2016 at 9:34 AM, Alan Gates  wrote:
> 
>> +1 to 1-3.  On 4, what do you mean by test?  Assume it’s the default
>> encoding and use that?  Is there a versioning concept in the bloom filters
>> that will make it easy to determine if this is pre or post ORC-101?
>> 
>> Alan.
>> 
>>> On Sep 7, 2016, at 08:57, Owen O'Malley  wrote:
>>> 
>>> All,
>>>  Dain Sundstrom pointed out to me in personal email that the ORC bloom
>>> filters are currently using the default character encoding. That makes
>>> the bloom filters non-portable between different computers that use
>>> different default encodings. I've filed ORC-101 to address it, but I
>>> want to have a wider discussion. I'd propose that we:
>>> 
>>> 1. create a new WriterVersion for ORC-101.
>>> 2. move the bloom filter code from storage-api into ORC.
>>> 3. consistently use UTF-8 when creating new bloom filters
>>> 4. for ORC files older than ORC-101, test the default encoding instead of
>>> UTF-8
>>> 
>>> Thoughts?
>>> 
>>> .. Owen
Re: Bloom filter hash broken

Posted by Alan Gates <al...@gmail.com>.
I think using the default encoding for the old files is the best option, as it will be right 99% of the time.  I was wondering how the system would know whether or not this was an old file.

Alan.

> On Sep 7, 2016, at 10:06, Owen O'Malley <om...@apache.org> wrote:
> 
> 4 is about when you are using the bloom filter for predicate push down. I'm
> saying old files should use the default encoding when checking the bloom
> filter. The other option is to always have the predicate push down say
> maybe if the file is an old one.
> 
> .. Owen
> 
> On Wed, Sep 7, 2016 at 9:34 AM, Alan Gates <al...@gmail.com> wrote:
> 
>> +1 to 1-3.  On 4, what do you mean by test?  Assume it’s the default
>> encoding and use that?  Is there a versioning concept in the bloom filters
>> that will make it easy to determine if this is pre or post ORC-101?
>> 
>> Alan.
>> 
>>> On Sep 7, 2016, at 08:57, Owen O'Malley <om...@apache.org> wrote:
>>> 
>>> All,
>>>  Dain Sundstrom pointed out to me in personal email that the ORC bloom
>>> filters are currently using the default character encoding. That makes
>>> the bloom filters non-portable between different computers that use
>>> different default encodings. I've filed ORC-101 to address it, but I
>>> want to have a wider discussion. I'd propose that we:
>>> 
>>> 1. create a new WriterVersion for ORC-101.
>>> 2. move the bloom filter code from storage-api into ORC.
>>> 3. consistently use UTF-8 when creating new bloom filters
>>> 4. for ORC files older than ORC-101, test the default encoding instead of
>>> UTF-8
>>> 
>>> Thoughts?
>>> 
>>> .. Owen
>> 
>> 


Re: Bloom filter hash broken

Posted by Owen O'Malley <om...@apache.org>.
4 is about when you are using the bloom filter for predicate push down. I'm
saying old files should use the default encoding when checking the bloom
filter. The other option is to always have the predicate push down say
maybe if the file is an old one.

.. Owen
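
The two options for old files, sketched in Java (the TruthValue result is
modeled loosely on Hive's SARG evaluation; all names here are
illustrative):

    class OldFilePpd {
      // NO means "value definitely absent"; YES_NO means "maybe".
      enum TruthValue { NO, YES_NO }

      interface BloomFilterProbe { boolean mightContain(byte[] key); }

      // Option A (trustDefaultCharset = true): hash the literal with the
      // JVM default charset, matching what the old writer stored.
      // Option B: skip the filter and return "maybe" for every old file.
      static TruthValue checkOldFile(boolean trustDefaultCharset,
                                     BloomFilterProbe probe, String literal) {
        if (!trustDefaultCharset) {
          return TruthValue.YES_NO;       // option B: always maybe
        }
        byte[] key = literal.getBytes();  // default charset, as old writers did
        return probe.mightContain(key) ? TruthValue.YES_NO : TruthValue.NO;
      }
    }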

On Wed, Sep 7, 2016 at 9:34 AM, Alan Gates <al...@gmail.com> wrote:

> +1 to 1-3.  On 4, what do you mean by test?  Assume it’s the default
> encoding and use that?  Is there a versioning concept in the bloom filters
> that will make it easy to determine if this is pre or post ORC-101?
>
> Alan.
>
> > On Sep 7, 2016, at 08:57, Owen O'Malley <om...@apache.org> wrote:
> >
> > All,
> >   Dain Sundstrom pointed out to me in personal email that the ORC bloom
> > filters are currently using the default character encoding. That makes
> > the bloom filters non-portable between different computers that use
> > different default encodings. I've filed ORC-101 to address it, but I
> > want to have a wider discussion. I'd propose that we:
> >
> > 1. create a new WriterVersion for ORC-101.
> > 2. move the bloom filter code from storage-api into ORC.
> > 3. consistently use UTF-8 when creating new bloom filters
> > 4. for ORC files older than ORC-101, test the default encoding instead of
> > UTF-8
> >
> > Thoughts?
> >
> > .. Owen
>
>

Re: Bloom filter hash broken

Posted by Alan Gates <al...@gmail.com>.
+1 to 1-3.  On 4, what do you mean by test?  Assume it’s the default encoding and use that?  Is there a versioning concept in the bloom filters that will make it easy to determine if this is pre or post ORC-101?

Alan.

> On Sep 7, 2016, at 08:57, Owen O'Malley <om...@apache.org> wrote:
> 
> All,
>   Dain Sundstrom pointed out to me in personal email that the ORC bloom
> filters are currently using the default character encoding. That makes the
> bloom filters non-portable between different computers that use different
> default encodings. I've filed ORC-101 to address it, but I want to have a
> wider discussion. I'd propose that we:
> 
> 1. create a new WriterVersion for ORC-101.
> 2. move the bloom filter code from storage-api into ORC.
> 3. consistently use UTF-8 when creating new bloom filters
> 4. for ORC files older than ORC-101, test the default encoding instead of
> UTF-8
> 
> Thoughts?
> 
> .. Owen


Re: Bloom filter hash broken

Posted by Dain Sundstrom <da...@fb.com>.
> On Sep 7, 2016, at 9:25 AM, Owen O'Malley <om...@apache.org> wrote:
> 
> 4. for ORC files older than ORC-101, test the default encoding instead of UTF-8

Is the default encoding of the creator of the file available?  If not, I don’t think you can reliably use the bloom filters from these files.

-dain