Posted to dev@parquet.apache.org by Zoltan Ivanfi <zi...@cloudera.com.INVALID> on 2019/07/01 12:20:01 UTC

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Hi,

I would like to clarify one point of my previous e-mail: while I reasoned
that for compressions and encodings we should avoid picking algorithms
superseded by better ones, I also reasoned that for bloom filters we do not
necessarily have to be as strict, because a reader that lacks an
implementation will still be able to read data from files that contain
unsupported bloom filter data structures.

Personally I'm fine with moving forward with the current hash proposal,
even if the chosen algorithm is not considered to be the best of its class.

Br,

Zoltan

On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <jb...@apache.org> wrote:

> On 2019/06/28 16:43:23, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> > I agree with Zoltan. Since we want to ensure compatibility, it would be
> > better to choose the best option now instead of making everyone support
> > two options forever.
>
> I'd guess there probably isn't a single best option. I suspect there's a
> tradeoff between ease of implementation and speed, for instance, since I
> expect it's easy to find an MD5 library in most programming languages and
> operating systems, yet MD5 is very slow compared to non-cryptographic hash
> functions designed for speed like xxhash.
>
> There's also a significant amount of variability across processor families
> (64-bit multiply-shift in ARM vs x86-64) or even different versions of the
> same processor family (CLHash in Haswell vs. Sandy Bridge). There are also
> quality tradeoffs that depend on the average byte length of the input (FNV
> vs vhash) or how much L1 cache the user wants to use for the hash function
> (tabulation hashing vs. multiply-shift).
>
> To deal with this level of ambiguity, I'd suggest that v1 should include a
> hash function that works well for certain common environments. As far as I
> know, murmur and xxhash would both fit that bill.
>
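To make the tradeoffs discussed above a little more concrete, here is a minimal sketch of two of the simple non-cryptographic hashes Jim names, FNV (in its FNV-1a 64-bit variant) and 64-bit multiply-shift. It is only an illustration: neither is the hash proposed for the Parquet bloom filter, the function names are made up for this example, and the multiplier in the demo is an arbitrary odd constant.

```python
# Illustrative sketch only: two simple non-cryptographic hashes mentioned in
# the thread. Neither one is the hash proposed for Parquet bloom filters.

MASK64 = (1 << 64) - 1

# Standard 64-bit FNV parameters.
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3


def fnv1a_64(data: bytes) -> int:
    """FNV-1a: XOR in one byte at a time, then multiply by the FNV prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64  # keep the state within 64 bits
    return h


def multiply_shift_64(key: int, multiplier: int, out_bits: int) -> int:
    """Multiply-shift: keep the top out_bits of (multiplier * key) mod 2**64.

    The multiplier is assumed to be a randomly chosen odd 64-bit integer.
    """
    if not 1 <= out_bits <= 64:
        raise ValueError("out_bits must be in [1, 64]")
    return ((key * multiplier) & MASK64) >> (64 - out_bits)


if __name__ == "__main__":
    print(hex(fnv1a_64(b"parquet")))
    # Hypothetical odd multiplier, chosen only for this demonstration.
    print(multiply_shift_64(12345, 0x9E3779B97F4A7C15, 10))
```

FNV-1a does one multiplication per input byte, so its cost grows with the key length, while multiply-shift does a single multiplication on a fixed-width integer key. That difference is one reason no single function is likely to be best for every input and platform.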

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Gidon Gershinsky <gg...@gmail.com>.

Hi Zoltan,

This was brought up at the sync today. There was a general consensus that
the encryption feature (spec and Thrift structures) should be released with
parquet-format 2.7.

Cheers, Gidon.


On Fri, Jul 5, 2019 at 3:35 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi,
>
> I just wanted to leave a comment on the pull request to update
> Encryption.md as well, but to my surprise it is not in master yet despite
> the vote for the encryption feature having passed 6 months ago. What are
> the plans for merging that? Should it be included in parquet-format 2.7?
>
> Thanks,
>
> Zoltan

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.

Hi,

I just wanted to leave a comment on the pull request to update
Encryption.md as well, but to my surprise it is not in master yet despite
the vote for the encryption feature having passed 6 months ago. What are
the plans for merging that? Should it be included in parquet-format 2.7?

Thanks,

Zoltan

On Fri, Jul 5, 2019 at 1:33 PM Zoltan Ivanfi <zi...@cloudera.com> wrote:

> Hi,
>
> I just noticed that yesterday I mistakenly assumed that the Bloom filter is
> part of the column chunk metadata, when in fact only its offset is stored
> there. In this case we definitely need to pay more attention to the
> encryption aspect, because it won't happen automatically.
>
> Br,
>
> Zoltan
>
> On Fri, Jul 5, 2019 at 1:09 PM 俊杰陈 <cj...@gmail.com> wrote:
>
>> That would be great, thank you.
>>
>> On Fri, Jul 5, 2019 at 5:40 PM Gidon Gershinsky <gg...@gmail.com> wrote:
>>
>> > Hi Junjie,
>> >
>> > I'd be glad to have a look at the encryption part. Will add my comments
>> > early next week.
>> >
>> > Cheers, Gidon.
>> >
>> > On Fri, Jul 5, 2019 at 12:16 PM 俊杰陈 <cj...@gmail.com> wrote:
>> >
>> > > Sorry, the latest file is
>> > > https://github.com/chenjunjiedada/parquet-format/blob/PARQUET-1617/BloomFilter.md
>> > >
>> > > On Fri, Jul 5, 2019 at 5:14 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > >
>> > > > Sure, please see this PR
>> > > > <https://github.com/apache/parquet-format/pull/140> or the updated file here
>> > > > <https://github.com/chenjunjiedada/parquet-format/blob/master/BloomFilter.md>.
>> > > >
>> > > > Thanks for reviewing the spec.
>> > > >
>> > > > On Thu, Jul 4, 2019 at 11:57 PM Zoltan Ivanfi <zi@cloudera.com.invalid> wrote:
>> > > >
>> > > >> Hi Junjie,
>> > > >>
>> > > >> I read through the specification and while I support the feature in
>> > > >> general, I find that the documentation may not be detailed enough to allow
>> > > >> developers of different language bindings to implement it. Specifically,
>> > > >> the Technical Approach section of the docs is very short and refers the
>> > > >> reader to two publications for details. I think the specification would
>> > > >> greatly benefit from including an explanation or a summary of the approach
>> > > >> in this section.
>> > > >>
>> > > >> The "Build a Bloom filter" section contains a formula for calculating the
>> > > >> optimal filter size for a desired false positive rate, but does not specify
>> > > >> what false positive rate implementations should target by default or how
>> > > >> they should make it configurable by users. I understand that this may be
>> > > >> an intentional omission, since targeting any false positive rate will
>> > > >> yield a specification-compliant result, but I still think it would be best
>> > > >> to provide some recommendation for the different language bindings.
>> > > >>
>> > > >> Since this feature is getting added after encryption, it should be briefly
>> > > >> but explicitly mentioned how it interacts with that (basically that it has
>> > > >> to be encrypted, otherwise it would leak sensitive information, but by
>> > > >> placing it inside the column chunk metadata, this is automatically taken
>> > > >> care of).
>> > > >>
>> > > >> Finally, as a nitpick, I would prefer in-line links to related materials
>> > > >> instead of numeric references that one must manually look up at the bottom
>> > > >> of the page.
>> > > >>
>> > > >> Could you please add these improvements to the specification?
>> > > >>
>> > > >> Thanks,
>> > > >>
>> > > >> Zoltan
>> > > >>
>> > > >> On Wed, Jul 3, 2019 at 4:03 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > > >>
>> > > >> > You are welcome, it's my honor.
>> > > >> >
>> > > >> > I think the PR <https://github.com/apache/parquet-format/pull/139>
>> > > >> > simply removes murmur3, which should express what I want.
>> > > >> >
>> > > >> > On Wed, Jul 3, 2019 at 9:53 PM Zoltan Ivanfi <zi@cloudera.com.invalid> wrote:
>> > > >> >
>> > > >> > > Hi Junjie,
>> > > >> > >
>> > > >> > > Thanks for the update and also for your endurance in going through this
>> > > >> > > tedious process in order to add bloom filtering to Parquet.
>> > > >> > >
>> > > >> > > I understand that your proposal is to go forward with xxHash instead of the
>> > > >> > > earlier murmur3, which you suggest deprecating. Since the murmur3 hash was
>> > > >> > > never released, I think it could be completely removed from the spec
>> > > >> > > instead of just getting deprecated. What is your opinion on this?
>> > > >> > >
>> > > >> > > Thanks,
>> > > >> > >
>> > > >> > > Zoltan
>> > > >> > >
>> > > >> > > On Wed, Jul 3, 2019 at 3:31 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > > >> > >
>> > > >> > > > I see, thanks for the guidance on this.
>> > > >> > > >
>> > > >> > > > Based on the discussion in this thread and some investigation into the
>> > > >> > > > changes needed in the current Java and C++ implementations, I think this
>> > > >> > > > is not hard to handle. So I propose to use xxHash (the XXH64 version) as
>> > > >> > > > the default hash strategy and deprecate the previous murmur3 hash.
>> > > >> > > >
>> > > >> > > > I will update the vote thread as well to make it clearer to all.
>> > > >> > > >
>> > > >> > > > On Wed, Jul 3, 2019 at 6:08 PM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:
>> > > >> > > > >
>> > > >> > > > > Hi Junjie,
>> > > >> > > > >
>> > > >> > > > > I think the vote is ambiguous in its current form (can people vote on one
>> > > >> > > > > option only or can they vote on both?) and has a low chance of getting
>> > > >> > > > > votes in general because it's not a yes/no question but a
>> > > >> > > > > choose-an-approach question instead. I think most contributors would accept
>> > > >> > > > > the hash chosen based on a community discussion but would be reluctant to
>> > > >> > > > > make that choice themselves in the form of a vote because it requires a much
>> > > >> > > > > deeper dive into the technical intricacies involved. The committers are
>> > > >> > > > > experienced in the parquet code base but may not be as experienced in bloom
>> > > >> > > > > filters as you are.
>> > > >> > > > >
>> > > >> > > > > In my opinion, to get bloom filtering into parquet-mr, you should convince
>> > > >> > > > > the committers that the proposal is viable by addressing their concerns
>> > > >> > > > > (which I believe you have done), and not by delegating the task of making
>> > > >> > > > > choices to them. I would suggest that you propose which one (or both) of
>> > > >> > > > > the hashes should be included, summarize your motivations in this thread
>> > > >> > > > > and if you don't get any objections for a day or two, call a YES/NO vote
>> > > >> > > > > for that specific proposal in a separate thread.
>> > > >> > > > >
>> > > >> > > > > Thanks,
>> > > >> > > > >
>> > > >> > > > > Zoltan
>> > > >> > > > >
>> > > >> > > > > On Tue, Jul 2, 2019 at 3:52 AM 俊杰陈 <cj...@gmail.com> wrote:
>> > > >> > > > >
>> > > >> > > > > > Any thoughts from other committers and developers?
>> > > >> > > > > >
>> > > >> > > > > > I'd like to start a vote first; you can provide your input either here
>> > > >> > > > > > or on the vote thread.
>> > > >> > > > > >
>> > > >> > > > > > --
>> > > >> > > > > > Thanks & Best Regards
>> > > >> > > > > >
>> > > >> > > >
>> > > >> > > >
>> > > >> > > >
>> > > >> > > > --
>> > > >> > > > Thanks & Best Regards
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >> >
>> > > >> > --
>> > > >> > Thanks & Best Regards
>> > > >> >
>> > > >>
>> > > >
>> > > >
>> > > > --
>> > > > Thanks & Best Regards
>> > > >
>> > >
>> > >
>> > > --
>> > > Thanks & Best Regards
>> > >
>> >
>>
>>
>> --
>> Thanks & Best Regards
>>
>
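
The review quoted above mentions a formula for choosing the filter size from a desired false positive rate. As a rough illustration of that kind of calculation, the sketch below uses the textbook sizing for a classic Bloom filter: m = -n ln(p) / (ln 2)^2 bits for n distinct values at false positive rate p, with k = (m / n) ln 2 hash functions. This is an assumption made for illustration; the thread does not quote the formula in BloomFilter.md, which may differ, and the function names here are hypothetical.

```python
import math


def classic_bloom_bits(num_distinct: int, fpp: float) -> int:
    """Bits needed by a classic Bloom filter (textbook formula, not the spec's)."""
    if num_distinct <= 0:
        raise ValueError("num_distinct must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return math.ceil(-num_distinct * math.log(fpp) / (math.log(2) ** 2))


def classic_bloom_hash_count(num_bits: int, num_distinct: int) -> int:
    """Number of hash functions that minimizes the false positive rate."""
    return max(1, round(num_bits / num_distinct * math.log(2)))


if __name__ == "__main__":
    # Example inputs are arbitrary: 1000 distinct values, 1% target rate.
    bits = classic_bloom_bits(1000, 0.01)
    print(bits, classic_bloom_hash_count(bits, 1000))
```

Because the required size depends on the target rate, a recommended default rate, as the review asks for, would give different language bindings a common starting point for sizing.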