Posted to dev@parquet.apache.org by Zoltan Ivanfi <zi...@cloudera.com.INVALID> on 2019/07/01 12:20:01 UTC

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Hi,

I would like to clarify one point of my previous e-mail: While I reasoned
that for compressions and encodings we should avoid picking algorithms
superseded by better ones, I also reasoned that for bloom filters we do not
necessarily have to be as strict, because a reader with missing
implementation will still be able to read data from files that contain
unsupported bloom filter data structures.

Personally I'm fine with moving forward with the current hash proposal,
even if the chosen algorithm is not considered to be the best of its class.

Br,

Zoltan

On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <jb...@apache.org> wrote:

> On 2019/06/28 16:43:23, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> > I agree with Zoltan. Since we want to ensure compatibility, it would be
> > better to choose the best option now instead of making everyone support
> > two options forever.
>
> I'd guess there probably isn't a single best option. I suspect there's a
> tradeoff between ease of implementation and speed, for instance, since I
> expect it's easy to find an MD5 library in most programming languages and
> operating systems, yet MD5 is very slow compared to non-cryptographic hash
> functions designed for speed like xxhash.
>
> There's also a significant amount of variability across processor families
> (64-bit multiply-shift in ARM vs x86-64) or even different versions of the
> same processor family (CLHash in Haswell vs. Sandy Bridge). There are also
> quality tradeoffs that depend on the average byte length of the input (FNV
> vs vhash) or how much L1 cache the user wants to use for the hash function
> (tabulation hashing vs. multiply-shift).
>
> To deal with this level of ambiguity, I'd suggest that v1 should include a
> hash function that works well for certain common environments. As far as I
> know, murmur and xxhash would both fit that bill.
>
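
The compatibility point made at the top of this message (a reader without an implementation can still read the data) can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the names `FilterHeader`, `SUPPORTED_HASHES`, `build_filter` and `might_contain` are hypothetical, the filter is a textbook Bloom filter rather than the structure in the proposed specification, and MD5 is used only because it ships with the standard library (the thread itself notes that it is slow).

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


def _md5_64(data: bytes) -> int:
    """64-bit hash taken from MD5; a stdlib stand-in, not the hash proposed in the thread."""
    return int.from_bytes(hashlib.md5(data).digest()[:8], "little")


# Hypothetical registry of the hash algorithms this particular reader implements.
SUPPORTED_HASHES: Dict[str, Callable[[bytes], int]] = {"MD5_64": _md5_64}


@dataclass
class FilterHeader:
    """Illustrative filter: the hash name, the bitset size, and the bitset as an int."""
    hash_name: str
    num_bits: int
    bits: int = 0


def _positions(h: int, num_bits: int, k: int = 3) -> List[int]:
    """Derive k probe positions from one 64-bit hash via double hashing."""
    h1, h2 = h & 0xFFFFFFFF, h >> 32
    return [(h1 + i * h2) % num_bits for i in range(k)]


def build_filter(hash_name: str, num_bits: int, values: Iterable[bytes]) -> FilterHeader:
    """Writer side: set the probe bits for every value."""
    hash_fn = SUPPORTED_HASHES[hash_name]
    f = FilterHeader(hash_name, num_bits)
    for v in values:
        for p in _positions(hash_fn(v), num_bits):
            f.bits |= 1 << p
    return f


def might_contain(f: Optional[FilterHeader], value: bytes) -> bool:
    """Reader side: return False only when the filter proves the value is absent."""
    if f is None:
        return True  # no filter written: the data must be read
    hash_fn = SUPPORTED_HASHES.get(f.hash_name)
    if hash_fn is None:
        return True  # unsupported hash: ignore the filter and read the data
    return all((f.bits >> p) & 1 for p in _positions(hash_fn(value), f.num_bits))
```

With this shape, a filter written with an unknown hash costs the reader only the skipping optimization, never correctness, which is why the choice of hash is less binding than the choice of a compression codec or encoding.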

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hi Zoltan,

This was brought up at the sync today; there was a general consensus that
the encryption (spec and Thrift structures) should be released with
parquet-format 2.7.

Cheers, Gidon.


On Fri, Jul 5, 2019 at 3:35 PM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:

> Hi,
>
> I just wanted to leave a comment on the pull request to update
> Encryption.md as well, but to my surprise it is not in master yet despite
> the vote for the encryption feature having passed 6 months ago. What are
> the plans for merging that? Should it be included in parquet-format 2.7?
>
> Thanks,
>
> Zoltan
>

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi,

I just wanted to leave a comment on the pull request to update
Encryption.md as well, but to my surprise it is not in master yet despite
the vote for the encryption feature having passed 6 months ago. What are
the plans for merging that? Should it be included in parquet-format 2.7?

Thanks,

Zoltan

On Fri, Jul 5, 2019 at 1:33 PM Zoltan Ivanfi <zi...@cloudera.com> wrote:

> Hi,
>
> I just noticed that yesterday I misunderstood that the Bloom filter is a
> part of the column chunk metadata, when in fact it is only the offset of it
> that is stored there. In this case we definitely need to pay more attention
> to the encryption aspect because it won't happen automatically.
>
> Br,
>
> Zoltan
>
> On Fri, Jul 5, 2019 at 1:09 PM 俊杰陈 <cj...@gmail.com> wrote:
>
>> That would be great, thank you.
>>
>> On Fri, Jul 5, 2019 at 5:40 PM Gidon Gershinsky <gg...@gmail.com> wrote:
>>
>> > Hi Junjie,
>> >
>> > I'd be glad to have a look at the encryption part. Will add my comments
>> > early next week.
>> >
>> > Cheers, Gidon.
>> >
>> > On Fri, Jul 5, 2019 at 12:16 PM 俊杰陈 <cj...@gmail.com> wrote:
>> >
>> > > Sorry, the latest file is
>> > > https://github.com/chenjunjiedada/parquet-format/blob/PARQUET-1617/BloomFilter.md
>> > > .
>> > >
>> > > On Fri, Jul 5, 2019 at 5:14 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > >
>> > > > Sure, please see this PR
>> > > > <https://github.com/apache/parquet-format/pull/140> or the updated file here
>> > > > <https://github.com/chenjunjiedada/parquet-format/blob/master/BloomFilter.md>.
>> > > >
>> > > > Thanks for reviewing the spec.
>> > > >
>> > > > On Thu, Jul 4, 2019 at 11:57 PM Zoltan Ivanfi <zi@cloudera.com.invalid> wrote:
>> > > >
>> > > >> Hi Junjie,
>> > > >>
>> > > >> I read through the specification and while I support the feature in
>> > > >> general, I find that the documentation may not be detailed enough to allow
>> > > >> developers of different language bindings to implement it. Specifically,
>> > > >> the Technical Approach section of the docs is very short and refers the
>> > > >> reader to two publications for details. I think the specification would
>> > > >> greatly benefit from including an explanation or a summary of the approach
>> > > >> in this section.
>> > > >>
>> > > >> The "Build a Bloom filter" section contains a formula for calculating the
>> > > >> optimal filter size for a desired false positive rate, but does not specify
>> > > >> what false positive rate implementations should target by default or how
>> > > >> they should make it configurable by users. I understand that this may be
>> > > >> an intentional omission, since targeting any false positive rate will
>> > > >> produce a specification-compliant result, but I still think it would be
>> > > >> best to provide some recommendation for the different language bindings.
>> > > >>
>> > > >> Since this feature is getting added after encryption, it should be briefly
>> > > >> but explicitly mentioned how it interacts with that (basically that it has
>> > > >> to be encrypted, otherwise it would leak sensitive information, but by
>> > > >> placing it inside the column chunk metadata, this is automatically taken
>> > > >> care of).
>> > > >>
>> > > >> Finally, as a nitpick, I would prefer in-line links to related materials
>> > > >> instead of numeric references that one must manually look up at the bottom
>> > > >> of the page.
>> > > >>
>> > > >> Could you please add these improvements to the specification?
>> > > >>
>> > > >> Thanks,
>> > > >>
>> > > >> Zoltan
>> > > >>
>> > > >> On Wed, Jul 3, 2019 at 4:03 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > > >>
>> > > >> > You are welcome, it's my honor.
>> > > >> >
>> > > >> > I think the PR <https://github.com/apache/parquet-format/pull/139> just
>> > > >> > removes murmur3; that should express what I want.
>> > > >> >
>> > > >> > On Wed, Jul 3, 2019 at 9:53 PM Zoltan Ivanfi <zi@cloudera.com.invalid> wrote:
>> > > >> >
>> > > >> > > Hi Junjie,
>> > > >> > >
>> > > >> > > Thanks for the update and also for your endurance in going through this
>> > > >> > > tedious process in order to add bloom filtering to Parquet.
>> > > >> > >
>> > > >> > > I understand that your proposal is to go forward with xxHash instead of the
>> > > >> > > earlier murmur3, which you suggest to deprecate. Since the murmur3 hash was
>> > > >> > > never released, I think it could be completely removed from the spec
>> > > >> > > instead of just getting deprecated. What is your opinion on this?
>> > > >> > >
>> > > >> > > Thanks,
>> > > >> > >
>> > > >> > > Zoltan
>> > > >> > >
>> > > >> > > On Wed, Jul 3, 2019 at 3:31 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > > >> > >
>> > > >> > > > I see, thanks for guiding on this.
>> > > >> > > >
>> > > >> > > > Per the discussion in this thread and some investigation of the changes
>> > > >> > > > needed in the current Java and C++ implementations, I think that is not
>> > > >> > > > hard to handle. So I propose to use xxHash (the XXH64 version) as the
>> > > >> > > > default hash strategy and deprecate the previous murmur3 hash.
>> > > >> > > >
>> > > >> > > > I will update the vote thread as well to make it clearer to all.
>> > > >> > > >
>> > > >> > > >
>> > > >> > > > On Wed, Jul 3, 2019 at 6:08 PM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:
>> > > >> > > > >
>> > > >> > > > > Hi Junjie,
>> > > >> > > > >
>> > > >> > > > > I think the vote is ambiguous in its current form (can people vote on one
>> > > >> > > > > option only or can they vote on both?) and has a low chance of getting
>> > > >> > > > > votes in general because it's not a yes/no question but a
>> > > >> > > > > choose-an-approach question instead. I think most contributors would accept
>> > > >> > > > > the hash chosen based on a community discussion but would be reluctant to
>> > > >> > > > > make that choice themselves in the form of a vote because it requires a much
>> > > >> > > > > deeper dive into the technical intricacies involved. The committers are
>> > > >> > > > > experienced in the parquet code base but may not be as experienced in bloom
>> > > >> > > > > filters as you are.
>> > > >> > > > >
>> > > >> > > > > In my opinion, to get bloom filtering into parquet-mr, you should convince
>> > > >> > > > > the committers that the proposal is viable by addressing their concerns
>> > > >> > > > > (which I believe you have done), and not by delegating the task of making
>> > > >> > > > > choices to them. I would suggest that you propose which one (or both) of
>> > > >> > > > > the hashes should be included, summarize your motivations in this thread
>> > > >> > > > > and if you don't get any objections for a day or two, call a YES/NO vote
>> > > >> > > > > for that specific proposal in a separate thread.
>> > > >> > > > >
>> > > >> > > > > Thanks,
>> > > >> > > > >
>> > > >> > > > > Zoltan
>> > > >> > > > >
>> > > >> > > > > On Tue, Jul 2, 2019 at 3:52 AM 俊杰陈 <cj...@gmail.com> wrote:
>> > > >> > > > >
>> > > >> > > > > > Any thoughts from other committers and developers?
>> > > >> > > > > >
>> > > >> > > > > > I'd like to start a vote first; you could provide your input either
>> > > >> > > > > > here or on the vote thread.
>> > > >> > > > > >
>> > > >> > > > > >
>> > > >> > > > > >
>> > > >> > > > > > On Mon, Jul 1, 2019 at 8:20 PM Zoltan Ivanfi <zi@cloudera.com.invalid> wrote:
>> > > >> > > > > >
>> > > >> > > > > > > Hi,
>> > > >> > > > > > >
>> > > >> > > > > > > I would like to clarify one point of my previous e-mail: While I reasoned
>> > > >> > > > > > > that for compressions and encodings we should avoid picking algorithms
>> > > >> > > > > > > superseded by better ones, I also reasoned that for bloom filters we do not
>> > > >> > > > > > > necessarily have to be as strict, because a reader with missing
>> > > >> > > > > > > implementation will still be able to read data from files that contain
>> > > >> > > > > > > unsupported bloom filter data structures.
>> > > >> > > > > > >
>> > > >> > > > > > > Personally I'm fine with moving forward with the current hash proposal,
>> > > >> > > > > > > even if the chosen algorithm is not considered to be the best of its class.
>> > > >> > > > > > >
>> > > >> > > > > > > Br,
>> > > >> > > > > > >
>> > > >> > > > > > > Zoltan
>> > > >> > > > > > >
>> > > >> > > > > > > On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <jbapple@apache.org> wrote:
>> > > >> > > > > > >
>> > > >> > > > > > > > On 2019/06/28 16:43:23, Ryan Blue <rblue@netflix.com.INVALID> wrote:
>> > > >> > > > > > > > > I agree with Zoltan. Since we want to ensure compatibility, it would be
>> > > >> > > > > > > > > better to choose the best option now instead of making everyone support
>> > > >> > > > > > > > > two options forever.
>> > > >> > > > > > > >
>> > > >> > > > > > > > I'd guess there probably isn't a single best option. I suspect there's a
>> > > >> > > > > > > > tradeoff between ease of implementation and speed, for instance, since I
>> > > >> > > > > > > > expect it's easy to find an MD5 library in most programming languages and
>> > > >> > > > > > > > operating systems, yet MD5 is very slow compared to non-cryptographic hash
>> > > >> > > > > > > > functions designed for speed like xxhash.
>> > > >> > > > > > > >
>> > > >> > > > > > > > There's also a significant amount of variability across processor families
>> > > >> > > > > > > > (64-bit multiply-shift in ARM vs x86-64) or even different versions of the
>> > > >> > > > > > > > same processor family (CLHash in Haswell vs. Sandy Bridge). There are also
>> > > >> > > > > > > > quality tradeoffs that depend on the average byte length of the input (FNV
>> > > >> > > > > > > > vs vhash) or how much L1 cache the user wants to use for the hash function
>> > > >> > > > > > > > (tabulation hashing vs. multiply-shift).
>> > > >> > > > > > > >
>> > > >> > > > > > > > To deal with this level of ambiguity, I'd suggest that v1 should include a
>> > > >> > > > > > > > hash function that works well for certain common environments. As far as I
>> > > >> > > > > > > > know, murmur and xxhash would both fit that bill.
>> > > >> > > > > > > >
>> > > >> > > > > > >
>> > > >> > > > > >
>> > > >> > > > > >
>> > > >> > > > > > --
>> > > >> > > > > > Thanks & Best Regards
>> > > >> > > > > >
>> > > >> > > >
>> > > >> > > >
>> > > >> > > >
>> > > >> > > > --
>> > > >> > > > Thanks & Best Regards
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >> >
>> > > >> > --
>> > > >> > Thanks & Best Regards
>> > > >> >
>> > > >>
>> > > >
>> > > >
>> > > > --
>> > > > Thanks & Best Regards
>> > > >
>> > >
>> > >
>> > > --
>> > > Thanks & Best Regards
>> > >
>> >
>>
>>
>> --
>> Thanks & Best Regards
>>
>
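
The review quoted above mentions a formula for the optimal filter size at a desired false positive rate, and asks what rate implementations should target by default. As a rough illustration of that trade-off, here is a Python sketch using the textbook sizing formula for a classic Bloom filter. It is an assumption that this resembles the formula in the proposed BloomFilter.md, which may differ, and the function names and the 1% example rate are hypothetical rather than a recommendation from the thread.

```python
import math


def classic_bloom_size_bits(n_distinct: int, fpp: float) -> int:
    """Bits needed by a classic Bloom filter: m = -n * ln(p) / (ln 2)^2."""
    if n_distinct <= 0:
        raise ValueError("n_distinct must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return math.ceil(-n_distinct * math.log(fpp) / (math.log(2) ** 2))


def optimal_num_hashes(num_bits: int, n_distinct: int) -> int:
    """Number of hash probes that minimizes the false positive rate: k = (m / n) * ln 2."""
    return max(1, round(num_bits / n_distinct * math.log(2)))


if __name__ == "__main__":
    # Illustrative only: how the size grows as the target rate tightens.
    for fpp in (0.1, 0.01, 0.001):
        m = classic_bloom_size_bits(1000, fpp)
        print(f"fpp={fpp}: {m} bits, k={optimal_num_hashes(m, 1000)}")
```

Any target rate yields a valid filter; a lower rate only costs more space per distinct value, which is why a default is a recommendation rather than a compatibility requirement.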

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi,

I just noticed that yesterday I misunderstood that the Bloom filter is a
part of the column chunk metadata, when in fact it is only the offset of it
that is stored there. In this case we definitely need to pay more attention
to the encryption aspect because it won't happen automatically.

Br,

Zoltan

On Fri, Jul 5, 2019 at 1:09 PM 俊杰陈 <cj...@gmail.com> wrote:

> That would be great, thank you.
>
> On Fri, Jul 5, 2019 at 5:40 PM Gidon Gershinsky <gg...@gmail.com> wrote:
>
> > Hi Junjie,
> >
> > I'd be glad to have a look at the encryption part. Will add my comments
> > early next week.
> >
> > Cheers, Gidon.
> >
> > On Fri, Jul 5, 2019 at 12:16 PM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > > > Sorry, the latest file is
> > > > https://github.com/chenjunjiedada/parquet-format/blob/PARQUET-1617/BloomFilter.md
> > > > .
> > >
> > > On Fri, Jul 5, 2019 at 5:14 PM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > > > Sure, please see this PR
> > > > > <https://github.com/apache/parquet-format/pull/140> or the updated file here
> > > > > <https://github.com/chenjunjiedada/parquet-format/blob/master/BloomFilter.md>.
> > > > >
> > > > > Thanks for reviewing the spec.
> > > > >
> > > > > On Thu, Jul 4, 2019 at 11:57 PM Zoltan Ivanfi <zi@cloudera.com.invalid> wrote:
> > > > >
> > > > >> Hi Junjie,
> > > > >>
> > > > >> I read through the specification and while I support the feature in
> > > > >> general, I find that the documentation may not be detailed enough to allow
> > > > >> developers of different language bindings to implement it. Specifically,
> > > > >> the Technical Approach section of the docs is very short and refers the
> > > > >> reader to two publications for details. I think the specification would
> > > > >> greatly benefit from including an explanation or a summary of the approach
> > > > >> in this section.
> > > > >>
> > > > >> The "Build a Bloom filter" section contains a formula for calculating the
> > > > >> optimal filter size for a desired false positive rate, but does not specify
> > > > >> what false positive rate implementations should target by default or how
> > > > >> they should make it configurable by users. I understand that this may be
> > > > >> an intentional omission, since targeting any false positive rate will
> > > > >> produce a specification-compliant result, but I still think it would be
> > > > >> best to provide some recommendation for the different language bindings.
> > > > >>
> > > > >> Since this feature is getting added after encryption, it should be briefly
> > > > >> but explicitly mentioned how it interacts with that (basically that it has
> > > > >> to be encrypted, otherwise it would leak sensitive information, but by
> > > > >> placing it inside the column chunk metadata, this is automatically taken
> > > > >> care of).
> > > > >>
> > > > >> Finally, as a nitpick, I would prefer in-line links to related materials
> > > > >> instead of numeric references that one must manually look up at the bottom
> > > > >> of the page.
> > > >>
> > > >> Could you please add these improvements to the specification?
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Zoltan
> > > >>
> > > >> On Wed, Jul 3, 2019 at 4:03 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > >>
> > > > >> > You are welcome, it's my honor.
> > > > >> >
> > > > >> > I think the PR <https://github.com/apache/parquet-format/pull/139> just
> > > > >> > removes murmur3; that should express what I want.
> > > > >> >
> > > > >> > On Wed, Jul 3, 2019 at 9:53 PM Zoltan Ivanfi <zi@cloudera.com.invalid> wrote:
> > > > >> >
> > > > >> > > Hi Junjie,
> > > > >> > >
> > > > >> > > Thanks for the update and also for your endurance in going through this
> > > > >> > > tedious process in order to add bloom filtering to Parquet.
> > > > >> > >
> > > > >> > > I understand that your proposal is to go forward with xxHash instead of the
> > > > >> > > earlier murmur3, which you suggest to deprecate. Since the murmur3 hash was
> > > > >> > > never released, I think it could be completely removed from the spec
> > > > >> > > instead of just getting deprecated. What is your opinion on this?
> > > >> > >
> > > >> > > Thanks,
> > > >> > >
> > > >> > > Zoltan
> > > >> > >
> > > >> > > On Wed, Jul 3, 2019 at 3:31 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > >> > >
> > > >> > > > I see, thanks for guiding on this.
> > > >> > > >
> > > >> > > > Per discussion in this thread and some investigation about
> > changes
> > > >> on
> > > >> > > > current java and c++ implementation, and I think that is not
> > hard
> > > to
> > > >> > > > handle. So I propose to use xxHash (the XXH64 version) as the
> > > >> default
> > > >> > > > hash strategy and deprecate previous murmur3 hash.
> > > >> > > >
> > > >> > > > I will update vote thread as well to make it clearer to all.
> > > >> > > >
> > > >> > > >
> > > >> > > > On Wed, Jul 3, 2019 at 6:08 PM Zoltan Ivanfi
> > > >> <zi...@cloudera.com.invalid>
> > > >> > > > wrote:
> > > >> > > > >
> > > >> > > > > Hi Junjie,
> > > >> > > > >
> > > >> > > > > I think the vote is ambigous in its current form (can people
> > > vote
> > > >> on
> > > >> > > one
> > > >> > > > > option only or can they vote on both?) and has a low chance
> of
> > > >> > getting
> > > >> > > > > votes in general because it's not a yes/no question but a
> > > >> > > > > choose-an-approach question instead. I think most
> contributors
> > > >> would
> > > >> > > > accept
> > > >> > > > > the hash chosen based on a community discussion but would be
> > > >> > reluctant
> > > >> > > to
> > > >> > > > > make that choice themselves in the form a vote because it
> > > >> requires a
> > > >> > > much
> > > >> > > > > deeper dive into the technical intricacies involved. The
> > > >> committers
> > > >> > are
> > > >> > > > > experienced in the parquet code base but may not be as
> > > >> experienced in
> > > >> > > > bloom
> > > >> > > > > filters as you are.
> > > >> > > > >
> > > >> > > > > In my opinion, to get bloom filtering into parquet-mr, you
> > > should
> > > >> > > > convince
> > > >> > > > > the committers that the proposal is viable by addressing
> their
> > > >> > concerns
> > > >> > > > > (which I believe you have done), and not by delegating the
> > task
> > > of
> > > >> > > making
> > > >> > > > > choices to them. I would suggest that you propose which one
> > (or
> > > >> both)
> > > >> > > of
> > > >> > > > > the hashes should be included, summarize your motivations in
> > > this
> > > >> > > thread
> > > >> > > > > and if you don't get any objections for a day or two, call a
> > > >> YES/NO
> > > >> > > vote
> > > >> > > > > for that specific proposal in a separate thread.
> > > >> > > > >
> > > >> > > > > Thanks,
> > > >> > > > >
> > > >> > > > > Zoltan
> > > >> > > > >
> > > >> > > > > On Tue, Jul 2, 2019 at 3:52 AM 俊杰陈 <cj...@gmail.com>
> > wrote:
> > > >> > > > >
> > > >> > > > > > Any thoughts from other committers and developers?
> > > >> > > > > >
> > > >> > > > > > I 'd like to start a vote firstly, you could either
> provide
> > > your
> > > >> > > input
> > > >> > > > here
> > > >> > > > > > or on vote thread.
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > On Mon, Jul 1, 2019 at 8:20 PM Zoltan Ivanfi
> > > >> > <zi@cloudera.com.invalid
> > > >> > > >
> > > >> > > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > > Hi,
> > > >> > > > > > >
> > > >> > > > > > > I would like to clarify one point of my previous e-mail:
> > > >> While I
> > > >> > > > reasoned
> > > >> > > > > > > that for compressions and encodings we should avoid
> > picking
> > > >> > > > algorithms
> > > >> > > > > > > superseded by better ones, I also reasoned that for
> bloom
> > > >> filters
> > > >> > > we
> > > >> > > > do
> > > >> > > > > > not
> > > >> > > > > > > necessarily have to be as strict, because a reader with
> > > >> missing
> > > >> > > > > > > implementation will still be able to read data from
> files
> > > that
> > > >> > > > contain
> > > >> > > > > > > unsupported bloom filter data structures.
> > > >> > > > > > >
> > > >> > > > > > > Personally I'm fine with moving forward with the current
> > > hash
> > > >> > > > proposal,
> > > >> > > > > > > even if the chosen algorithm is not considered to be the
> > > best
> > > >> of
> > > >> > > its
> > > >> > > > > > class.
> > > >> > > > > > >
> > > >> > > > > > > Br,
> > > >> > > > > > >
> > > >> > > > > > > Zoltan
> > > >> > > > > > >
> > > >> > > > > > > On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <
> > > >> jbapple@apache.org>
> > > >> > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > > > On 2019/06/28 16:43:23, Ryan Blue
> > > <rblue@netflix.com.INVALID
> > > >> >
> > > >> > > > wrote:
> > > >> > > > > > > > > I agree with Zoltan. Since we want to ensure
> > > >> compatibility,
> > > >> > it
> > > >> > > > would
> > > >> > > > > > be
> > > >> > > > > > > > > better to choose the best option now instead of
> making
> > > >> > everyone
> > > >> > > > > > support
> > > >> > > > > > > > two
> > > >> > > > > > > > > options forever.
> > > >> > > > > > > >
> > > >> > > > > > > > I'd guess there probably isn't a single best option. I
> > > >> suspect
> > > >> > > > there's
> > > >> > > > > > a
> > > >> > > > > > > > tradeoff between ease of implementation and speed, for
> > > >> > instance,
> > > >> > > > since
> > > >> > > > > > I
> > > >> > > > > > > > expect it's easy to find an MD5 library in most
> > > programming
> > > >> > > > languages
> > > >> > > > > > and
> > > >> > > > > > > > operating systems, yet MD5 is very slow compared to
> > > >> > > > non-cryptographic
> > > >> > > > > > > hash
> > > >> > > > > > > > functions designed for speed like xxhash.
> > > >> > > > > > > >
> > > >> > > > > > > > There's also a significant amount of variability
> across
> > > >> > processor
> > > >> > > > > > > families
> > > >> > > > > > > > (64-bit multiply-shift in ARM vs x86-64) or even
> > different
> > > >> > > > versions of
> > > >> > > > > > > the
> > > >> > > > > > > > same processor family (CLHash in Haswell vs. Sandy
> > Lake).
> > > >> There
> > > >> > > are
> > > >> > > > > > also
> > > >> > > > > > > > quality tradeoffs that depend on the average bye
> length
> > of
> > > >> the
> > > >> > > > input
> > > >> > > > > > (FNV
> > > >> > > > > > > > vs vhash) or how much L1 cache the user wants to use
> for
> > > the
> > > >> > hash
> > > >> > > > > > > function
> > > >> > > > > > > > (tabulation hashing vs. multiply-shift).
> > > >> > > > > > > >
> > > >> > > > > > > > To deal with this level of ambiguity, I'd suggest that
> > v1
> > > >> > should
> > > >> > > > > > include
> > > >> > > > > > > a
> > > >> > > > > > > > hash function that works well for certain common
> > > >> environments.
> > > >> > As
> > > >> > > > far
> > > >> > > > > > as
> > > >> > > > > > > I
> > > >> > > > > > > > know, murmur and xxhash would both fit that bill.
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > --
> > > >> > > > > > Thanks & Best Regards
> > > >> > > > > >
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > Thanks & Best Regards
> > > >> > > >
> > > >> > >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > Thanks & Best Regards
> > > >> >
> > > >>
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
> >
>
>
> --
> Thanks & Best Regards
>

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.
That would be great, thank you.

On Fri, Jul 5, 2019 at 5:40 PM Gidon Gershinsky <gg...@gmail.com> wrote:

> Hi Junjie,
>
> I'd be glad to have a look at the encryption part. Will add my comments
> early next week.
>
> Cheers, Gidon.


-- 
Thanks & Best Regards

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hi Junjie,

I'd be glad to have a look at the encryption part. Will add my comments
early next week.

Cheers, Gidon.

On Fri, Jul 5, 2019 at 12:16 PM 俊杰陈 <cj...@gmail.com> wrote:

> Sorry, the latest file is
>
> https://github.com/chenjunjiedada/parquet-format/blob/PARQUET-1617/BloomFilter.md
> .

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.
Sorry, the latest file is
https://github.com/chenjunjiedada/parquet-format/blob/PARQUET-1617/BloomFilter.md
.

On Fri, Jul 5, 2019 at 5:14 PM 俊杰陈 <cj...@gmail.com> wrote:

> Sure, please see this PR
> <https://github.com/apache/parquet-format/pull/140> or the updated file here
> <https://github.com/chenjunjiedada/parquet-format/blob/master/BloomFilter.md>
> .
>
> Thanks for reviewing the spec.
>
> On Thu, Jul 4, 2019 at 11:57 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> wrote:
>
>> Hi Junjie,
>>
>> I read through the specification and while I support the feature in
>> general, I find that the documentation may not be detailed enough to allow
>> developers of different language bindings to implement it. Specifically,
>> the Technical Approach section of the docs is very short and refers the
>> reader to two publications for details. I think the specification would
>> greatly benefit from including an explanation or a summary of the approach
>> in this section.
>>
>> The "Build a Bloom filter" section contains a formula for calculating the
>> optimal filter size for a desired false positive rate, but does not specify
>> what false positive rate implementations should target by default or how
>> they should make it configurable by users. I understand that this may be
>> an intentional omission, since targeting any false positive rate will
>> produce a specification-compliant result; still, I think it would be best
>> to provide some recommendation for the different language bindings.
>>
>> Since this feature is getting added after encryption, it should be briefly
>> but explicitly mentioned how it interacts with that (basically that it has
>> to be encrypted, otherwise it would leak sensitive information, but by
>> placing it inside the column chunk metadata, this is automatically taken
>> care of).
>>
>> Finally, as a nitpick, I would prefer in-line links to related materials
>> instead of numeric references that one must manually look up at the bottom
>> of the page.
>>
>> Could you please add these improvements to the specification?
>>
>> Thanks,
>>
>> Zoltan
>>
>> On Wed, Jul 3, 2019 at 4:03 PM 俊杰陈 <cj...@gmail.com> wrote:
>>
>> > You are welcome, it's my honor.
>> >
>> > I think the PR <https://github.com/apache/parquet-format/pull/139> just
>> > removes murmur3; that should express what I want.
>> >
>> >
>> >
>> >
>> > On Wed, Jul 3, 2019 at 9:53 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
>> > wrote:
>> >
>> > > Hi Junjie,
>> > >
>> > > Thanks for the update and also for your endurance in going through this
>> > > tedious process in order to add bloom filtering to Parquet.
>> > >
>> > > I understand that your proposal is to go forward with xxHash instead of
>> > > the earlier murmur3, which you suggest deprecating. Since the murmur3
>> > > hash was never released, I think it could be completely removed from
>> > > the spec instead of just getting deprecated. What is your opinion on this?
>> > >
>> > > Thanks,
>> > >
>> > > Zoltan
>> > >
>> > > On Wed, Jul 3, 2019 at 3:31 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > >
>> > > > I see, thanks for the guidance on this.
>> > > >
>> > > > Based on the discussion in this thread and some investigation of the
>> > > > changes needed in the current Java and C++ implementations, I think
>> > > > this is not hard to handle. So I propose to use xxHash (the XXH64
>> > > > version) as the default hash strategy and deprecate the previous
>> > > > murmur3 hash.
>> > > >
>> > > > I will update the vote thread as well to make it clearer to all.
>> > > >
>> > > >
>> > > > On Wed, Jul 3, 2019 at 6:08 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
>> > > > wrote:
>> > > > >
>> > > > > Hi Junjie,
>> > > > >
>> > > > > I think the vote is ambiguous in its current form (can people vote
>> on
>> > > one
>> > > > > option only or can they vote on both?) and has a low chance of
>> > getting
>> > > > > votes in general because it's not a yes/no question but a
>> > > > > choose-an-approach question instead. I think most contributors
>> would
>> > > > accept
>> > > > > the hash chosen based on a community discussion but would be
>> > reluctant
>> > > to
>> > > > > make that choice themselves in the form of a vote because it
>> requires a
>> > > much
>> > > > > deeper dive into the technical intricacies involved. The
>> committers
>> > are
>> > > > > experienced in the parquet code base but may not be as
>> experienced in
>> > > > bloom
>> > > > > filters as you are.
>> > > > >
>> > > > > In my opinion, to get bloom filtering into parquet-mr, you should
>> > > > convince
>> > > > > the committers that the proposal is viable by addressing their
>> > concerns
>> > > > > (which I believe you have done), and not by delegating the task of
>> > > making
>> > > > > choices to them. I would suggest that you propose which one (or
>> both)
>> > > of
>> > > > > the hashes should be included, summarize your motivations in this
>> > > thread
>> > > > > and if you don't get any objections for a day or two, call a
>> YES/NO
>> > > vote
>> > > > > for that specific proposal in a separate thread.
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Zoltan
>> > > > >
>> > > > > On Tue, Jul 2, 2019 at 3:52 AM 俊杰陈 <cj...@gmail.com> wrote:
>> > > > >
>> > > > > > Any thoughts from other committers and developers?
>> > > > > >
>> > > > > > I'd like to start a vote first; you could either provide your
>> > > input
>> > > > here
>> > > > > > or on vote thread.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Mon, Jul 1, 2019 at 8:20 PM Zoltan Ivanfi
>> > <zi@cloudera.com.invalid
>> > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi,
>> > > > > > >
>> > > > > > > I would like to clarify one point of my previous e-mail:
>> While I
>> > > > reasoned
>> > > > > > > that for compressions and encodings we should avoid picking
>> > > > algorithms
>> > > > > > > superseded by better ones, I also reasoned that for bloom
>> filters
>> > > we
>> > > > do
>> > > > > > not
>> > > > > > > necessarily have to be as strict, because a reader with
>> missing
>> > > > > > > implementation will still be able to read data from files that
>> > > > contain
>> > > > > > > unsupported bloom filter data structures.
>> > > > > > >
>> > > > > > > Personally I'm fine with moving forward with the current hash
>> > > > proposal,
>> > > > > > > even if the chosen algorithm is not considered to be the best
>> of
>> > > its
>> > > > > > class.
>> > > > > > >
>> > > > > > > Br,
>> > > > > > >
>> > > > > > > Zoltan
>> > > > > > >
>> > > > > > > On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <
>> jbapple@apache.org>
>> > > > wrote:
>> > > > > > >
>> > > > > > > > On 2019/06/28 16:43:23, Ryan Blue <rblue@netflix.com.INVALID
>> >
>> > > > wrote:
>> > > > > > > > > I agree with Zoltan. Since we want to ensure
>> compatibility,
>> > it
>> > > > would
>> > > > > > be
>> > > > > > > > > better to choose the best option now instead of making
>> > everyone
>> > > > > > support
>> > > > > > > > two
>> > > > > > > > > options forever.
>> > > > > > > >
>> > > > > > > > I'd guess there probably isn't a single best option. I
>> suspect
>> > > > there's
>> > > > > > a
>> > > > > > > > tradeoff between ease of implementation and speed, for
>> > instance,
>> > > > since
>> > > > > > I
>> > > > > > > > expect it's easy to find an MD5 library in most programming
>> > > > languages
>> > > > > > and
>> > > > > > > > operating systems, yet MD5 is very slow compared to
>> > > > non-cryptographic
>> > > > > > > hash
>> > > > > > > > functions designed for speed like xxhash.
>> > > > > > > >
>> > > > > > > > There's also a significant amount of variability across
>> > processor
>> > > > > > > families
>> > > > > > > > (64-bit multiply-shift in ARM vs x86-64) or even different
>> > > > versions of
>> > > > > > > the
>> > > > > > > > same processor family (CLHash in Haswell vs. Sandy Lake).
>> There
>> > > are
>> > > > > > also
>> > > > > > > > quality tradeoffs that depend on the average byte length of
>> the
>> > > > input
>> > > > > > (FNV
>> > > > > > > > vs vhash) or how much L1 cache the user wants to use for the
>> > hash
>> > > > > > > function
>> > > > > > > > (tabulation hashing vs. multiply-shift).
>> > > > > > > >
>> > > > > > > > To deal with this level of ambiguity, I'd suggest that v1
>> > should
>> > > > > > include
>> > > > > > > a
>> > > > > > > > hash function that works well for certain common
>> environments.
>> > As
>> > > > far
>> > > > > > as
>> > > > > > > I
>> > > > > > > > know, murmur and xxhash would both fit that bill.
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Thanks & Best Regards
>> > > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Thanks & Best Regards
>> > > >
>> > >
>> >
>> >
>> > --
>> > Thanks & Best Regards
>> >
>>
>
>
> --
> Thanks & Best Regards
>


-- 
Thanks & Best Regards

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.
Sure, please see this PR <https://github.com/apache/parquet-format/pull/140> or
the updated file here
<https://github.com/chenjunjiedada/parquet-format/blob/master/BloomFilter.md>.

Thanks for reviewing the spec.

On Thu, Jul 4, 2019 at 11:57 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi Junjie,
>
> I read through the specification and while I support the feature in
> general, I find that the documentation may not be detailed enough to allow
> developers of different language bindings to implement it. Specifically,
> the Technical Approach section of the docs is very short and refers the
> reader to two publications for details. I think the specification would
> greatly benefit from including an explanation or a summary of the approach
> in this section.
>
> The "Build a Bloom filter" section contains a formula for calculating the
> optimal filter size for a desired false positive rate, but does not specify
> what false positive rates implementations should target by default and
> through what ways should they make it configurable by users. I understand
> that this may be an intentional omission, since targeting any false
> positive rate will result in a specification-compliant result, still I
> think it would be best to provide some recommendation for the different
> language bindings.
>
> Since this feature is getting added after encryption, it should be briefly
> but explicitly mentioned how it interacts with that (basically that it has
> to be encrypted, otherwise it would leak sensitive information, but by
> placing it inside the column chunk metadata, this is automatically taken
> care of).
>
> Finally, as a nitpick, I would prefer in-line links to related materials
> instead of numeric references that one must manually look up at the bottom
> of the page.
>
> Could you please add these improvements to the specification?
>
> Thanks,
>
> Zoltan
>
> On Wed, Jul 3, 2019 at 4:03 PM 俊杰陈 <cj...@gmail.com> wrote:
>
> > You are welcome, it's my honor.
> >
> > I think the PR <https://github.com/apache/parquet-format/pull/139> just
> > removes murmur3, which should express what I want.
> >
> >
> >
> >
> > On Wed, Jul 3, 2019 at 9:53 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > wrote:
> >
> > > Hi Junjie,
> > >
> > > Thanks for the update and also for your endurance in going through this
> > > tedious process in order to add bloom filtering to Parquet.
> > >
> > > I understand that your proposal is to go forward with xxHash instead of
> > the
> > > earlier murmur3, which you suggest to deprecate. Since the murmur3 hash
> > was
> > > never released, I think it could be completely removed from the spec
> > > instead of just getting deprecated. What is your opinion on this?
> > >
> > > Thanks,
> > >
> > > Zoltan
> > >
> > > On Wed, Jul 3, 2019 at 3:31 PM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > > I see, thanks for guiding on this.
> > > >
> > > > Per discussion in this thread and some investigation about changes on
> > > > current java and c++ implementation, and I think that is not hard to
> > > > handle. So I propose to use xxHash (the XXH64 version) as the default
> > > > hash strategy and deprecate previous murmur3 hash.
> > > >
> > > > I will update vote thread as well to make it clearer to all.
> > > >
> > > >
> > > > On Wed, Jul 3, 2019 at 6:08 PM Zoltan Ivanfi <zi@cloudera.com.invalid
> >
> > > > wrote:
> > > > >
> > > > > Hi Junjie,
> > > > >
> > > > > I think the vote is ambiguous in its current form (can people vote
> on
> > > one
> > > > > option only or can they vote on both?) and has a low chance of
> > getting
> > > > > votes in general because it's not a yes/no question but a
> > > > > choose-an-approach question instead. I think most contributors
> would
> > > > accept
> > > > > the hash chosen based on a community discussion but would be
> > reluctant
> > > to
> > > > > make that choice themselves in the form of a vote because it requires
> a
> > > much
> > > > > deeper dive into the technical intricacies involved. The committers
> > are
> > > > > experienced in the parquet code base but may not be as experienced
> in
> > > > bloom
> > > > > filters as you are.
> > > > >
> > > > > In my opinion, to get bloom filtering into parquet-mr, you should
> > > > convince
> > > > > the committers that the proposal is viable by addressing their
> > concerns
> > > > > (which I believe you have done), and not by delegating the task of
> > > making
> > > > > choices to them. I would suggest that you propose which one (or
> both)
> > > of
> > > > > the hashes should be included, summarize your motivations in this
> > > thread
> > > > > and if you don't get any objections for a day or two, call a YES/NO
> > > vote
> > > > > for that specific proposal in a separate thread.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Zoltan
> > > > >
> > > > > On Tue, Jul 2, 2019 at 3:52 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > > Any thoughts from other committers and developers?
> > > > > >
> > > > > > I'd like to start a vote first; you could either provide your
> > > input
> > > > here
> > > > > > or on vote thread.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Jul 1, 2019 at 8:20 PM Zoltan Ivanfi
> > <zi@cloudera.com.invalid
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I would like to clarify one point of my previous e-mail: While
> I
> > > > reasoned
> > > > > > > that for compressions and encodings we should avoid picking
> > > > algorithms
> > > > > > > superseded by better ones, I also reasoned that for bloom
> filters
> > > we
> > > > do
> > > > > > not
> > > > > > > necessarily have to be as strict, because a reader with missing
> > > > > > > implementation will still be able to read data from files that
> > > > contain
> > > > > > > unsupported bloom filter data structures.
> > > > > > >
> > > > > > > Personally I'm fine with moving forward with the current hash
> > > > proposal,
> > > > > > > even if the chosen algorithm is not considered to be the best
> of
> > > its
> > > > > > class.
> > > > > > >
> > > > > > > Br,
> > > > > > >
> > > > > > > Zoltan
> > > > > > >
> > > > > > > On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <jbapple@apache.org
> >
> > > > wrote:
> > > > > > >
> > > > > > > > On 2019/06/28 16:43:23, Ryan Blue <rblue@netflix.com.INVALID
> >
> > > > wrote:
> > > > > > > > > I agree with Zoltan. Since we want to ensure compatibility,
> > it
> > > > would
> > > > > > be
> > > > > > > > > better to choose the best option now instead of making
> > everyone
> > > > > > support
> > > > > > > > two
> > > > > > > > > options forever.
> > > > > > > >
> > > > > > > > I'd guess there probably isn't a single best option. I
> suspect
> > > > there's
> > > > > > a
> > > > > > > > tradeoff between ease of implementation and speed, for
> > instance,
> > > > since
> > > > > > I
> > > > > > > > expect it's easy to find an MD5 library in most programming
> > > > languages
> > > > > > and
> > > > > > > > operating systems, yet MD5 is very slow compared to
> > > > non-cryptographic
> > > > > > > hash
> > > > > > > > functions designed for speed like xxhash.
> > > > > > > >
> > > > > > > > There's also a significant amount of variability across
> > processor
> > > > > > > families
> > > > > > > > (64-bit multiply-shift in ARM vs x86-64) or even different
> > > > versions of
> > > > > > > the
> > > > > > > > same processor family (CLHash in Haswell vs. Sandy Lake).
> There
> > > are
> > > > > > also
> > > > > > > > quality tradeoffs that depend on the average byte length of
> the
> > > > input
> > > > > > (FNV
> > > > > > > > vs vhash) or how much L1 cache the user wants to use for the
> > hash
> > > > > > > function
> > > > > > > > (tabulation hashing vs. multiply-shift).
> > > > > > > >
> > > > > > > > To deal with this level of ambiguity, I'd suggest that v1
> > should
> > > > > > include
> > > > > > > a
> > > > > > > > hash function that works well for certain common
> environments.
> > As
> > > > far
> > > > > > as
> > > > > > > I
> > > > > > > > know, murmur and xxhash would both fit that bill.
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks & Best Regards
> > > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > > >
> > >
> >
> >
> > --
> > Thanks & Best Regards
> >
>


-- 
Thanks & Best Regards

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Junjie,

I read through the specification and while I support the feature in
general, I find that the documentation may not be detailed enough to allow
developers of different language bindings to implement it. Specifically,
the Technical Approach section of the docs is very short and refers the
reader to two publications for details. I think the specification would
greatly benefit from including an explanation or a summary of the approach
in this section.

The "Build a Bloom filter" section contains a formula for calculating the
optimal filter size for a desired false positive rate, but does not specify
what false positive rate implementations should target by default or how
they should make it configurable by users. I understand that this may be an
intentional omission, since targeting any false positive rate yields a
specification-compliant result. Still, I think it would be best to provide
some recommendation for the different language bindings.
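
To illustrate the kind of recommendation meant here, below is a minimal
sketch of sizing a filter from a target false positive rate. It assumes the
textbook formula for a classic Bloom filter, m = -n * ln(p) / (ln 2)^2, which
is not necessarily the formula used in the specification. The function names
and the 1% default are hypothetical and only serve as an example of a
documented default that users can override.

```python
import math


def optimal_num_bits(ndv: int, fpp: float = 0.01) -> int:
    """Bits needed by a classic Bloom filter holding ndv distinct values.

    Uses the textbook formula m = -n * ln(p) / (ln 2)^2. This is an
    assumption for illustration, not the Parquet specification's formula.
    """
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))


def optimal_num_hashes(num_bits: int, ndv: int) -> int:
    """Number of hash probes k = (m / n) * ln 2, rounded, at least 1."""
    return max(1, round(num_bits / ndv * math.log(2)))
```

With one million distinct values and the hypothetical 1% default, this comes
out at a bit under 10 bits per value. Whatever default a binding picks, the
point is that it should be stated and overridable.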

Since this feature is being added after encryption, the specification should
briefly but explicitly mention how the two interact (basically, the Bloom
filter has to be encrypted, otherwise it would leak sensitive information,
but placing it inside the column chunk metadata takes care of this
automatically).

Finally, as a nitpick, I would prefer in-line links to related materials
instead of numeric references that one must manually look up at the bottom
of the page.

Could you please add these improvements to the specification?

Thanks,

Zoltan

On Wed, Jul 3, 2019 at 4:03 PM 俊杰陈 <cj...@gmail.com> wrote:

> You are welcome, it's my honor.
>
> I think the PR <https://github.com/apache/parquet-format/pull/139> just
> removes murmur3, which should express what I want.
>
>
>
>
> On Wed, Jul 3, 2019 at 9:53 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> wrote:
>
> > Hi Junjie,
> >
> > Thanks for the update and also for your endurance in going through this
> > tedious process in order to add bloom filtering to Parquet.
> >
> > I understand that your proposal is to go forward with xxHash instead of
> the
> > earlier murmur3, which you suggest to deprecate. Since the murmur3 hash
> was
> > never released, I think it could be completely removed from the spec
> > instead of just getting deprecated. What is your opinion on this?
> >
> > Thanks,
> >
> > Zoltan
> >
> > On Wed, Jul 3, 2019 at 3:31 PM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > > I see, thanks for guiding on this.
> > >
> > > Per discussion in this thread and some investigation about changes on
> > > current java and c++ implementation, and I think that is not hard to
> > > handle. So I propose to use xxHash (the XXH64 version) as the default
> > > hash strategy and deprecate previous murmur3 hash.
> > >
> > > I will update vote thread as well to make it clearer to all.
> > >
> > >
> > > On Wed, Jul 3, 2019 at 6:08 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > > wrote:
> > > >
> > > > Hi Junjie,
> > > >
> > > > I think the vote is ambiguous in its current form (can people vote on
> > one
> > > > option only or can they vote on both?) and has a low chance of
> getting
> > > > votes in general because it's not a yes/no question but a
> > > > choose-an-approach question instead. I think most contributors would
> > > accept
> > > > the hash chosen based on a community discussion but would be
> reluctant
> > to
> > > > make that choice themselves in the form of a vote because it requires a
> > much
> > > > deeper dive into the technical intricacies involved. The committers
> are
> > > > experienced in the parquet code base but may not be as experienced in
> > > bloom
> > > > filters as you are.
> > > >
> > > > In my opinion, to get bloom filtering into parquet-mr, you should
> > > convince
> > > > the committers that the proposal is viable by addressing their
> concerns
> > > > (which I believe you have done), and not by delegating the task of
> > making
> > > > choices to them. I would suggest that you propose which one (or both)
> > of
> > > > the hashes should be included, summarize your motivations in this
> > thread
> > > > and if you don't get any objections for a day or two, call a YES/NO
> > vote
> > > > for that specific proposal in a separate thread.
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > > > On Tue, Jul 2, 2019 at 3:52 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > > Any thoughts from other committers and developers?
> > > > >
> > > > > I'd like to start a vote first; you could either provide your
> > input
> > > here
> > > > > or on vote thread.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jul 1, 2019 at 8:20 PM Zoltan Ivanfi
> <zi@cloudera.com.invalid
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I would like to clarify one point of my previous e-mail: While I
> > > reasoned
> > > > > > that for compressions and encodings we should avoid picking
> > > algorithms
> > > > > > superseded by better ones, I also reasoned that for bloom filters
> > we
> > > do
> > > > > not
> > > > > > necessarily have to be as strict, because a reader with missing
> > > > > > implementation will still be able to read data from files that
> > > contain
> > > > > > unsupported bloom filter data structures.
> > > > > >
> > > > > > Personally I'm fine with moving forward with the current hash
> > > proposal,
> > > > > > even if the chosen algorithm is not considered to be the best of
> > its
> > > > > class.
> > > > > >
> > > > > > Br,
> > > > > >
> > > > > > Zoltan
> > > > > >
> > > > > > On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <jb...@apache.org>
> > > wrote:
> > > > > >
> > > > > > > On 2019/06/28 16:43:23, Ryan Blue <rb...@netflix.com.INVALID>
> > > wrote:
> > > > > > > > I agree with Zoltan. Since we want to ensure compatibility,
> it
> > > would
> > > > > be
> > > > > > > > better to choose the best option now instead of making
> everyone
> > > > > support
> > > > > > > two
> > > > > > > > options forever.
> > > > > > >
> > > > > > > I'd guess there probably isn't a single best option. I suspect
> > > there's
> > > > > a
> > > > > > > tradeoff between ease of implementation and speed, for
> instance,
> > > since
> > > > > I
> > > > > > > expect it's easy to find an MD5 library in most programming
> > > languages
> > > > > and
> > > > > > > operating systems, yet MD5 is very slow compared to
> > > non-cryptographic
> > > > > > hash
> > > > > > > functions designed for speed like xxhash.
> > > > > > >
> > > > > > > There's also a significant amount of variability across
> processor
> > > > > > families
> > > > > > > (64-bit multiply-shift in ARM vs x86-64) or even different
> > > versions of
> > > > > > the
> > > > > > > same processor family (CLHash in Haswell vs. Sandy Lake). There
> > are
> > > > > also
> > > > > > > quality tradeoffs that depend on the average byte length of the
> > > input
> > > > > (FNV
> > > > > > > vs vhash) or how much L1 cache the user wants to use for the
> hash
> > > > > > function
> > > > > > > (tabulation hashing vs. multiply-shift).
> > > > > > >
> > > > > > > To deal with this level of ambiguity, I'd suggest that v1
> should
> > > > > include
> > > > > > a
> > > > > > > hash function that works well for certain common environments.
> As
> > > far
> > > > > as
> > > > > > I
> > > > > > > know, murmur and xxhash would both fit that bill.
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > > > >
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
> >
>
>
> --
> Thanks & Best Regards
>

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.
You are welcome, it's my honor.

I think the PR <https://github.com/apache/parquet-format/pull/139> just
removes murmur3, which should express what I want.




On Wed, Jul 3, 2019 at 9:53 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi Junjie,
>
> Thanks for the update and also for your endurance in going through this
> tedious process in order to add bloom filtering to Parquet.
>
> I understand that your proposal is to go forward with xxHash instead of the
> earlier murmur3, which you suggest to deprecate. Since the murmur3 hash was
> never released, I think it could be completely removed from the spec
> instead of just getting deprecated. What is your opinion on this?
>
> Thanks,
>
> Zoltan
>
> On Wed, Jul 3, 2019 at 3:31 PM 俊杰陈 <cj...@gmail.com> wrote:
>
> > I see, thanks for guiding on this.
> >
> > Per discussion in this thread and some investigation about changes on
> > current java and c++ implementation, and I think that is not hard to
> > handle. So I propose to use xxHash (the XXH64 version) as the default
> > hash strategy and deprecate previous murmur3 hash.
> >
> > I will update vote thread as well to make it clearer to all.
> >
> >
> > On Wed, Jul 3, 2019 at 6:08 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > wrote:
> > >
> > > Hi Junjie,
> > >
> > > I think the vote is ambiguous in its current form (can people vote on
> one
> > > option only or can they vote on both?) and has a low chance of getting
> > > votes in general because it's not a yes/no question but a
> > > choose-an-approach question instead. I think most contributors would
> > accept
> > > the hash chosen based on a community discussion but would be reluctant
> to
> > > make that choice themselves in the form of a vote because it requires a
> much
> > > deeper dive into the technical intricacies involved. The committers are
> > > experienced in the parquet code base but may not be as experienced in
> > bloom
> > > filters as you are.
> > >
> > > In my opinion, to get bloom filtering into parquet-mr, you should
> > convince
> > > the committers that the proposal is viable by addressing their concerns
> > > (which I believe you have done), and not by delegating the task of
> making
> > > choices to them. I would suggest that you propose which one (or both)
> of
> > > the hashes should be included, summarize your motivations in this
> thread
> > > and if you don't get any objections for a day or two, call a YES/NO
> vote
> > > for that specific proposal in a separate thread.
> > >
> > > Thanks,
> > >
> > > Zoltan
> > >
> > > On Tue, Jul 2, 2019 at 3:52 AM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > > Any thoughts from other committers and developers?
> > > >
> > > > I'd like to start a vote first; you could either provide your
> input
> > here
> > > > or on vote thread.
> > > >
> > > >
> > > >
> > > > On Mon, Jul 1, 2019 at 8:20 PM Zoltan Ivanfi <zi@cloudera.com.invalid
> >
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I would like to clarify one point of my previous e-mail: While I
> > reasoned
> > > > > that for compressions and encodings we should avoid picking
> > algorithms
> > > > > superseded by better ones, I also reasoned that for bloom filters
> we
> > do
> > > > not
> > > > > necessarily have to be as strict, because a reader with missing
> > > > > implementation will still be able to read data from files that
> > contain
> > > > > unsupported bloom filter data structures.
> > > > >
> > > > > Personally I'm fine with moving forward with the current hash
> > proposal,
> > > > > even if the chosen algorithm is not considered to be the best of
> its
> > > > class.
> > > > >
> > > > > Br,
> > > > >
> > > > > Zoltan
> > > > >
> > > > > On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <jb...@apache.org>
> > wrote:
> > > > >
> > > > > > On 2019/06/28 16:43:23, Ryan Blue <rb...@netflix.com.INVALID>
> > wrote:
> > > > > > > I agree with Zoltan. Since we want to ensure compatibility, it
> > would
> > > > be
> > > > > > > better to choose the best option now instead of making everyone
> > > > support
> > > > > > two
> > > > > > > options forever.
> > > > > >
> > > > > > I'd guess there probably isn't a single best option. I suspect
> > there's
> > > > a
> > > > > > tradeoff between ease of implementation and speed, for instance,
> > since
> > > > I
> > > > > > expect it's easy to find an MD5 library in most programming
> > languages
> > > > and
> > > > > > operating systems, yet MD5 is very slow compared to
> > non-cryptographic
> > > > > hash
> > > > > > functions designed for speed like xxhash.
> > > > > >
> > > > > > There's also a significant amount of variability across processor
> > > > > families
> > > > > > (64-bit multiply-shift in ARM vs x86-64) or even different
> > versions of
> > > > > the
> > > > > > same processor family (CLHash in Haswell vs. Sandy Lake). There
> are
> > > > also
> > > > > > quality tradeoffs that depend on the average byte length of the
> > input
> > > > (FNV
> > > > > > vs vhash) or how much L1 cache the user wants to use for the hash
> > > > > function
> > > > > > (tabulation hashing vs. multiply-shift).
> > > > > >
> > > > > > To deal with this level of ambiguity, I'd suggest that v1 should
> > > > include
> > > > > a
> > > > > > hash function that works well for certain common environments. As
> > far
> > > > as
> > > > > I
> > > > > > know, murmur and xxhash would both fit that bill.
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > > >
> >
> >
> >
> > --
> > Thanks & Best Regards
> >
>


-- 
Thanks & Best Regards

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Junjie,

Thanks for the update and also for your endurance in going through this
tedious process in order to add bloom filtering to Parquet.

I understand that your proposal is to go forward with xxHash instead of the
earlier murmur3, which you suggest deprecating. Since the murmur3 hash was
never released, I think it could be completely removed from the spec
instead of just getting deprecated. What is your opinion on this?

Thanks,

Zoltan

On Wed, Jul 3, 2019 at 3:31 PM 俊杰陈 <cj...@gmail.com> wrote:

> I see, thanks for guiding on this.
>
> Per discussion in this thread and some investigation about changes on
> current java and c++ implementation, and I think that is not hard to
> handle. So I propose to use xxHash (the XXH64 version) as the default
> hash strategy and deprecate previous murmur3 hash.
>
> I will update vote thread as well to make it clearer to all.
>
>
> On Wed, Jul 3, 2019 at 6:08 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> wrote:
> >
> > Hi Junjie,
> >
> > I think the vote is ambiguous in its current form (can people vote on one
> > option only or can they vote on both?) and has a low chance of getting
> > votes in general because it's not a yes/no question but a
> > choose-an-approach question instead. I think most contributors would
> accept
> > the hash chosen based on a community discussion but would be reluctant to
> > make that choice themselves in the form of a vote because it requires a much
> > deeper dive into the technical intricacies involved. The committers are
> > experienced in the parquet code base but may not be as experienced in
> bloom
> > filters as you are.
> >
> > In my opinion, to get bloom filtering into parquet-mr, you should
> convince
> > the committers that the proposal is viable by addressing their concerns
> > (which I believe you have done), and not by delegating the task of making
> > choices to them. I would suggest that you propose which one (or both) of
> > the hashes should be included, summarize your motivations in this thread
> > and if you don't get any objections for a day or two, call a YES/NO vote
> > for that specific proposal in a separate thread.
> >
> > Thanks,
> >
> > Zoltan
> >
> > On Tue, Jul 2, 2019 at 3:52 AM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > > Any thoughts from other committers and developers?
> > >
> > > I'd like to start a vote first; you could either provide your input
> here
> > > or on vote thread.
> > >
> > >
> > >
> > > On Mon, Jul 1, 2019 at 8:20 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to clarify one point of my previous e-mail: While I
> reasoned
> > > > that for compressions and encodings we should avoid picking
> algorithms
> > > > superseded by better ones, I also reasoned that for bloom filters we
> do
> > > not
> > > > necessarily have to be as strict, because a reader with missing
> > > > implementation will still be able to read data from files that
> contain
> > > > unsupported bloom filter data structures.
> > > >
> > > > Personally I'm fine with moving forward with the current hash
> proposal,
> > > > even if the chosen algorithm is not considered to be the best of its
> > > class.
> > > >
> > > > Br,
> > > >
> > > > Zoltan
> > > >
> > > > On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <jb...@apache.org>
> wrote:
> > > >
> > > > > On 2019/06/28 16:43:23, Ryan Blue <rb...@netflix.com.INVALID>
> wrote:
> > > > > > I agree with Zoltan. Since we want to ensure compatibility, it
> would
> > > be
> > > > > > better to choose the best option now instead of making everyone
> > > support
> > > > > two
> > > > > > options forever.
> > > > >
> > > > > I'd guess there probably isn't a single best option. I suspect
> there's
> > > a
> > > > > tradeoff between ease of implementation and speed, for instance,
> since
> > > I
> > > > > expect it's easy to find an MD5 library in most programming
> languages
> > > and
> > > > > operating systems, yet MD5 is very slow compared to
> non-cryptographic
> > > > hash
> > > > > functions designed for speed like xxhash.
> > > > >
> > > > > There's also a significant amount of variability across processor
> > > > families
> > > > > (64-bit multiply-shift in ARM vs x86-64) or even different
> versions of
> > > > the
> > > > > same processor family (CLHash in Haswell vs. Sandy Lake). There are
> > > also
> > > > > quality tradeoffs that depend on the average byte length of the
> input
> > > (FNV
> > > > > vs vhash) or how much L1 cache the user wants to use for the hash
> > > > function
> > > > > (tabulation hashing vs. multiply-shift).
> > > > >
> > > > > To deal with this level of ambiguity, I'd suggest that v1 should
> > > include
> > > > a
> > > > > hash function that works well for certain common environments. As
> far
> > > as
> > > > I
> > > > > know, murmur and xxhash would both fit that bill.
> > > > >
> > > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
>
>
>
> --
> Thanks & Best Regards
>

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.
I see, thanks for the guidance on this.

Per the discussion in this thread and some investigation into the changes
needed in the current Java and C++ implementations, I think the switch is
not hard to handle. So I propose to use xxHash (the XXH64 version) as the
default hash strategy and to deprecate the previous murmur3 hash.

I will update the vote thread as well to make it clearer to all.


On Wed, Jul 3, 2019 at 6:08 PM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:
>
> Hi Junjie,
>
> I think the vote is ambiguous in its current form (can people vote on one
> option only or can they vote on both?) and has a low chance of getting
> votes in general because it's not a yes/no question but a
> choose-an-approach question instead. I think most contributors would accept
> the hash chosen based on a community discussion but would be reluctant to
> make that choice themselves in the form of a vote because it requires a much
> deeper dive into the technical intricacies involved. The committers are
> experienced in the parquet code base but may not be as experienced in bloom
> filters as you are.
>
> In my opinion, to get bloom filtering into parquet-mr, you should convince
> the committers that the proposal is viable by addressing their concerns
> (which I believe you have done), and not by delegating the task of making
> choices to them. I would suggest that you propose which one (or both) of
> the hashes should be included, summarize your motivations in this thread
> and if you don't get any objections for a day or two, call a YES/NO vote
> for that specific proposal in a separate thread.
>
> Thanks,
>
> Zoltan
>
> On Tue, Jul 2, 2019 at 3:52 AM 俊杰陈 <cj...@gmail.com> wrote:
>
> > Any thoughts from other committers and developers?
> >
> > I'd like to start a vote first; you can provide your input either here
> > or on the vote thread.
> >
> >
> >
> > On Mon, Jul 1, 2019 at 8:20 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > wrote:
> >
> > > Hi,
> > >
> > > I would like to clarify one point of my previous e-mail: While I reasoned
> > > that for compressions and encodings we should avoid picking algorithms
> > > superseded by better ones, I also reasoned that for bloom filters we do
> > not
> > > necessarily have to be as strict, because a reader with missing
> > > implementation will still be able to read data from files that contain
> > > unsupported bloom filter data structures.
> > >
> > > Personally I'm fine with moving forward with the current hash proposal,
> > > even if the chosen algorithm is not considered to be the best of its
> > class.
> > >
> > > Br,
> > >
> > > Zoltan
> > >
> > > On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <jb...@apache.org> wrote:
> > >
> > > > On 2019/06/28 16:43:23, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> > > > > I agree with Zoltan. Since we want to ensure compatibility, it would
> > be
> > > > > better to choose the best option now instead of making everyone
> > support
> > > > two
> > > > > options forever.
> > > >
> > > > I'd guess there probably isn't a single best option. I suspect there's
> > a
> > > > tradeoff between ease of implementation and speed, for instance, since
> > I
> > > > expect it's easy to find an MD5 library in most programming languages
> > and
> > > > operating systems, yet MD5 is very slow compared to non-cryptographic
> > > hash
> > > > functions designed for speed like xxhash.
> > > >
> > > > There's also a significant amount of variability across processor
> > > families
> > > > (64-bit multiply-shift in ARM vs x86-64) or even different versions of
> > > the
> > > > same processor family (CLHash in Haswell vs. Sandy Lake). There are
> > also
> > > > > quality tradeoffs that depend on the average byte length of the input
> > (FNV
> > > > vs vhash) or how much L1 cache the user wants to use for the hash
> > > function
> > > > (tabulation hashing vs. multiply-shift).
> > > >
> > > > To deal with this level of ambiguity, I'd suggest that v1 should
> > include
> > > a
> > > > hash function that works well for certain common environments. As far
> > as
> > > I
> > > > know, murmur and xxhash would both fit that bill.
> > > >
> > >
> >
> >
> > --
> > Thanks & Best Regards
> >



-- 
Thanks & Best Regards

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Junjie,

I think the vote is ambiguous in its current form (can people vote on one
option only or can they vote on both?) and has a low chance of getting
votes in general because it's not a yes/no question but a
choose-an-approach question instead. I think most contributors would accept
the hash chosen based on a community discussion but would be reluctant to
make that choice themselves in the form of a vote because it requires a much
deeper dive into the technical intricacies involved. The committers are
experienced in the parquet code base but may not be as experienced in bloom
filters as you are.

In my opinion, to get bloom filtering into parquet-mr, you should convince
the committers that the proposal is viable by addressing their concerns
(which I believe you have done), and not by delegating the task of making
choices to them. I would suggest that you propose which one (or both) of
the hashes should be included, summarize your motivations in this thread
and if you don't get any objections for a day or two, call a YES/NO vote
for that specific proposal in a separate thread.

Thanks,

Zoltan

On Tue, Jul 2, 2019 at 3:52 AM 俊杰陈 <cj...@gmail.com> wrote:

> Any thoughts from other committers and developers?
>
> I'd like to start a vote first; you can provide your input either here
> or on the vote thread.
>
>
>
> On Mon, Jul 1, 2019 at 8:20 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> wrote:
>
> > Hi,
> >
> > I would like to clarify one point of my previous e-mail: While I reasoned
> > that for compressions and encodings we should avoid picking algorithms
> > superseded by better ones, I also reasoned that for bloom filters we do
> not
> > necessarily have to be as strict, because a reader with missing
> > implementation will still be able to read data from files that contain
> > unsupported bloom filter data structures.
> >
> > Personally I'm fine with moving forward with the current hash proposal,
> > even if the chosen algorithm is not considered to be the best of its
> class.
> >
> > Br,
> >
> > Zoltan
> >
> > On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <jb...@apache.org> wrote:
> >
> > > On 2019/06/28 16:43:23, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> > > > I agree with Zoltan. Since we want to ensure compatibility, it would
> be
> > > > better to choose the best option now instead of making everyone
> support
> > > two
> > > > options forever.
> > >
> > > I'd guess there probably isn't a single best option. I suspect there's
> a
> > > tradeoff between ease of implementation and speed, for instance, since
> I
> > > expect it's easy to find an MD5 library in most programming languages
> and
> > > operating systems, yet MD5 is very slow compared to non-cryptographic
> > hash
> > > functions designed for speed like xxhash.
> > >
> > > There's also a significant amount of variability across processor
> > families
> > > (64-bit multiply-shift in ARM vs x86-64) or even different versions of
> > the
> > > same processor family (CLHash in Haswell vs. Sandy Lake). There are
> also
> > > quality tradeoffs that depend on the average byte length of the input
> (FNV
> > > vs vhash) or how much L1 cache the user wants to use for the hash
> > function
> > > (tabulation hashing vs. multiply-shift).
> > >
> > > To deal with this level of ambiguity, I'd suggest that v1 should
> include
> > a
> > > hash function that works well for certain common environments. As far
> as
> > I
> > > know, murmur and xxhash would both fit that bill.
> > >
> >
>
>
> --
> Thanks & Best Regards
>

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.
Any thoughts from other committers and developers?

I'd like to start a vote first; you can provide your input either here
or on the vote thread.



On Mon, Jul 1, 2019 at 8:20 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi,
>
> I would like to clarify one point of my previous e-mail: While I reasoned
> that for compressions and encodings we should avoid picking algorithms
> superseded by better ones, I also reasoned that for bloom filters we do not
> necessarily have to be as strict, because a reader with missing
> implementation will still be able to read data from files that contain
> unsupported bloom filter data structures.
>
> Personally I'm fine with moving forward with the current hash proposal,
> even if the chosen algorithm is not considered to be the best of its class.
>
> Br,
>
> Zoltan
>
> On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <jb...@apache.org> wrote:
>
> > On 2019/06/28 16:43:23, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> > > I agree with Zoltan. Since we want to ensure compatibility, it would be
> > > better to choose the best option now instead of making everyone support
> > two
> > > options forever.
> >
> > I'd guess there probably isn't a single best option. I suspect there's a
> > tradeoff between ease of implementation and speed, for instance, since I
> > expect it's easy to find an MD5 library in most programming languages and
> > operating systems, yet MD5 is very slow compared to non-cryptographic
> hash
> > functions designed for speed like xxhash.
> >
> > There's also a significant amount of variability across processor
> families
> > (64-bit multiply-shift in ARM vs x86-64) or even different versions of
> the
> > same processor family (CLHash in Haswell vs. Sandy Lake). There are also
> > quality tradeoffs that depend on the average byte length of the input (FNV
> > vs vhash) or how much L1 cache the user wants to use for the hash
> function
> > (tabulation hashing vs. multiply-shift).
> >
> > To deal with this level of ambiguity, I'd suggest that v1 should include
> a
> > hash function that works well for certain common environments. As far as
> I
> > know, murmur and xxhash would both fit that bill.
> >
>


-- 
Thanks & Best Regards