Posted to dev@parquet.apache.org by Jim Apple <jb...@apache.org> on 2019/06/20 00:00:14 UTC

[DISCUSS] Prepare release for parquet-format 2.7.0?

This is a thread for discussing a release of parquet-format. The last release appears to be 2.6.0 from September 2018:

https://github.com/apache/parquet-format/releases

The diff from then until now is

https://github.com/apache/parquet-format/compare/df6132b94f273521a418a74442085fdd5a0aa009...4157b4c6132086e318943f1898523f7dcb013f35

It's my understanding we'll need a parquet-format release before a parquet-mr and/or parquet-cpp release[0].

In the most recent discussion thread on this, there were two concerns raised that are not yet addressed:

1. Should Bloom filters use a different hash function by default?
2. Should we devise an automated way to test cross-language compatibility of parquet files, especially for the new Bloom filter spec?

I am suggesting a release despite these open issues based on my belief that it's reasonable to handle these after a parquet-format release.

A final note: although I am suggesting the release, it looks to me like the release recipe[1] can only be executed by a committer, which I am not. This means even if there is consensus on a release, someone else would need to do the legwork.

Thanks,
Jim

[0] This code now lives in https://github.com/apache/arrow/tree/master/cpp/src/parquet, I believe.

[1] https://parquet.apache.org/documentation/how-to-release/

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.

Thank you all for making the next step clear. I also agree to choose the
best option.

The changes to parquet.thrift and BloomFilter.md are ready now in this PR
<https://github.com/apache/parquet-format/pull/139>. Please take a look
and then we can move forward to a VOTE.


On Sat, Jun 29, 2019 at 12:44 AM Ryan Blue <rb...@netflix.com.invalid>
wrote:

> I agree with Zoltan. Since we want to ensure compatibility, it would be
> better to choose the best option now instead of making everyone support two
> options forever.
>
> In terms of next steps, I think that getting a clean write-up of the design
> and changes and starting a VOTE thread that points to them are the next
> steps. The write-up is already done, but needs to be updated for xxHash,
> right?
>
> rb
>
> On Fri, Jun 28, 2019 at 3:59 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> wrote:
>
> > Hi,
> >
> > I think the concern was not about the lack of any specific hash algorithm,
> > but about the choice of the one that got added. Generally for compressions
> > and encodings, we are very picky about which ones to add to the
> > specification, because each one has to be implemented in every language
> > binding. This is not only a considerable effort, but is also error-prone
> > (see LZ4 for an example, which was added to both the Java and the C++
> > implementation of Parquet, yet they are incompatible with each other). And
> > lack of support is not only a minor annoyance in this case: if one is forced
> > to use an older reader that does not support the new encoding yet (or a
> > language binding that does not support it at all), the data simply cannot
> > be read.
> >
> > For this reason, if we already know that an algorithm is suboptimal and
> > there are better ones available, we prefer not to add it at all. However, I
> > don't think that the reasoning above applies here, because the bloom
> > filter is optional metadata and the data is perfectly readable without
> > supporting it. Even if it is very likely that we will want to move to a
> > better hash algorithm later, we already know that we won't have to keep
> > supporting the current one forever, since removing support is not a
> > breaking change (at least functionally; performance-wise it will result in
> > a regression for old files).
> >
> > Br,
> >
> > Zoltan
> >
> > On Fri, Jun 28, 2019 at 11:12 AM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > > Thanks,
> > >
> > > The naming issue has been fixed. I also created a PR
> > > <https://github.com/apache/parquet-format/pull/139> to add xxHash as an
> > > alternative option to address Todd's concern. Does that resolve the
> > > concerns? If so, we can create a VOTE against the spec (the bloom filter
> > > diff in the parquet-format repo).
> > >
> > > On Fri, Jun 28, 2019 at 4:03 PM Driesprong, Fokko <fokko@driesprong.frl>
> > > wrote:
> > >
> > > > Ryan has a valid point here. Once the Bloom filters get released, it won't
> > > > be as easy anymore to change it because we will break an already released
> > > > API.
> > > >
> > > > There was a related discussion a while ago:
> > > >
> > > > https://lists.apache.org/thread.html/027e9d73093df84448e07d8514b9d669906cd5b83ae59a76f38aaa55@%3Cdev.parquet.apache.org%3E
> > > >
> > > > My suggestion would be to create a VOTE to formally adopt the vote and fix
> > > > the remaining concerns. For example, the one that Zoltan raised in the list
> > > > above.
> > > >
> > > > Cheers, Fokko
> > > >
> > > > On Fri, Jun 28, 2019 at 01:13, Jim Apple <jb...@apache.org> wrote:
> > > >
> > > > > > I think we need to have a vote on the bloom filter
> > > > > > structures first. We need to make sure that the community has vetted the
> > > > > > design and is comfortable with adding this, just like we did with the
> > > > > > Parquet encryption design and the page index design.
> > > > >
> > > > > Thank you for the note, Ryan. Based on my experience on Apache Impala, I
> > > > > was under the impression that a git commit signified at least a temporary
> > > > > agreement that the commit should make it into a future release. I
> > > > > understand you to be saying that in parquet-format, a vote on format
> > > > > additions is standard, whether or not a commit made it into HEAD.
> > > > >
> > > > > There have been previous discussions of Bloom filters in the pull
> > > > > requests, on this list, and in live videochat meetups (from quite a while
> > > > > ago). In your opinion, should we start a new discussion, or start a [VOTE]
> > > > > thread with pointers to the old discussions, or some third option?
> > > > >
> > > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Thanks & Best Regards
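
Zoltan's argument quoted above rests on the Bloom filter being optional
metadata: a reader that does not recognize the filter's hash algorithm can
ignore the filter and still read the data, only more slowly. A minimal Python
sketch of that fallback follows. Everything in it is hypothetical and for
illustration only: ColumnChunk, build_filter, may_contain, find and the
"crc32" tag are invented names, the filter is a toy single-hash bitset, and
none of it reflects the actual structures in parquet.thrift.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

NUM_BITS = 1024  # toy filter size, chosen arbitrarily for this sketch


@dataclass
class ColumnChunk:
    values: List[bytes]
    filter_hash: Optional[str] = None  # hypothetical tag naming the hash used
    filter_bits: int = 0               # toy bitset stored as a Python int


def build_filter(values: List[bytes], hash_fn: Callable[[bytes], int]) -> int:
    """Set one bit per value (a real Bloom filter sets several)."""
    bits = 0
    for value in values:
        bits |= 1 << (hash_fn(value) % NUM_BITS)
    return bits


def may_contain(chunk: ColumnChunk, value: bytes,
                known_hashes: Dict[str, Callable[[bytes], int]]) -> bool:
    """Return False only when a usable filter rules the value out."""
    hash_fn = known_hashes.get(chunk.filter_hash)
    if hash_fn is None:
        # Missing or unsupported filter: the chunk cannot be skipped,
        # but it can still be read.
        return True
    return ((chunk.filter_bits >> (hash_fn(value) % NUM_BITS)) & 1) == 1


def find(chunks: List[ColumnChunk], value: bytes,
         known_hashes: Dict[str, Callable[[bytes], int]]) -> List[int]:
    """Indexes of chunks that really contain the value."""
    return [i for i, chunk in enumerate(chunks)
            if may_contain(chunk, value, known_hashes) and value in chunk.values]
```

A reader that knows the hash and one that does not return the same answer
from find; the second one simply scans every chunk. That is the sense in
which dropping support for a filter hash is a performance regression rather
than a breaking change.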

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Gidon Gershinsky <gg...@gmail.com>.

Hi Zoltan,

This was brought up at the sync today; there was a general consensus that
the encryption (spec and Thrift structures) should be released with
parquet-format 2.7.

Cheers, Gidon.


On Fri, Jul 5, 2019 at 3:35 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi,
>
> I just wanted to leave a comment on the pull request to update
> Encryption.md as well, but to my surprise it is not in master yet despite
> the vote for the encryption feature having passed 6 months ago. What are
> the plans for merging that? Should it be included in parquet-format 2.7?
>
> Thanks,
>
> Zoltan
>
> On Fri, Jul 5, 2019 at 1:33 PM Zoltan Ivanfi <zi...@cloudera.com> wrote:
>
> > Hi,
> >
> > I just noticed that yesterday I misunderstood that the Bloom filter is a
> > part of the column chunk metadata, when in fact it is only the offset of it
> > that is stored there. In this case we definitely need to pay more attention
> > to the encryption aspect because it won't happen automatically.
> >
> > Br,
> >
> > Zoltan
> >
> > On Fri, Jul 5, 2019 at 1:09 PM 俊杰陈 <cj...@gmail.com> wrote:
> >
> >> That would be great, thank you.
> >>
> >> On Fri, Jul 5, 2019 at 5:40 PM Gidon Gershinsky <gg...@gmail.com> wrote:
> >>
> >> > Hi Junjie,
> >> >
> >> > I'd be glad to have a look at the encryption part. Will add my comments
> >> > early next week.
> >> >
> >> > Cheers, Gidon.
> >> >
> >> > On Fri, Jul 5, 2019 at 12:16 PM 俊杰陈 <cj...@gmail.com> wrote:
> >> >
> >> > > Sorry, the latest file is
> >> > > <https://github.com/chenjunjiedada/parquet-format/blob/PARQUET-1617/BloomFilter.md>.
> >> > >
> >> > > On Fri, Jul 5, 2019 at 5:14 PM 俊杰陈 <cj...@gmail.com> wrote:
> >> > >
> >> > > > Sure, please see this PR
> >> > > > <https://github.com/apache/parquet-format/pull/140> or the updated file here
> >> > > > <https://github.com/chenjunjiedada/parquet-format/blob/master/BloomFilter.md>.
> >> > > >
> >> > > > Thanks for reviewing the spec.
> >> > > >
> >> > > > On Thu, Jul 4, 2019 at 11:57 PM Zoltan Ivanfi <zi@cloudera.com.invalid>
> >> > > > wrote:
> >> > > >
> >> > > >> Hi Junjie,
> >> > > >>
> >> > > >> I read through the specification and while I support the feature in
> >> > > >> general, I find that the documentation may not be detailed enough to allow
> >> > > >> developers of different language bindings to implement it. Specifically,
> >> > > >> the Technical Approach section of the docs is very short and refers the
> >> > > >> reader to two publications for details. I think the specification would
> >> > > >> greatly benefit from including an explanation or a summary of the approach
> >> > > >> in this section.
> >> > > >>
> >> > > >> The "Build a Bloom filter" section contains a formula for calculating the
> >> > > >> optimal filter size for a desired false positive rate, but does not specify
> >> > > >> what false positive rates implementations should target by default and
> >> > > >> through what ways they should make it configurable by users. I understand
> >> > > >> that this may be an intentional omission, since targeting any false
> >> > > >> positive rate will result in a specification-compliant result; still, I
> >> > > >> think it would be best to provide some recommendation for the different
> >> > > >> language bindings.
> >> > > >>
> >> > > >> Since this feature is getting added after encryption, it should be briefly
> >> > > >> but explicitly mentioned how it interacts with that (basically that it has
> >> > > >> to be encrypted, otherwise it would leak sensitive information, but by
> >> > > >> placing it inside the column chunk metadata, this is automatically taken
> >> > > >> care of).
> >> > > >>
> >> > > >> Finally, as a nitpick, I would prefer in-line links to related materials
> >> > > >> instead of numeric references that one must manually look up at the bottom
> >> > > >> of the page.
> >> > > >>
> >> > > >> Could you please add these improvements to the specification?
> >> > > >>
> >> > > >> Thanks,
> >> > > >>
> >> > > >> Zoltan
> >> > > >>
> >> > > >> On Wed, Jul 3, 2019 at 4:03 PM 俊杰陈 <cj...@gmail.com> wrote:
> >> > > >>
> >> > > >> > You are welcome, it's my honor.
> >> > > >> >
> >> > > >> > I think the PR <https://github.com/apache/parquet-format/pull/139> just
> >> > > >> > removes murmur3, which should express what I want.
> >> > > >> >
> >> > > >> >
> >> > > >> >
> >> > > >> >
> >> > > >> > On Wed, Jul 3, 2019 at 9:53 PM Zoltan Ivanfi <zi@cloudera.com.invalid>
> >> > > >> > wrote:
> >> > > >> >
> >> > > >> > > Hi Junjie,
> >> > > >> > >
> >> > > >> > > Thanks for the update and also for your endurance in going through this
> >> > > >> > > tedious process in order to add bloom filtering to Parquet.
> >> > > >> > >
> >> > > >> > > I understand that your proposal is to go forward with xxHash instead of the
> >> > > >> > > earlier murmur3, which you suggest deprecating. Since the murmur3 hash was
> >> > > >> > > never released, I think it could be completely removed from the spec
> >> > > >> > > instead of just getting deprecated. What is your opinion on this?
> >> > > >> > >
> >> > > >> > > Thanks,
> >> > > >> > >
> >> > > >> > > Zoltan
> >> > > >> > >
> >> > > >> > > On Wed, Jul 3, 2019 at 3:31 PM 俊杰陈 <cj...@gmail.com> wrote:
> >> > > >> > >
> >> > > >> > > > I see, thanks for guiding on this.
> >> > > >> > > >
> >> > > >> > > > Per the discussion in this thread and some investigation into the changes
> >> > > >> > > > needed in the current Java and C++ implementations, I think that is not
> >> > > >> > > > hard to handle. So I propose to use xxHash (the XXH64 version) as the
> >> > > >> > > > default hash strategy and deprecate the previous murmur3 hash.
> >> > > >> > > >
> >> > > >> > > > I will update the vote thread as well to make it clearer to all.
> >> > > >> > > >
> >> > > >> > > >
> >> > > >> > > > On Wed, Jul 3, 2019 at 6:08 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> >> > > >> > > > wrote:
> >> > > >> > > > >
> >> > > >> > > > > Hi Junjie,
> >> > > >> > > > >
> >> > > >> > > > > I think the vote is ambiguous in its current form (can people vote on one
> >> > > >> > > > > option only or can they vote on both?) and has a low chance of getting
> >> > > >> > > > > votes in general because it's not a yes/no question but a
> >> > > >> > > > > choose-an-approach question instead. I think most contributors would accept
> >> > > >> > > > > the hash chosen based on a community discussion but would be reluctant to
> >> > > >> > > > > make that choice themselves in the form of a vote because it requires a much
> >> > > >> > > > > deeper dive into the technical intricacies involved. The committers are
> >> > > >> > > > > experienced in the parquet code base but may not be as experienced in bloom
> >> > > >> > > > > filters as you are.
> >> > > >> > > > >
> >> > > >> > > > > In my opinion, to get bloom filtering into parquet-mr, you should convince
> >> > > >> > > > > the committers that the proposal is viable by addressing their concerns
> >> > > >> > > > > (which I believe you have done), and not by delegating the task of making
> >> > > >> > > > > choices to them. I would suggest that you propose which one (or both) of
> >> > > >> > > > > the hashes should be included, summarize your motivations in this thread
> >> > > >> > > > > and if you don't get any objections for a day or two, call a YES/NO vote
> >> > > >> > > > > for that specific proposal in a separate thread.
> >> > > >> > > > >
> >> > > >> > > > > Thanks,
> >> > > >> > > > >
> >> > > >> > > > > Zoltan
> >> > > >> > > > >
> >> > > >> > > > > On Tue, Jul 2, 2019 at 3:52 AM 俊杰陈 <cj...@gmail.com> wrote:
> >> > > >> > > > >
> >> > > >> > > > > > Any thoughts from other committers and developers?
> >> > > >> > > > > >
> >> > > >> > > > > > I'd like to start a vote first; you could either provide your input here
> >> > > >> > > > > > or on the vote thread.
> >> > > >> > > > > >
> >> > > >> > > > > >
> >> > > >> > > > > >
> >> > > >> > > > > > On Mon, Jul 1, 2019 at 8:20 PM Zoltan Ivanfi <zi@cloudera.com.invalid>
> >> > > >> > > > > > wrote:
> >> > > >> > > > > >
> >> > > >> > > > > > > Hi,
> >> > > >> > > > > > >
> >> > > >> > > > > > > I would like to clarify one point of my previous e-mail: While I reasoned
> >> > > >> > > > > > > that for compressions and encodings we should avoid picking algorithms
> >> > > >> > > > > > > superseded by better ones, I also reasoned that for bloom filters we do not
> >> > > >> > > > > > > necessarily have to be as strict, because a reader with a missing
> >> > > >> > > > > > > implementation will still be able to read data from files that contain
> >> > > >> > > > > > > unsupported bloom filter data structures.
> >> > > >> > > > > > >
> >> > > >> > > > > > > Personally I'm fine with moving forward with the current hash proposal,
> >> > > >> > > > > > > even if the chosen algorithm is not considered to be the best of its class.
> >> > > >> > > > > > >
> >> > > >> > > > > > > Br,
> >> > > >> > > > > > >
> >> > > >> > > > > > > Zoltan
> >> > > >> > > > > > >
> >> > > >> > > > > > > On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <jbapple@apache.org> wrote:
> >> > > >> > > > > > >
> >> > > >> > > > > > > > On 2019/06/28 16:43:23, Ryan Blue <rblue@netflix.com.INVALID> wrote:
> >> > > >> > > > > > > > > I agree with Zoltan. Since we want to ensure compatibility, it would be
> >> > > >> > > > > > > > > better to choose the best option now instead of making everyone support
> >> > > >> > > > > > > > > two options forever.
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > I'd guess there probably isn't a single best option. I suspect there's a
> >> > > >> > > > > > > > tradeoff between ease of implementation and speed, for instance, since I
> >> > > >> > > > > > > > expect it's easy to find an MD5 library in most programming languages and
> >> > > >> > > > > > > > operating systems, yet MD5 is very slow compared to non-cryptographic hash
> >> > > >> > > > > > > > functions designed for speed like xxhash.
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > There's also a significant amount of variability across processor families
> >> > > >> > > > > > > > (64-bit multiply-shift in ARM vs x86-64) or even different versions of the
> >> > > >> > > > > > > > same processor family (CLHash in Haswell vs. Sandy Lake). There are also
> >> > > >> > > > > > > > quality tradeoffs that depend on the average byte length of the input (FNV
> >> > > >> > > > > > > > vs vhash) or how much L1 cache the user wants to use for the hash function
> >> > > >> > > > > > > > (tabulation hashing vs. multiply-shift).
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > To deal with this level of ambiguity, I'd suggest that v1 should include a
> >> > > >> > > > > > > > hash function that works well for certain common environments. As far as I
> >> > > >> > > > > > > > know, murmur and xxhash would both fit that bill.
> >> > > >> > > > > > > >
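
The multiply-shift scheme Jim mentions above is short enough to sketch. The
Python below is only an illustration of the technique for 64-bit integer
keys: multiply by an odd constant modulo 2^64 and keep the top bits. The
function name and the example multiplier are assumptions made here, and
nothing in the thread proposes this as the Parquet hash.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit unsigned wraparound

# Arbitrary odd 64-bit constant, chosen for illustration only.
EXAMPLE_MULTIPLIER = 0x9E3779B97F4A7C15


def multiply_shift(key: int, multiplier: int, out_bits: int) -> int:
    """Hash a 64-bit key down to out_bits bits with one multiply and one shift."""
    if not 1 <= out_bits <= 64:
        raise ValueError("out_bits must be between 1 and 64")
    if multiplier % 2 == 0:
        raise ValueError("multiplier must be odd")
    product = (multiplier * (key & MASK64)) & MASK64  # multiply modulo 2**64
    return product >> (64 - out_bits)                 # keep the high bits


if __name__ == "__main__":
    for key in (1, 2, 3, 1 << 40):
        print(key, multiply_shift(key, EXAMPLE_MULTIPLIER, 10))
```

The whole hash is one multiplication and one shift, which is why its speed
depends so directly on the 64-bit multiplier of the processor it runs on.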
> >> > > >> > > > > > >
> >> > > >> > > > > >
> >> > > >> > > > > >
> >> > > >> > > > > > --
> >> > > >> > > > > > Thanks & Best Regards
> >> > > >> > > > > >
> >> > > >> > > >
> >> > > >> > > >
> >> > > >> > > >
> >> > > >> > > > --
> >> > > >> > > > Thanks & Best Regards
> >> > > >> > > >
> >> > > >> > >
> >> > > >> >
> >> > > >> >
> >> > > >> > --
> >> > > >> > Thanks & Best Regards
> >> > > >> >
> >> > > >>
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Thanks & Best Regards
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > > Thanks & Best Regards
> >> > >
> >> >
> >>
> >>
> >> --
> >> Thanks & Best Regards
> >>
> >
>
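
Zoltan's review above refers to a formula for sizing the filter from a target
false positive rate. The formula in BloomFilter.md is not reproduced in this
thread, so the sketch below uses the textbook result for a classic Bloom
filter: m = -n * ln(p) / (ln 2)^2 bits for n distinct values and false
positive rate p, with k = (m / n) * ln 2 hash functions. The filter layout
proposed for Parquet may call for a different formula, and the function names
are illustrative only.

```python
import math


def optimal_num_bits(num_values: int, false_positive_rate: float) -> int:
    """Bits needed by a classic Bloom filter for the given target rate."""
    if num_values <= 0:
        raise ValueError("num_values must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be strictly between 0 and 1")
    bits = -num_values * math.log(false_positive_rate) / (math.log(2) ** 2)
    return math.ceil(bits)


def optimal_num_hashes(num_values: int, num_bits: int) -> int:
    """Number of hash functions that minimizes the false positive rate."""
    return max(1, round(num_bits / num_values * math.log(2)))


if __name__ == "__main__":
    n = 1_000_000
    for p in (0.1, 0.01, 0.001):
        m = optimal_num_bits(n, p)
        print(p, m, m / n, optimal_num_hashes(n, m))
```

Under this formula a 1% target costs a little under 10 bits per distinct
value, and each further factor of ten in the rate adds roughly 4.8 bits per
value. A recommended default in the spec would let implementations agree on
that tradeoff.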

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.

Hi,

I just wanted to leave a comment on the pull request to update
Encryption.md as well, but to my surprise it is not in master yet despite
the vote for the encryption feature having passed 6 months ago. What are
the plans for merging that? Should it be included in parquet-format 2.7?

Thanks,

Zoltan

On Fri, Jul 5, 2019 at 1:33 PM Zoltan Ivanfi <zi...@cloudera.com> wrote:

> Hi,
>
> I just noticed that yesterday I misunderstood that the Bloom filter is a
> part of the column chunk metadata, when in fact it is only the offset of it
> that is stored there. In this case we definitely need to pay more attention
> to the encryption aspect because it won't happen automatically.
>
> Br,
>
> Zoltan
>
> On Fri, Jul 5, 2019 at 1:09 PM 俊杰陈 <cj...@gmail.com> wrote:
>
>> That would be great, thank you.
>>
>> On Fri, Jul 5, 2019 at 5:40 PM Gidon Gershinsky <gg...@gmail.com> wrote:
>>
>> > Hi Junjie,
>> >
>> > I'd be glad to have a look at the encryption part. Will add my comments
>> > early next week.
>> >
>> > Cheers, Gidon.
>> >
>> > On Fri, Jul 5, 2019 at 12:16 PM 俊杰陈 <cj...@gmail.com> wrote:
>> >
>> > > Sorry, the latest file is
>> > >
>> > >
>> >
>> https://github.com/chenjunjiedada/parquet-format/blob/PARQUET-1617/BloomFilter.md
>> > > .
>> > >
>> > > On Fri, Jul 5, 2019 at 5:14 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > >
>> > > > Sure, please see this PR
>> > > > <https://github.com/apache/parquet-format/pull/140> or update file
>> > here
>> > > > <
>> > >
>> >
>> https://github.com/chenjunjiedada/parquet-format/blob/master/BloomFilter.md
>> > > >
>> > > > .
>> > > >
>> > > > Thanks for reviewing spec.
>> > > >
>> > > > On Thu, Jul 4, 2019 at 11:57 PM Zoltan Ivanfi
>> <zi@cloudera.com.invalid
>> > >
>> > > > wrote:
>> > > >
>> > > >> Hi Junjie,
>> > > >>
>> > > >> I read through the specification and while I support the feature in
>> > > >> general, I find that the documentation may not be detailed enough
>> to
>> > > allow
>> > > >> developers of  different language bindings to implement it.
>> > > Specifically,
>> > > >> the Technical Approach section of the docs is very short and refers
>> > the
>> > > >> reader to two publications for details. I think the specification
>> > would
>> > > >> greatly benefit from including an explanation or a summary of the
>> > > approach
>> > > >> in this section.
>> > > >>
>> > > >> The "Build a Bloom filter" section contains a formula for
>> calculating
>> > > the
>> > > >> optimal filter size for a desired false positive rate, but does not
>> > > >> specify
>> > > >> what false positive rates implementations should target by default
>> and
>> > > >> through what ways should they make it configurable by users. I
>> > > understand
>> > > >> that this may be an intentional omission, since targeting any false
>> > > >> positive rate will result in a specification-compliant result,
>> still I
>> > > >> think it would be best to provide some recommendation for the
>> > different
>> > > >> language bindings.
>> > > >>
>> > > >> Since this feature is getting added after encryption, it should be
>> > > briefly
>> > > >> but explicitly mentioned how it interacts with that (basically
>> that it
>> > > has
>> > > >> to be encrypted, otherwise it would leak sensitive information,
>> but by
>> > > >> placing it inside the column chunk metadata, this is automatically
>> > taken
>> > > >> care of).
>> > > >>
>> > > >> Finally, as a nitpick, I would prefer in-line links to related
>> > materials
>> > > >> instead of numeric references that one must manually look up at the
>> > > bottom
>> > > >> of the page.
>> > > >>
>> > > >> Could you please add these improvements to the specification?
>> > > >>
>> > > >> Thanks,
>> > > >>
>> > > >> Zoltan
>> > > >>
>> > > >> On Wed, Jul 3, 2019 at 4:03 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > > >>
>> > > >> > You are welcome, it 's my honor.
>> > > >> >
>> > > >> > I think the PR <
>> https://github.com/apache/parquet-format/pull/139>
>> > > just
>> > > >> > remove murmur3, that should express what I want.
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> > On Wed, Jul 3, 2019 at 9:53 PM Zoltan Ivanfi
>> > <zi@cloudera.com.invalid
>> > > >
>> > > >> > wrote:
>> > > >> >
>> > > >> > > Hi Junjie,
>> > > >> > >
>> > > >> > > Thanks for the update and also for your endruance in going
>> through
>> > > >> this
>> > > >> > > tedious process in order to add bloom filtering to Parquet.
>> > > >> > >
>> > > >> > > I understand that your proposal is to go forward with xxHash
>> > instead
>> > > >> of
>> > > >> > the
>> > > >> > > eralier murmur3, which you suggest to deprecate. Since the
>> murmur3
>> > > >> hash
>> > > >> > was
>> > > >> > > never released, I think it could be completely removed from the
>> > spec
>> > > >> > > instead of just getting deprecated. What is your opinion on
>> this?
>> > > >> > >
>> > > >> > > Thanks,
>> > > >> > >
>> > > >> > > Zoltan
>> > > >> > >
>> > > >> > > On Wed, Jul 3, 2019 at 3:31 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > > >> > >
>> > > >> > > > I see, thanks for guiding on this.
>> > > >> > > >
>> > > >> > > > Per discussion in this thread and some investigation about
>> > changes
>> > > >> on
>> > > >> > > > current java and c++ implementation, and I think that is not
>> > hard
>> > > to
>> > > >> > > > handle. So I propose to use xxHash (the XXH64 version) as the
>> > > >> default
>> > > >> > > > hash strategy and deprecate previous murmur3 hash.
>> > > >> > > >
>> > > >> > > > I will update vote thread as well to make it clearer to all.
>> > > >> > > >
>> > > >> > > >
>> > > >> > > > On Wed, Jul 3, 2019 at 6:08 PM Zoltan Ivanfi
>> > > >> <zi...@cloudera.com.invalid>
>> > > >> > > > wrote:
>> > > >> > > > >
>> > > >> > > > > Hi Junjie,
>> > > >> > > > >
>> > > >> > > > > I think the vote is ambigous in its current form (can
>> people
>> > > vote
>> > > >> on
>> > > >> > > one
>> > > >> > > > > option only or can they vote on both?) and has a low
>> chance of
>> > > >> > getting
>> > > >> > > > > votes in general because it's not a yes/no question but a
>> > > >> > > > > choose-an-approach question instead. I think most
>> contributors
>> > > >> would
>> > > >> > > > accept
>> > > >> > > > > the hash chosen based on a community discussion but would
>> be
>> > > >> > reluctant
>> > > >> > > to
>> > > >> > > > > make that choice themselves in the form a vote because it
>> > > >> requires a
>> > > >> > > much
>> > > >> > > > > deeper dive into the technical intricacies involved. The
>> > > >> committers
>> > > >> > are
>> > > >> > > > > experienced in the parquet code base but may not be as
>> > > >> experienced in
>> > > >> > > > bloom
>> > > >> > > > > filters as you are.
>> > > >> > > > >
>> > > >> > > > > In my opinion, to get bloom filtering into parquet-mr, you
>> > > should
>> > > >> > > > convince
>> > > >> > > > > the committers that the proposal is viable by addressing
>> their
>> > > >> > concerns
>> > > >> > > > > (which I believe you have done), and not by delegating the
>> > task
>> > > of
>> > > >> > > making
>> > > >> > > > > choices to them. I would suggest that you propose which one
>> > (or
>> > > >> both)
>> > > >> > > of
>> > > >> > > > > the hashes should be included, summarize your motivations
>> in
>> > > this
>> > > >> > > thread
>> > > >> > > > > and if you don't get any objections for a day or two, call
>> a
>> > > >> YES/NO
>> > > >> > > vote
>> > > >> > > > > for that specific proposal in a separate thread.
>> > > >> > > > >
>> > > >> > > > > Thanks,
>> > > >> > > > >
>> > > >> > > > > Zoltan
>> > > >> > > > >
>> > > >> > > > > On Tue, Jul 2, 2019 at 3:52 AM 俊杰陈 <cj...@gmail.com>
>> > wrote:
>> > > >> > > > >
>> > > >> > > > > > Any thoughts from other committers and developers?
>> > > >> > > > > >
>> > > >> > > > > > I 'd like to start a vote firstly, you could either
>> provide
>> > > your
>> > > >> > > input
>> > > >> > > > here
>> > > >> > > > > > or on vote thread.
>> > > >> > > > > >
>> > > >> > > > > >
>> > > >> > > > > >
>> > > >> > > > > > On Mon, Jul 1, 2019 at 8:20 PM Zoltan Ivanfi
>> > > >> > <zi@cloudera.com.invalid
>> > > >> > > >
>> > > >> > > > > > wrote:
>> > > >> > > > > >
>> > > >> > > > > > > Hi,
>> > > >> > > > > > >
>> > > >> > > > > > > I would like to clarify one point of my previous
>> e-mail:
>> > > >> While I
>> > > >> > > > reasoned
>> > > >> > > > > > > that for compressions and encodings we should avoid
>> > picking
>> > > >> > > > algorithms
>> > > >> > > > > > > superseded by better ones, I also reasoned that for
>> bloom
>> > > >> filters
>> > > >> > > we
>> > > >> > > > do
>> > > >> > > > > > not
>> > > >> > > > > > > necessarily have to be as strict, because a reader with
>> > > >> missing
>> > > >> > > > > > > implementation will still be able to read data from
>> files
>> > > that
>> > > >> > > > contain
>> > > >> > > > > > > unsupported bloom filter data structures.
>> > > >> > > > > > >
>> > > >> > > > > > > Personally I'm fine with moving forward with the
>> current
>> > > hash
>> > > >> > > > proposal,
>> > > >> > > > > > > even if the chosen algorithm is not considered to be
>> the
>> > > best
>> > > >> of
>> > > >> > > its
>> > > >> > > > > > class.
>> > > >> > > > > > >
>> > > >> > > > > > > Br,
>> > > >> > > > > > >
>> > > >> > > > > > > Zoltan
>> > > >> > > > > > >
>> > > >> > > > > > > On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <
>> > > >> jbapple@apache.org>
>> > > >> > > > wrote:
>> > > >> > > > > > >
>> > > >> > > > > > > > On 2019/06/28 16:43:23, Ryan Blue
>> > > <rblue@netflix.com.INVALID
>> > > >> >
>> > > >> > > > wrote:
>> > > >> > > > > > > > > I agree with Zoltan. Since we want to ensure
>> > > >> compatibility,
>> > > >> > it
>> > > >> > > > would
>> > > >> > > > > > be
>> > > >> > > > > > > > > better to choose the best option now instead of making
>> > > >> > > > > > > > > everyone support two options forever.
>> > > >> > > > > > > >
>> > > >> > > > > > > > I'd guess there probably isn't a single best option. I
>> > > >> > > > > > > > suspect there's a tradeoff between ease of implementation
>> > > >> > > > > > > > and speed, for instance, since I expect it's easy to find
>> > > >> > > > > > > > an MD5 library in most programming languages and operating
>> > > >> > > > > > > > systems, yet MD5 is very slow compared to non-cryptographic
>> > > >> > > > > > > > hash functions designed for speed like xxhash.
>> > > >> > > > > > > >
>> > > >> > > > > > > > There's also a significant amount of variability across
>> > > >> > > > > > > > processor families (64-bit multiply-shift in ARM vs x86-64)
>> > > >> > > > > > > > or even different versions of the same processor family
>> > > >> > > > > > > > (CLHash in Haswell vs. Sandy Lake). There are also quality
>> > > >> > > > > > > > tradeoffs that depend on the average byte length of the
>> > > >> > > > > > > > input (FNV vs vhash) or how much L1 cache the user wants to
>> > > >> > > > > > > > use for the hash function (tabulation hashing vs.
>> > > >> > > > > > > > multiply-shift).
>> > > >> > > > > > > >
>> > > >> > > > > > > > To deal with this level of ambiguity, I'd suggest that v1
>> > > >> > > > > > > > should include a hash function that works well for certain
>> > > >> > > > > > > > common environments. As far as I know, murmur and xxhash
>> > > >> > > > > > > > would both fit that bill.
>> > > >> > > > > > > >

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.

Hi,

I just noticed that yesterday I mistakenly assumed that the Bloom filter
itself is part of the column chunk metadata, when in fact only its offset
is stored there. In this case we definitely need to pay more attention to
the encryption aspect, because it won't happen automatically.

Br,

Zoltan

On Fri, Jul 5, 2019 at 1:09 PM 俊杰陈 <cj...@gmail.com> wrote:

> That would be great, thank you.
>
> On Fri, Jul 5, 2019 at 5:40 PM Gidon Gershinsky <gg...@gmail.com> wrote:
>
> > Hi Junjie,
> >
> > I'd be glad to have a look at the encryption part. Will add my comments
> > early next week.
> >
> > Cheers, Gidon.
> >
> > On Fri, Jul 5, 2019 at 12:16 PM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > > Sorry, the latest file is
> > > https://github.com/chenjunjiedada/parquet-format/blob/PARQUET-1617/BloomFilter.md
> > > .
> > >

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.

That would be great, thank you.

On Fri, Jul 5, 2019 at 5:40 PM Gidon Gershinsky <gg...@gmail.com> wrote:

> Hi Junjie,
>
> I'd be glad to have a look at the encryption part. Will add my comments
> early next week.
>
> Cheers, Gidon.
>
> On Fri, Jul 5, 2019 at 12:16 PM 俊杰陈 <cj...@gmail.com> wrote:
>
> > Sorry, the latest file is
> > https://github.com/chenjunjiedada/parquet-format/blob/PARQUET-1617/BloomFilter.md
> > .
> >


-- 
Thanks & Best Regards

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Gidon Gershinsky <gg...@gmail.com>.

Hi Junjie,

I'd be glad to have a look at the encryption part. Will add my comments
early next week.

Cheers, Gidon.

On Fri, Jul 5, 2019 at 12:16 PM 俊杰陈 <cj...@gmail.com> wrote:

> Sorry, the latest file is
>
> https://github.com/chenjunjiedada/parquet-format/blob/PARQUET-1617/BloomFilter.md
> .
>
> On Fri, Jul 5, 2019 at 5:14 PM 俊杰陈 <cj...@gmail.com> wrote:
>
> > Sure, please see this PR
> > <https://github.com/apache/parquet-format/pull/140> or the updated file here
> > <https://github.com/chenjunjiedada/parquet-format/blob/master/BloomFilter.md>
> > .
> >
> > Thanks for reviewing the spec.
> >
> > On Thu, Jul 4, 2019 at 11:57 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > wrote:
> >
> >> Hi Junjie,
> >>
> >> I read through the specification and while I support the feature in
> >> general, I find that the documentation may not be detailed enough to allow
> >> developers of different language bindings to implement it. Specifically,
> >> the Technical Approach section of the docs is very short and refers the
> >> reader to two publications for details. I think the specification would
> >> greatly benefit from including an explanation or a summary of the approach
> >> in this section.
> >>
> >> The "Build a Bloom filter" section contains a formula for calculating the
> >> optimal filter size for a desired false positive rate, but does not specify
> >> what false positive rates implementations should target by default and
> >> in what ways they should make it configurable by users. I understand
> >> that this may be an intentional omission, since targeting any false
> >> positive rate will result in a specification-compliant result; still, I
> >> think it would be best to provide some recommendation for the different
> >> language bindings.
> >>
> >> Since this feature is getting added after encryption, it should be briefly
> >> but explicitly mentioned how it interacts with that (basically that it has
> >> to be encrypted, otherwise it would leak sensitive information, but by
> >> placing it inside the column chunk metadata, this is automatically taken
> >> care of).
> >>
> >> Finally, as a nitpick, I would prefer in-line links to related materials
> >> instead of numeric references that one must manually look up at the bottom
> >> of the page.
> >>
> >> Could you please add these improvements to the specification?
> >>
> >> Thanks,
> >>
> >> Zoltan
> >>
> >> On Wed, Jul 3, 2019 at 4:03 PM 俊杰陈 <cj...@gmail.com> wrote:
> >>
> >> > You are welcome, it's my honor.
> >> >
> >> > I think the PR <https://github.com/apache/parquet-format/pull/139> just
> >> > removes murmur3; that should express what I want.
> >> >
> >> > On Wed, Jul 3, 2019 at 9:53 PM Zoltan Ivanfi <zi@cloudera.com.invalid>
> >> > wrote:
> >> >
> >> > > Hi Junjie,
> >> > >
> >> > > Thanks for the update and also for your endurance in going through this
> >> > > tedious process in order to add bloom filtering to Parquet.
> >> > >
> >> > > I understand that your proposal is to go forward with xxHash instead of
> >> > > the earlier murmur3, which you suggest deprecating. Since the murmur3
> >> > > hash was never released, I think it could be completely removed from the
> >> > > spec instead of just getting deprecated. What is your opinion on this?
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Zoltan
> >> > >
> >> > > On Wed, Jul 3, 2019 at 3:31 PM 俊杰陈 <cj...@gmail.com> wrote:
> >> > >
> >> > > > I see, thanks for the guidance on this.
> >> > > >
> >> > > > Per the discussion in this thread and some investigation of the changes
> >> > > > needed in the current Java and C++ implementations, I think that is not
> >> > > > hard to handle. So I propose to use xxHash (the XXH64 version) as the
> >> > > > default hash strategy and deprecate the previous murmur3 hash.
> >> > > >
> >> > > > I will update the vote thread as well to make it clearer to all.
> >> > > >
> >> > > > On Wed, Jul 3, 2019 at 6:08 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> >> > > > wrote:
> >> > > > >
> >> > > > > Hi Junjie,
> >> > > > >
> >> > > > > I think the vote is ambiguous in its current form (can people vote on
> >> > > > > one option only or can they vote on both?) and has a low chance of
> >> > > > > getting votes in general because it's not a yes/no question but a
> >> > > > > choose-an-approach question instead. I think most contributors would
> >> > > > > accept the hash chosen based on a community discussion but would be
> >> > > > > reluctant to make that choice themselves in the form of a vote because
> >> > > > > it requires a much deeper dive into the technical intricacies involved.
> >> > > > > The committers are experienced in the parquet code base but may not be
> >> > > > > as experienced in bloom filters as you are.
> >> > > > >
> >> > > > > In my opinion, to get bloom filtering into parquet-mr, you should
> >> > > > > convince the committers that the proposal is viable by addressing their
> >> > > > > concerns (which I believe you have done), and not by delegating the task
> >> > > > > of making choices to them. I would suggest that you propose which one
> >> > > > > (or both) of the hashes should be included, summarize your motivations
> >> > > > > in this thread and if you don't get any objections for a day or two,
> >> > > > > call a YES/NO vote for that specific proposal in a separate thread.
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > > Zoltan
> >> > > > >
> >> > > > > On Tue, Jul 2, 2019 at 3:52 AM 俊杰陈 <cj...@gmail.com> wrote:
> >> > > > >
> >> > > > > > Any thoughts from other committers and developers?
> >> > > > > >
> >> > > > > > I'd like to start a vote first; you could either provide your input
> >> > > > > > here or on the vote thread.
> >> > > > > >
> >> > > > > > On Mon, Jul 1, 2019 at 8:20 PM Zoltan Ivanfi <zi@cloudera.com.invalid>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Hi,
> >> > > > > > >
> >> > > > > > > I would like to clarify one point of my previous e-mail: While I
> >> > > > > > > reasoned that for compressions and encodings we should avoid picking
> >> > > > > > > algorithms superseded by better ones, I also reasoned that for bloom
> >> > > > > > > filters we do not necessarily have to be as strict, because a reader
> >> > > > > > > with a missing implementation will still be able to read data from
> >> > > > > > > files that contain unsupported bloom filter data structures.
> >> > > > > > >
> >> > > > > > > Personally I'm fine with moving forward with the current hash
> >> > > > > > > proposal, even if the chosen algorithm is not considered to be the
> >> > > > > > > best of its class.
> >> > > > > > >
> >> > > > > > > Br,
> >> > > > > > >
> >> > > > > > > Zoltan
> >> > > > > > >
> >> > > > > > > On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <jbapple@apache.org>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > On 2019/06/28 16:43:23, Ryan Blue <rblue@netflix.com.INVALID>
> >> > > > > > > > wrote:
> >> > > > > > > > > I agree with Zoltan. Since we want to ensure compatibility, it
> >> > > > > > > > > would be better to choose the best option now instead of making
> >> > > > > > > > > everyone support two options forever.
> >> > > > > > > >
> >> > > > > > > > I'd guess there probably isn't a single best option. I suspect
> >> > > > > > > > there's a tradeoff between ease of implementation and speed, for
> >> > > > > > > > instance, since I expect it's easy to find an MD5 library in most
> >> > > > > > > > programming languages and operating systems, yet MD5 is very slow
> >> > > > > > > > compared to non-cryptographic hash functions designed for speed
> >> > > > > > > > like xxhash.
> >> > > > > > > >
> >> > > > > > > > There's also a significant amount of variability across processor
> >> > > > > > > > families (64-bit multiply-shift in ARM vs x86-64) or even different
> >> > > > > > > > versions of the same processor family (CLHash in Haswell vs. Sandy
> >> > > > > > > > Lake). There are also quality tradeoffs that depend on the average
> >> > > > > > > > byte length of the input (FNV vs vhash) or how much L1 cache the
> >> > > > > > > > user wants to use for the hash function (tabulation hashing vs.
> >> > > > > > > > multiply-shift).
> >> > > > > > > >
> >> > > > > > > > To deal with this level of ambiguity, I'd suggest that v1 should
> >> > > > > > > > include a hash function that works well for certain common
> >> > > > > > > > environments. As far as I know, murmur and xxhash would both fit
> >> > > > > > > > that bill.
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Thanks & Best Regards
> >> > > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Thanks & Best Regards
> >> > > >
> >> > >
> >> >
> >> >
> >> > --
> >> > Thanks & Best Regards
> >> >
> >>
> >
> >
> > --
> > Thanks & Best Regards
> >
>
>
> --
> Thanks & Best Regards
>

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.

Sorry, the latest file is
https://github.com/chenjunjiedada/parquet-format/blob/PARQUET-1617/BloomFilter.md
.

On Fri, Jul 5, 2019 at 5:14 PM 俊杰陈 <cj...@gmail.com> wrote:

> Sure, please see this PR
> <https://github.com/apache/parquet-format/pull/140> or the updated file here
> <https://github.com/chenjunjiedada/parquet-format/blob/master/BloomFilter.md>
> .
>
> Thanks for reviewing the spec.


-- 
Thanks & Best Regards

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.

Sure, please see this PR <https://github.com/apache/parquet-format/pull/140> or
the updated file here
<https://github.com/chenjunjiedada/parquet-format/blob/master/BloomFilter.md>
.

Thanks for reviewing the spec.

On Thu, Jul 4, 2019 at 11:57 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi Junjie,
>
> I read through the specification and while I support the feature in
> general, I find that the documentation may not be detailed enough to allow
> developers of different language bindings to implement it. Specifically,
> the Technical Approach section of the docs is very short and refers the
> reader to two publications for details. I think the specification would
> greatly benefit from including an explanation or a summary of the approach
> in this section.
>
> The "Build a Bloom filter" section contains a formula for calculating the
> optimal filter size for a desired false positive rate, but does not specify
> what false positive rate implementations should target by default or
> how they should make it configurable by users. I understand
> that this may be an intentional omission, since targeting any false
> positive rate will result in a specification-compliant result, but I still
> think it would be best to provide some recommendation for the different
> language bindings.
>
> Since this feature is getting added after encryption, it should be briefly
> but explicitly mentioned how it interacts with that (basically that it has
> to be encrypted, otherwise it would leak sensitive information, but by
> placing it inside the column chunk metadata, this is automatically taken
> care of).
>
> Finally, as a nitpick, I would prefer in-line links to related materials
> instead of numeric references that one must manually look up at the bottom
> of the page.
>
> Could you please add these improvements to the specification?
>
> Thanks,
>
> Zoltan


-- 
Thanks & Best Regards

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.

Hi Junjie,

I read through the specification and while I support the feature in
general, I find that the documentation may not be detailed enough to allow
developers of different language bindings to implement it. Specifically,
the Technical Approach section of the docs is very short and refers the
reader to two publications for details. I think the specification would
greatly benefit from including an explanation or a summary of the approach
in this section.

The "Build a Bloom filter" section contains a formula for calculating the
optimal filter size for a desired false positive rate, but does not specify
what false positive rate implementations should target by default or
how they should make it configurable by users. I understand
that this may be an intentional omission, since targeting any false
positive rate will result in a specification-compliant result, but I still
think it would be best to provide some recommendation for the different
language bindings.
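
To make the sizing question concrete, here is a minimal sketch. It assumes the
standard approximation for a Bloom filter that sets k bits per inserted value,
fpp = (1 - exp(-k * n / m)) ** k, solved for the number of bits m. The function
names, the choice k = 8, and the example target rates are illustrative
assumptions and are not taken from the specification text.

```python
import math


def optimal_num_bits(ndv: int, fpp: float, k: int = 8) -> int:
    """Bits needed so `ndv` distinct values give roughly `fpp` false positives.

    Solves fpp = (1 - exp(-k * ndv / m)) ** k for m (k is an assumed default).
    """
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return math.ceil(-k * ndv / math.log(1.0 - fpp ** (1.0 / k)))


def expected_fpp(ndv: int, num_bits: int, k: int = 8) -> float:
    """Inverse of optimal_num_bits: approximate false positive rate."""
    return (1.0 - math.exp(-k * ndv / num_bits)) ** k


if __name__ == "__main__":
    # Hypothetical target rates, only to show how the size scales.
    for target in (0.1, 0.01, 0.001):
        bits = optimal_num_bits(1_000_000, target)
        print(f"fpp={target}: {bits} bits, "
              f"check={expected_fpp(1_000_000, bits):.4f}")
```

Under these assumptions, one million distinct values at a 1% target need
roughly 9.7 million bits (about 1.2 MB), which is the kind of number a
recommended default would pin down for every language binding.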

Since this feature is getting added after encryption, it should be briefly
but explicitly mentioned how it interacts with that (basically that it has
to be encrypted, otherwise it would leak sensitive information, but by
placing it inside the column chunk metadata, this is automatically taken
care of).

Finally, as a nitpick, I would prefer in-line links to related materials
instead of numeric references that one must manually look up at the bottom
of the page.

Could you please add these improvements to the specification?

Thanks,

Zoltan

On Wed, Jul 3, 2019 at 4:03 PM 俊杰陈 <cj...@gmail.com> wrote:

> You are welcome, it's my honor.
>
> I think the PR <https://github.com/apache/parquet-format/pull/139> just
> removes murmur3, which should express what I want.

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.

You are welcome, it's my honor.

I think the PR <https://github.com/apache/parquet-format/pull/139> just
removes murmur3, which should express what I want.




On Wed, Jul 3, 2019 at 9:53 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi Junjie,
>
> Thanks for the update and also for your endurance in going through this
> tedious process in order to add bloom filtering to Parquet.
>
> I understand that your proposal is to go forward with xxHash instead of the
> earlier murmur3, which you suggest deprecating. Since the murmur3 hash was
> never released, I think it could be completely removed from the spec
> instead of just getting deprecated. What is your opinion on this?
>
> Thanks,
>
> Zoltan


-- 
Thanks & Best Regards

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.

Hi Junjie,

Thanks for the update and also for your endurance in going through this
tedious process in order to add bloom filtering to Parquet.

I understand that your proposal is to go forward with xxHash instead of the
earlier murmur3, which you suggest deprecating. Since the murmur3 hash was
never released, I think it could be completely removed from the spec
instead of just getting deprecated. What is your opinion on this?

Thanks,

Zoltan

On Wed, Jul 3, 2019 at 3:31 PM 俊杰陈 <cj...@gmail.com> wrote:

> I see, thanks for guiding on this.
>
> Per the discussion in this thread, and after some investigation of the
> changes needed in the current Java and C++ implementations, I think this
> is not hard to handle. So I propose to use xxHash (the XXH64 version) as
> the default hash strategy and deprecate the previous murmur3 hash.
>
> I will update the vote thread as well to make it clearer to all.
>

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.

I see, thanks for guiding on this.

Per the discussion in this thread, and after some investigation of the
changes needed in the current Java and C++ implementations, I think this is
not hard to handle. So I propose to use xxHash (the XXH64 version) as the
default hash strategy and deprecate the previous murmur3 hash.

I will update the vote thread as well to make it clearer to all.


On Wed, Jul 3, 2019 at 6:08 PM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:
>
> Hi Junjie,
>
> I think the vote is ambiguous in its current form (can people vote on one
> option only or can they vote on both?) and has a low chance of getting
> votes in general because it's not a yes/no question but a
> choose-an-approach question instead. I think most contributors would accept
> the hash chosen based on a community discussion but would be reluctant to
> make that choice themselves in the form of a vote because it requires a much
> deeper dive into the technical intricacies involved. The committers are
> experienced in the parquet code base but may not be as experienced in bloom
> filters as you are.
>
> In my opinion, to get bloom filtering into parquet-mr, you should convince
> the committers that the proposal is viable by addressing their concerns
> (which I believe you have done), and not by delegating the task of making
> choices to them. I would suggest that you propose which one (or both) of
> the hashes should be included, summarize your motivations in this thread
> and if you don't get any objections for a day or two, call a YES/NO vote
> for that specific proposal in a separate thread.
>
> Thanks,
>
> Zoltan



-- 
Thanks & Best Regards

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.

Hi Junjie,

I think the vote is ambiguous in its current form (can people vote on one
option only or can they vote on both?) and has a low chance of getting
votes in general because it's not a yes/no question but a
choose-an-approach question instead. I think most contributors would accept
the hash chosen based on a community discussion but would be reluctant to
make that choice themselves in the form of a vote because it requires a much
deeper dive into the technical intricacies involved. The committers are
experienced in the parquet code base but may not be as experienced in bloom
filters as you are.

In my opinion, to get bloom filtering into parquet-mr, you should convince
the committers that the proposal is viable by addressing their concerns
(which I believe you have done), and not by delegating the task of making
choices to them. I would suggest that you propose which one (or both) of
the hashes should be included, summarize your motivations in this thread
and if you don't get any objections for a day or two, call a YES/NO vote
for that specific proposal in a separate thread.

Thanks,

Zoltan

On Tue, Jul 2, 2019 at 3:52 AM 俊杰陈 <cj...@gmail.com> wrote:

> Any thoughts from other committers and developers?
>
> I'd like to start a vote first; you could either provide your input here
> or on the vote thread.

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.

Any thoughts from other committers and developers?

I'd like to start a vote first; you could either provide your input here
or on the vote thread.



On Mon, Jul 1, 2019 at 8:20 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi,
>
> I would like to clarify one point of my previous e-mail: While I reasoned
> that for compressions and encodings we should avoid picking algorithms
> superseded by better ones, I also reasoned that for bloom filters we do not
> necessarily have to be as strict, because a reader with missing
> implementation will still be able to read data from files that contain
> unsupported bloom filter data structures.
>
> Personally I'm fine with moving forward with the current hash proposal,
> even if the chosen algorithm is not considered to be the best of its class.
>
> Br,
>
> Zoltan


-- 
Thanks & Best Regards

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.

Hi,

I would like to clarify one point of my previous e-mail: While I reasoned
that for compressions and encodings we should avoid picking algorithms
superseded by better ones, I also reasoned that for bloom filters we do not
necessarily have to be as strict, because a reader with missing
implementation will still be able to read data from files that contain
unsupported bloom filter data structures.

Personally I'm fine with moving forward with the current hash proposal,
even if the chosen algorithm is not considered to be the best of its class.

Br,

Zoltan

On Sun, Jun 30, 2019 at 11:02 PM Jim Apple <jb...@apache.org> wrote:

> On 2019/06/28 16:43:23, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> > I agree with Zoltan. Since we want to ensure compatibility, it would be
> > better to choose the best option now instead of making everyone support
> > two options forever.
>
> I'd guess there probably isn't a single best option. I suspect there's a
> tradeoff between ease of implementation and speed, for instance, since I
> expect it's easy to find an MD5 library in most programming languages and
> operating systems, yet MD5 is very slow compared to non-cryptographic hash
> functions designed for speed like xxhash.
>
> There's also a significant amount of variability across processor families
> (64-bit multiply-shift in ARM vs x86-64) or even different versions of the
> same processor family (CLHash in Haswell vs. Sandy Lake). There are also
> quality tradeoffs that depend on the average byte length of the input (FNV
> vs vhash) or how much L1 cache the user wants to use for the hash function
> (tabulation hashing vs. multiply-shift).
>
> To deal with this level of ambiguity, I'd suggest that v1 should include a
> hash function that works well for certain common environments. As far as I
> know, murmur and xxhash would both fit that bill.
>

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Jim Apple <jb...@apache.org>.

On 2019/06/28 16:43:23, Ryan Blue <rb...@netflix.com.INVALID> wrote: 
> I agree with Zoltan. Since we want to ensure compatibility, it would be
> better to choose the best option now instead of making everyone support two
> options forever.

I'd guess there probably isn't a single best option. I suspect there's a tradeoff between ease of implementation and speed, for instance, since I expect it's easy to find an MD5 library in most programming languages and operating systems, yet MD5 is very slow compared to non-cryptographic hash functions designed for speed like xxhash.

There's also a significant amount of variability across processor families (64-bit multiply-shift in ARM vs x86-64) or even different versions of the same processor family (CLHash in Haswell vs. Sandy Lake). There are also quality tradeoffs that depend on the average byte length of the input (FNV vs vhash) or how much L1 cache the user wants to use for the hash function (tabulation hashing vs. multiply-shift).

To deal with this level of ambiguity, I'd suggest that v1 should include a hash function that works well for certain common environments. As far as I know, murmur and xxhash would both fit that bill.
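
To illustrate why the choice of hash function is also a compatibility question and not only a speed question, here is a minimal Python sketch. It is a toy filter, not the Parquet Bloom filter layout: the class name ToyBloomFilter, the sizes, the double-hashing probe scheme, and the two example hashes (md5_64 and fnv1a_64) are all assumptions chosen for illustration. The bit positions that get set depend entirely on the hash, so a reader has to apply the same hash function the writer used.

```python
import hashlib

MASK64 = (1 << 64) - 1


def md5_64(data: bytes) -> int:
    """First 8 bytes of MD5: easy to find in most environments, but slow."""
    return int.from_bytes(hashlib.md5(data).digest()[:8], "little")


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: a simple non-cryptographic hash."""
    h = 0xCBF29CE484222325  # FNV offset basis
    for b in data:
        h ^= b
        h = (h * 0x100000001B3) & MASK64  # FNV prime, wrapped to 64 bits
    return h


class ToyBloomFilter:
    """Illustrative filter; the hash function is a constructor parameter."""

    def __init__(self, hash_fn, num_bits=1024, num_probes=4):
        self.hash_fn = hash_fn
        self.num_bits = num_bits
        self.num_probes = num_probes
        self.bits = 0  # bitset stored in a Python int

    def _positions(self, value: bytes):
        # Derive several probe positions from one 64-bit hash.
        h = self.hash_fn(value)
        h1, h2 = h & 0xFFFFFFFF, (h >> 32) | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_probes)]

    def insert(self, value: bytes) -> None:
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value: bytes) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(value))


if __name__ == "__main__":
    writer = ToyBloomFilter(fnv1a_64)
    for i in range(50):
        writer.insert(str(i).encode())
    # Same hash on the read side: inserted values are always reported.
    print(all(writer.might_contain(str(i).encode()) for i in range(50)))
```

If the reader were constructed with md5_64 while reusing the writer's bits, the probe positions would no longer line up and the filter could wrongly report values as absent, which is why the spec has to pin down the hash.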

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

I agree with Zoltan. Since we want to ensure compatibility, it would be
better to choose the best option now instead of making everyone support two
options forever.

In terms of next steps, I think we need a clean write-up of the design and
changes, and then a VOTE thread that points to them. The write-up is already
done, but needs to be updated for xxHash, right?

rb

On Fri, Jun 28, 2019 at 3:59 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi,
>
> I think the concern was not about the lack of any specific hash algorithm,
> but about the choice of the one that got added. Generally for compressions
> and encodings, we are very picky about which ones to add to the specification,
> because it has to be implemented in every language binding. This is not
> only a considerable effort, but is also error-prone (see LZ4 for an
> example, which was added to both the Java and the C++ implementation of
> Parquet, yet they are incompatible with each other). And lack of support is
> not only a minor annoyance in this case: if one is forced to use an older
> reader that does not support the new encoding yet (or a language binding
> that does not support it at all), the data simply cannot be read.
>
> For this reason, if we already know that an algorithm is suboptimal and
> there are better ones available, we prefer not to add it at all. However, I
> don't think that the reasoning above applies here, because the bloom
> filter is optional metadata and the data is perfectly readable without
> supporting it. Even if it is very likely that we will want to move to a
> better hash algorithm later, we already know that we won't have to keep
> supporting the current one forever, since removing support is not a
> breaking change (at least functionally, performance-wise it will result in
> a regression for old files).
>
> Br,
>
> Zoltan


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.

Hi,

I think the concern was not about the lack of any specific hash algorithm,
but about the choice of the one that got added. Generally for compressions
and encodings, we are very picky about which ones to add to the specification,
because it has to be implemented in every language binding. This is not
only a considerable effort, but is also error-prone (see LZ4 for an
example, which was added to both the Java and the C++ implementation of
Parquet, yet they are incompatible with each other). And lack of support is
not only a minor annoyance in this case: if one is forced to use an older
reader that does not support the new encoding yet (or a language binding
that does not support it at all), the data simply cannot be read.

For this reason, if we already know that an algorithm is suboptimal and
there are better ones available, we prefer not to add it at all. However, I
don't think that the reasoning above applies here, because the bloom
filter is optional metadata and the data is perfectly readable without
supporting it. Even if it is very likely that we will want to move to a
better hash algorithm later, we already know that we won't have to keep
supporting the current one forever, since removing support is not a
breaking change (at least functionally, performance-wise it will result in
a regression for old files).
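
The point about optional metadata can be sketched as follows. This is a hypothetical reader written in Python, not parquet-mr or parquet-cpp code; the names SUPPORTED_HASHES, ColumnChunk, write_chunk, may_contain and lookup are assumptions made for illustration, and Python's built-in hash() merely stands in for a real hash function. A reader that does not recognise a filter's hash algorithm ignores the filter and scans the data, so results stay correct and only the pruning is lost.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hash algorithms this hypothetical reader implements.
# Python's built-in hash() is a stand-in used purely for illustration.
SUPPORTED_HASHES: Dict[str, Callable[[int], int]] = {"toy_hash": hash}


@dataclass
class ColumnChunk:
    values: List[int]                  # the data itself, always readable
    filter_hash: Optional[str] = None  # name of the hash used by the filter
    filter_bits: int = 0               # single-probe bitset, for brevity
    num_bits: int = 64


def write_chunk(values, hash_name, hash_fn, num_bits=64) -> ColumnChunk:
    """Writer side: store the data plus an optional filter over it."""
    bits = 0
    for v in values:
        bits |= 1 << (hash_fn(v) % num_bits)
    return ColumnChunk(list(values), hash_name, bits, num_bits)


def may_contain(chunk: ColumnChunk, value: int) -> bool:
    hash_fn = SUPPORTED_HASHES.get(chunk.filter_hash)
    if hash_fn is None:
        # No filter, or an unsupported hash: cannot prune, so answer "maybe".
        return True
    return bool((chunk.filter_bits >> (hash_fn(value) % chunk.num_bits)) & 1)


def lookup(chunk: ColumnChunk, value: int) -> bool:
    if not may_contain(chunk, value):
        return False               # pruned without touching the data
    return value in chunk.values   # fall back to reading the data


if __name__ == "__main__":
    known = write_chunk([1, 2, 3], "toy_hash", hash)
    unknown = write_chunk([1, 2, 3], "future_hash", lambda v: v * 31)
    print(lookup(known, 2), lookup(unknown, 2), lookup(unknown, 10))
```

Dropping "toy_hash" from SUPPORTED_HASHES changes which lookups can be pruned but never changes their answers, which matches the observation that removing support is a performance regression and not a functional break.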

Br,

Zoltan

On Fri, Jun 28, 2019 at 11:12 AM 俊杰陈 <cj...@gmail.com> wrote:

> Thanks,
>
> The naming issue has been fixed. I also created a PR
> <https://github.com/apache/parquet-format/pull/139> to add xxHash as an
> alternative option, to address Todd's concern. Does that resolve the
> concerns? If so, we can create a VOTE on the spec (the bloom filter diff in
> the parquet-format repo).

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.

Thanks,

The naming issue has been fixed. I also created a PR
<https://github.com/apache/parquet-format/pull/139> to add xxHash as an
alternative option, to address Todd's concern. Does that resolve the
concerns? If so, we can create a VOTE on the spec (the bloom filter diff in
the parquet-format repo).

On Fri, Jun 28, 2019 at 4:03 PM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> Ryan has a valid point here. Once the Bloom filters get released, it won't
> be as easy anymore to change it because we will break an already released
> API.
>
> There was a related discussion a while ago:
>
> https://lists.apache.org/thread.html/027e9d73093df84448e07d8514b9d669906cd5b83ae59a76f38aaa55@%3Cdev.parquet.apache.org%3E
>
> My suggestion would be to create a VOTE to formally adopt the vote and fix
> the remaining concerns. For example, the one that Zoltan raised in the list
> above.
>
> Cheers, Fokko


-- 
Thanks & Best Regards

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.

Ryan has a valid point here. Once the Bloom filters get released, it won't
be as easy anymore to change it because we will break an already released
API.

There was a related discussion a while ago:
https://lists.apache.org/thread.html/027e9d73093df84448e07d8514b9d669906cd5b83ae59a76f38aaa55@%3Cdev.parquet.apache.org%3E

My suggestion would be to create a VOTE to formally adopt the vote and fix
the remaining concerns. For example, the one that Zoltan raised in the list
above.

Cheers, Fokko

Op vr 28 jun. 2019 om 01:13 schreef Jim Apple <jb...@apache.org>:

> > I think we need to have a vote on the bloom filter
> > structures first. We need to make sure that the community has vetted the
> > design and is comfortable with adding this, just like we did with the
> > Parquet encryption design and the page index design.
>
> Thank you for the note, Ryan. Based on my experience on Apache Impala, I
> was under the impression that a git commit signified at least a temporary
> agreement that the commit should make it into a future release. I
> understand you to be saying that in parquet-format, a vote on format
> additions is standard, whether or not a commit made it into HEAD.
>
> There have been previous discussions of Bloom filters in the pull
> requests, on this list, and in live videochat meetups (from quite a while
> ago). In your opinion, should we start a new discussion, or start a [VOTE]
> thread with pointers to the old discussions, or some third option?
>

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Jim Apple <jb...@apache.org>.

> I think we need to have a vote on the bloom filter
> structures first. We need to make sure that the community has vetted the
> design and is comfortable with adding this, just like we did with the
> Parquet encryption design and the page index design.

Thank you for the note, Ryan. Based on my experience on Apache Impala, I was under the impression that a git commit signified at least a temporary agreement that the commit should make it into a future release. I understand you to be saying that in parquet-format, a vote on format additions is standard, whether or not a commit made it into HEAD.

There have been previous discussions of Bloom filters in the pull requests, on this list, and in live videochat meetups (from quite a while ago). In your opinion, should we start a new discussion, or start a [VOTE] thread with pointers to the old discussions, or some third option?

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

If the goal of a format release is to get the bloom filter structures into
a release, then I think we need to have a vote on the bloom filter
structures first. We need to make sure that the community has vetted the
design and is comfortable with adding this, just like we did with the
Parquet encryption design and the page index design.

On Thu, Jun 27, 2019 at 9:52 AM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> If there are no other volunteers, I can cut the branch and prepare RC1
> tomorrow morning.
>
> Cheers, Fokko

-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.

If there are no other volunteers, I can cut the branch and prepare RC1
tomorrow morning.

Cheers, Fokko


Op do 27 jun. 2019 om 17:38 schreef Jim Apple <jb...@apache.org>:

> > Looks like we don't have any blocking issues, since there has been no
> > update in the Jira (https://jira.apache.org/jira/browse/PARQUET-1608) for
> > about a week. Can we start a release vote?
>
> Even _starting_ a release vote appears to require a committer to do some
> prep work first.
>
> https://parquet.apache.org/documentation/how-to-release/
>
> Any committer volunteers?
>

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Jim Apple <jb...@apache.org>.

> It looks like we don't have any blocking issues, since there has been no
> update in the Jira (https://jira.apache.org/jira/browse/PARQUET-1608) for
> about one week. Can we start a release vote?

Even _starting_ a release vote appears to require a committer to do some prep work first.

https://parquet.apache.org/documentation/how-to-release/

Any committer volunteers?

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by 俊杰陈 <cj...@gmail.com>.

Hi

It looks like we don't have any blocking issues, since there has been no
update in the Jira (https://jira.apache.org/jira/browse/PARQUET-1608) for
about one week. Can we start a release vote?

On Tue, Jun 25, 2019 at 6:04 AM Jim Apple <jb...@apache.org> wrote:

> > Actually there is a repo at https://github.com/apache/parquet-testing that
> > may be used for making sure that the Java, C++ and other implementations
> > are interoperable.
>
> Ah, yes, and it looks like a Bloom filter data file is present:
>
>
> https://github.com/apache/parquet-testing/commit/48a657ca05eb308539f3f00c698e8bb5185d9b38
>
> Thanks for the reminder!
>
> > But in the context of a parquet-format release I don't
> > think we need tests for the interoperability of implementations, because
> > parquet-format is only about the specification which is independent of
> > language binding.
>
> Agreed.
>
> It is nice to have the implementations to help provide evidence that the
> specification is not self-contradictory or impossible to express in some
> way.
>


-- 
Thanks & Best Regards

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Jim Apple <jb...@apache.org>.

> Actually there is a repo at https://github.com/apache/parquet-testing that
> may be used for making sure that the Java, C++ and other implementations
> are interoperable.

Ah, yes, and it looks like a Bloom filter data file is present:

https://github.com/apache/parquet-testing/commit/48a657ca05eb308539f3f00c698e8bb5185d9b38

Thanks for the reminder!

> But in the context of a parquet-format release I don't
> think we need tests for the interoperability of implementations, because
> parquet-format is only about the specification which is independent of
> language binding.

Agreed.

It is nice to have the implementations to help provide evidence that the specification is not self-contradictory or impossible to express in some way.

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.

Hi,

Actually there is a repo at https://github.com/apache/parquet-testing that
may be used for making sure that the Java, C++ and other implementations
are interoperable. But in the context of a parquet-format release I don't
think we need tests for the interoperability of implementations, because
parquet-format is only about the specification which is independent of
language binding.

Br,

Zoltan

On Thu, Jun 20, 2019 at 5:49 PM Jim Apple <jb...@apache.org> wrote:

> > Regarding your question, I don't have an opinion on 1, but I think 2 is
> > very important. In the end, the parquet format is nothing more than a
> > couple of Thrift definitions. I would suggest writing good unit tests to
> > ensure that the bloom filters behave in the same manner.
>
> I agree, it is important. The Bloom filter tests can be run cross-language
> now, but the running is manual. IIRC we were stuck on not having a repo for
> the binaries to live in. I think such a repo would be valuable for the
> format in general, not just for the Bloom filters, and the magnitude of
> that is part of what makes me suggest a release even though that repo (and
> accompanying tests) doesn't exist yet.
>
> > The release process can be done by a committer, but it also requires the
> > involvement of at least 3 PMC members [1] to cast the binding votes needed
> > to get the release signed off.
>
> Yes, indeed. A committer must be involved even before the "[VOTE]" thread.
>

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by Jim Apple <jb...@apache.org>.

> Regarding your question, I don't have an opinion on 1, but I think 2 is
> very important. In the end, the parquet format is nothing more than a
> couple of Thrift definitions. I would suggest writing good unit tests to
> ensure that the bloom filters behave in the same manner.

I agree, it is important. The Bloom filter tests can be run cross-language now, but the running is manual. IIRC we were stuck on not having a repo for the binaries to live in. I think such a repo would be valuable for the format in general, not just for the Bloom filters, and the magnitude of that is part of what makes me suggest a release even though that repo (and accompanying tests) doesn't exist yet.

> The release process can be done by a committer, but it also requires the
> involvement of at least 3 PMC members [1] to cast the binding votes needed to
> get the release signed off.

Yes, indeed. A committer must be involved even before the "[VOTE]" thread.

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.

Good point Jim,

Personally, I'm also looking forward to the next release of Apache Parquet
format.

I took the liberty of creating an umbrella ticket to get an overview of the
blockers that we want to get in the 2.7 release. The ticket:
https://jira.apache.org/jira/browse/PARQUET-1608

Regarding your question, I don't have an opinion on 1, but I think 2 is
very important. In the end, the parquet format is nothing more than a
couple of Thrift definitions. I would suggest writing good unit tests to
ensure that the bloom filters behave in the same manner.

The release process can be done by a committer, but it also requires the
involvement of at least 3 PMC members [1] to cast the binding votes needed to
get the release signed off.

Cheers, Fokko

[1] https://www.apache.org/foundation/voting.html

On Thu, Jun 20, 2019 at 02:00 Jim Apple <jb...@apache.org> wrote:

> This is a thread for discussing a release of parquet-format. The last
> release appears to be 2.6.0 from September 2018:
>
> https://github.com/apache/parquet-format/releases
>
> The diff from then until now is
>
>
> https://github.com/apache/parquet-format/compare/df6132b94f273521a418a74442085fdd5a0aa009...4157b4c6132086e318943f1898523f7dcb013f35
>
> It's my understanding we'll need a parquet-format release before a
> parquet-mr and/or parquet-cpp release[0].
>
> In the most recent discussion thread on this, there were two concerns
> raised that are not yet addressed:
>
> 1. Should Bloom filters use a different hash function by default?
> 2. Should we devise an automated way to test cross-language compatibility
> of parquet files, especially for the new Bloom filter spec?
>
> I am suggesting a release despite these open issues based on my belief
> that it's reasonable to handle these after a parquet-format release.
>
> A final note: although I am suggesting the release, it looks to me like
> the release recipe[1] can only be executed by a committer, which I am not.
> This means even if there is consensus on a release, someone else would need
> to do the legwork.
>
> Thanks,
> Jim
>
> [0] This code now lives in
> https://github.com/apache/arrow/tree/master/cpp/src/parquet, I believe.
>
> [1] https://parquet.apache.org/documentation/how-to-release/
>