Posted to dev@parquet.apache.org by Claire McGinty <cl...@gmail.com> on 2023/09/14 19:04:15 UTC

Parquet dictionary size limits?

Hi dev@,

I'm running some benchmarking on Parquet read/write performance and have a
few questions about how dictionary encoding works under the hood. Let me
know if there's a better channel for this :)

My test case uses parquet-avro, where I'm writing a single file containing
5 million records. Each record has a single column, an Avro String field
(Parquet binary field). I ran two configurations of the base setup: in the
first case, the string field has 5,000 possible unique values. In the
second case, it has 50,000 unique values.
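
For concreteness, the write path looks roughly like this (a condensed
sketch of the gist linked below; the uniformly-random value generation is
an illustration of how the two cardinalities were produced):

  // Uses org.apache.avro.*, org.apache.parquet.avro.AvroParquetWriter,
  // and org.apache.hadoop.fs.Path.
  Schema schema = SchemaBuilder.record("TestRecord").namespace("testdata")
      .fields().requiredString("stringField").endRecord();

  try (ParquetWriter<GenericRecord> writer =
      AvroParquetWriter.<GenericRecord>builder(new Path("testdata-case1.parquet"))
          .withSchema(schema)
          .build()) {
    Random random = new Random();
    for (int i = 0; i < 5_000_000; i++) {
      GenericRecord record = new GenericData.Record(schema);
      // Case 1: 5,000 possible unique values; case 2 swaps in 50,000.
      record.put("stringField", String.valueOf(random.nextInt(5_000)));
      writer.write(record);
    }
  }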

In the first case (5k unique values), I used parquet-tools to inspect the
file metadata and found that a dictionary had been written:

% parquet-tools meta testdata-case1.parquet
> file schema:  testdata.TestRecord
>
> --------------------------------------------------------------------------------
> stringField:  REQUIRED BINARY L:STRING R:0 D:0
> row group 1:  RC:5000001 TS:18262874 OFFSET:4
>
> --------------------------------------------------------------------------------
> stringField:   BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00
> VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls:
> 0]


But in the second case (50k unique values), parquet-tools shows that no
dictionary gets created, and the file size is *much* bigger:

% parquet-tools meta testdata-case2.parquet
> file schema:  testdata.TestRecord
>
> --------------------------------------------------------------------------------
> stringField:  REQUIRED BINARY L:STRING R:0 D:0
> row group 1:  RC:5000001 TS:18262874 OFFSET:4
>
> --------------------------------------------------------------------------------
> stringField:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00
> VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]


(I created a gist of my test reproduction here
<https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)

Based on this, I'm guessing there's some tip-over point after which Parquet
will give up on writing a dictionary for a given column? After reading the
Configuration docs
<https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>,
I tried increasing the dictionary page size configuration 5x, with the same
result (no dictionary created).
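
(For reference, I'm raising the limit through the writer builder; a minimal
sketch, assuming the Hadoop config key `parquet.dictionary.page.size` maps
to the same knob:)

  // Bump the dictionary page size 5x over the 1MB default.
  AvroParquetWriter.<GenericRecord>builder(new Path("testdata-case2.parquet"))
      .withSchema(schema)
      .withDictionaryEncoding(true)
      .withDictionaryPageSize(5 * ParquetWriter.DEFAULT_PAGE_SIZE)
      .build();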

So to summarize, my questions are:

- What's the heuristic for Parquet dictionary writing to succeed for a
given column?
- Is that heuristic configurable at all?
- For high-cardinality datasets, has the idea of a frequency-based
dictionary encoding been explored? Say, if the data follows a certain
statistical distribution, we can create a dictionary of the most frequent
values only?
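
(To make the last question concrete -- this is purely hypothetical, not an
existing parquet-mr API -- a frequency-based approach might cap the
dictionary at the k most frequent values and leave the long tail for a
plain encoding:)

  // Hypothetical sketch only; parquet-mr has no such mode today.
  // Uses java.util.* and java.util.stream.Collectors.
  static Set<String> topKValues(List<String> column, int k) {
    Map<String, Long> counts = new HashMap<>();
    for (String value : column) {
      counts.merge(value, 1L, Long::sum);
    }
    // The k most frequent values would go in the dictionary; everything
    // else would need some hybrid dictionary/plain page representation.
    return counts.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(k)
        .map(Map.Entry::getKey)
        .collect(Collectors.toSet());
  }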

Thanks for your time!
- Claire

Re: Parquet dictionary size limits?

Posted by Micah Kornfield <em...@gmail.com>.
I agree, thanks for looking into this.

Overall, I think the equalizing behavior of file-level compression (ZSTD)
> makes it not worth it to add a configuration option for dictionary
> compression :)

One reason to potentially still move forward with configuration here is
that general purpose compression can be significantly slower than
dictionary compression.  If you've already chosen to compress then I agree
a new knob might not be worth it.  At least some recent research [1] points
to this being a bottleneck.

[1] https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf

On Wed, Sep 27, 2023 at 9:58 PM Gang Wu <us...@gmail.com> wrote:

> Thanks for the thorough benchmark. These findings are pretty interesting!
>
> On Thu, Sep 28, 2023 at 5:32 AM Claire McGinty <claire.d.mcginty@gmail.com
> >
> wrote:
>
> > Hi all,
> >
> > Just to follow up, I ran some benchmarks with an added Configuration
> option
> > to set a desired "compression ratio" param, which you can see here
> > <https://github.com/clairemcginty/parquet-benchmarks/blob/main/write_results.md#max-dictionary-compression-ratio-option>,
> > on a variety of data layouts (distribution, sorting, cardinality etc). I
> > also have a table of comparisons
> > <https://github.com/clairemcginty/parquet-benchmarks/blob/main/write_results.md#overall-comparison>
> > using the latest 0.14.0-SNAPSHOT as a baseline. These are my takeaways:
> >
> >    - The compression ratio param doesn't benefit data that's in sorted
> >    order (IMO, because even without the addition of this param, sorted
> >    columns are more likely to produce efficient dict encodings).
> >    - On shuffled (non-sorted) data, setting the compression ratio param
> >    produces a much better *uncompressed* file result (in one case, 39MB
> >    vs. the baseline 102MB). However, after applying a file-level
> >    compression algorithm such as ZSTD (codec setup sketched below), the
> >    baseline and ratio-param results turn out pretty much equal (within a
> >    5% margin). I think this makes sense: since in Parquet 1.0
> >    dictionaries are not encoded, ZSTD must be better at compressing many
> >    repeated column values across a file than it is at compressing
> >    dictionaries across all pages.
> >    - The compression ratio param works best with larger page sizes (10MB
> >    or 50MB) with large-ish dictionary page sizes (10MB).
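> >
> >    (For reference: the file-level compression in these runs is applied
> >    at the writer -- a minimal sketch, assuming the standard parquet-avro
> >    builder API; the file name is illustrative:)
> >
> >      AvroParquetWriter.<GenericRecord>builder(new Path("testdata-zstd.parquet"))
> >          .withSchema(schema)
> >          .withCompressionCodec(CompressionCodecName.ZSTD) // vs. the UNCOMPRESSED baseline
> >          .build();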
> >
> > Overall, I think the equalizing behavior of file-level compression (ZSTD)
> > makes it not worth it to add a configuration option for dictionary
> > compression :) Thanks for all of your input on this -- if nothing else,
> > the benchmarks are a really interesting look at how important data
> > layout is to overall file size!
> >
> > Best,
> > Claire
> >
> > On Thu, Sep 21, 2023 at 2:12 PM Claire McGinty <
> claire.d.mcginty@gmail.com
> > >
> > wrote:
> >
> > > Hmm, like a flag to basically turn off the isCompressionSatisfying
> > > check per-column? That might be simplest!
> > >
> > > So to summarize, a column will not write a dictionary encoding when
> > > either:
> > >
> > > (1) `parquet.enable.dictionary` is set to False
> > > (2) # of distinct values in a column chunk exceeds
> > > `DictionaryValuesWriter#MAX_DICTIONARY_VALUES` (currently set to
> > > Integer.MAX_VALUE)
> > > (3) Total encoded bytes in a dictionary exceed the value of
> > > `parquet.dictionary.page.size`
> > > (4) Desired compression ratio (as a measure of # distinct values :
> > > total # values) is not achieved (see the condensed sketch below)
> > >
> > > I might try out various options for making (4) configurable, starting
> > > with your suggestion, and testing them out on more realistic data
> > > distributions. Will try to return to this thread with my results in a
> > > few days :)
> > >
> > > Best,
> > > Claire
> > >
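> > > (Condensed sketch of the fallback conditions above -- the names come
> > > from parquet-mr, but this is pseudocode, not the literal
> > > implementation:)
> > >
> > >   boolean fallBackToPlainEncoding =
> > >       !conf.getBoolean("parquet.enable.dictionary", true)     // (1)
> > >       || distinctValueCount > MAX_DICTIONARY_VALUES           // (2)
> > >       || dictionaryByteSize > dictionaryPageSize              // (3)
> > >       // (4) is only evaluated when the first page is flushed:
> > >       || !isCompressionSatisfying(rawSize, encodedSize);
> > >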
> > >
> > > On Thu, Sep 21, 2023 at 8:48 AM Gang Wu <us...@gmail.com> wrote:
> > >
> > >> The current implementation only checks the first page, which is
> > >> vulnerable in many cases. I think your suggestion makes sense.
> > >> However, there is no one-size-fits-all solution. How about simply
> > >> adding a flag to enforce dictionary encoding for a specific column?
> > >>
> > >>
> > >> On Thu, Sep 21, 2023 at 1:08 AM Claire McGinty <
> > >> claire.d.mcginty@gmail.com>
> > >> wrote:
> > >>
> > >> > I think I figured it out! The dictionaryByteSize == 0 was a red
> > >> > herring; I was looking at an IntegerDictionaryValuesWriter for an
> > >> > empty column rather than my high-cardinality column. Your analysis
> > >> > of the situation was right--it was just that in the first page,
> > >> > there weren't enough distinct values to pass the check.
> > >> >
> > >> > I wonder if we could maybe make this value configurable per-column?
> > >> > Either:
> > >> >
> > >> > - A desired ratio of distinct values / total values, on a scale of
> > >> >   0-1.0
> > >> > - Number of pages to check for compression before falling back
> > >> >
> > >> > Let me know what you think!
> > >> >
> > >> > Best,
> > >> > Claire
> > >> >
> > >> > On Wed, Sep 20, 2023 at 9:37 AM Gang Wu <us...@gmail.com> wrote:
> > >> >
> > >> > > I don't understand why you get encodedSize == 1, dictionaryByteSize
> > >> > > == 0 and rawSize == 0 in the first page check. It seems that the
> > >> > > page does not have any meaningful values. Could you please check
> > >> > > how many values are written before the page check?
> > >> > >
> > >> > > On Thu, Sep 21, 2023 at 12:12 AM Claire McGinty <
> > >> > > claire.d.mcginty@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > Hey Gang,
> > >> > > >
> > >> > > > Thanks for the followup! I see what you're saying where it's
> > >> > > > sometimes just bad luck with what ends up in the first page. The
> > >> > > > intuition seems like a larger page size should produce a better
> > >> > > > encoding in this case... I updated my branch
> > >> > > > <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
> > >> > > > to add a test with a page size/dict page size of 10MB and am
> > >> > > > seeing the same failure, though.
> > >> > > >
> > >> > > > Something seems kind of odd actually -- when I stepped through
> > >> > > > the test I added w/ debugger, it falls back after invoking
> > >> > > > isCompressionSatisfying with encodedSize == 1, dictionaryByteSize
> > >> > > > == 0 and rawSize == 0; 1 + 0 < 1 returns true. (You can also see
> > >> > > > this in the System.out logs I added, in the branch's GHA run
> > >> > > > logs). This doesn't seem right to me -- does
> > >> > > > isCompressionSatisfying need an extra check to make sure the
> > >> > > > dictionary isn't empty?
> > >> > > >
> > >> > > > Also, thanks, Aaron! I got into this while running some
> > >> > > > micro-benchmarks on Parquet reads when various dictionary/bloom
> > >> > > > filter/encoding options are configured. Happy to share out when
> > >> > > > I'm done.
> > >> > > >
> > >> > > > Best,
> > >> > > > Claire
> > >> > > >
> > >> > > > On Tue, Sep 19, 2023 at 9:06 PM Gang Wu <us...@gmail.com>
> wrote:
> > >> > > >
> > >> > > > > Thanks for the investigation!
> > >> > > > >
> > >> > > > > I think the check below makes sense for a single page:
> > >> > > > >
> > >> > > > >   @Override
> > >> > > > >   public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
> > >> > > > >     return (encodedSize + dictionaryByteSize) < rawSize;
> > >> > > > >   }
> > >> > > > >
> > >> > > > > The problem is that the fallback check is only performed on the
> > >> > > > > first page. In the first page, all values in that page may be
> > >> > > > > distinct, so it is unlikely to pass the isCompressionSatisfying
> > >> > > > > check.
> > >> > > > >
> > >> > > > > Best,
> > >> > > > > Gang
> > >> > > > >
> > >> > > > >
> > >> > > > > On Wed, Sep 20, 2023 at 5:04 AM Aaron Niskode-Dossett
> > >> > > > > <an...@etsy.com.invalid> wrote:
> > >> > > > >
> > >> > > > > > Claire, thank you for your research and examples on this
> > >> > > > > > topic, I've learned a lot.  My hunch is that your change
> > >> > > > > > would be a good one, but I'm not an expert (and more to the
> > >> > > > > > point, not a committer).  I'm looking forward to learning
> > >> > > > > > more as this discussion continues.
> > >> > > > > >
> > >> > > > > > Thank you again, Aaron
> > >> > > > > >
> > >> > > > > > On Tue, Sep 19, 2023 at 2:48 PM Claire McGinty <
> > >> > > > > claire.d.mcginty@gmail.com
> > >> > > > > > >
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > I created a quick branch
> > >> > > > > > > <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
> > >> > > > > > > to reproduce what I'm seeing -- the test shows that an Int
> > >> > > > > > > column with cardinality 100 successfully results in a dict
> > >> > > > > > > encoding, but an int column with cardinality 10,000 falls
> > >> > > > > > > back and doesn't create a dict encoding. This seems like a
> > >> > > > > > > low threshold given the 1MB dictionary page size, so I just
> > >> > > > > > > wanted to check whether this is expected or not :)
> > >> > > > > > >
> > >> > > > > > > Best,
> > >> > > > > > > Claire
> > >> > > > > > >
> > >> > > > > > > On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty <
> > >> > > > > > claire.d.mcginty@gmail.com
> > >> > > > > > > >
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi, just wanted to follow up on this!
> > >> > > > > > > >
> > >> > > > > > > > I ran a debugger to find out why my column wasn't ending
> > >> > > > > > > > up with a dictionary encoding and it turns out that even
> > >> > > > > > > > though DictionaryValuesWriter#shouldFallback()
> > >> > > > > > > > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117>
> > >> > > > > > > > always returned false (dictionaryByteSize was always less
> > >> > > > > > > > than my configured page size),
> > >> > > > > > > > DictionaryValuesWriter#isCompressionSatisfying
> > >> > > > > > > > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125>
> > >> > > > > > > > was what was causing Parquet to switch
> > >> > > > > > > > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75>
> > >> > > > > > > > back to the fallback, non-dict writer.
> > >> > > > > > > >
> > >> > > > > > > > From what I can tell, this check compares the total byte
> > >> > > > > > > > size of *every* element with the byte size of each
> > >> > > > > > > > *distinct* element as a kind of proxy for encoding
> > >> > > > > > > > efficiency... however, it seems strange that this check
> > >> > > > > > > > can cause the writer to fall back even if the total
> > >> > > > > > > > encoded dict size is far below the configured dictionary
> > >> > > > > > > > page size. Out of curiosity, I modified
> > >> > > > > > > > DictionaryValuesWriter#isCompressionSatisfying
> > >> > > > > > > > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125>
> > >> > > > > > > > to also check whether total byte size was less than the
> > >> > > > > > > > dictionary max size and re-ran my Parquet write with a
> > >> > > > > > > > local snapshot, and my file size dropped 50%.
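> > >> > > > > > > >
> > >> > > > > > > > (Roughly, the tweak -- a sketch rather than the exact
> > >> > > > > > > > diff; maxDictionaryByteSize is the writer's configured
> > >> > > > > > > > dictionary page size:)
> > >> > > > > > > >
> > >> > > > > > > >   @Override
> > >> > > > > > > >   public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
> > >> > > > > > > >     // Keep the dictionary while it still fits in the configured
> > >> > > > > > > >     // dictionary page size, instead of falling back purely on
> > >> > > > > > > >     // the first-page size ratio.
> > >> > > > > > > >     return (encodedSize + dictionaryByteSize) < rawSize
> > >> > > > > > > >         || dictionaryByteSize < maxDictionaryByteSize;
> > >> > > > > > > >   }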
> > >> > > > > > > >
> > >> > > > > > > > Best,
> > >> > > > > > > > Claire
> > >> > > > > > > >
> > >> > > > > > > > On Mon, Sep 18, 2023 at 9:16 AM Claire McGinty <
> > >> > > > > > > claire.d.mcginty@gmail.com>
> > >> > > > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > >> Oh, interesting! I'm setting it via the
> > >> > > > > > > >> ParquetWriter#withDictionaryPageSize method, and I do
> > >> > > > > > > >> see the overall file size increasing when I bump the
> > >> > > > > > > >> value. I'll look into it a bit more -- it would be
> > >> > > > > > > >> helpful for some cases where the # unique values in a
> > >> > > > > > > >> column is just over the size limit.
> > >> > > > > > > >>
> > >> > > > > > > >> - Claire
> > >> > > > > > > >>
> > >> > > > > > > >> On Fri, Sep 15, 2023 at 9:54 AM Micah Kornfield <
> > >> > > > > > emkornfield@gmail.com>
> > >> > > > > > > >> wrote:
> > >> > > > > > > >>
> > >> > > > > > > >>> I'll note there is also a check for encoding
> > >> > > > > > > >>> effectiveness [1] that could come into play but I'd
> > >> > > > > > > >>> guess that isn't the case here.
> > >> > > > > > > >>>
> > >> > > > > > > >>> [1]
> > >> > > > > > > >>> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124
> > >> > > > > > > >>>
> > >> > > > > > > >>> On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield <
> > >> > > > > > emkornfield@gmail.com
> > >> > > > > > > >
> > >> > > > > > > >>> wrote:
> > >> > > > > > > >>>
> > >> > > > > > > >>> >> I'm glad I was looking at the right setting for
> > >> > > > > > > >>> >> dictionary size. I just tried it out with 10x, 50x,
> > >> > > > > > > >>> >> and even total file size, though, and still am not
> > >> > > > > > > >>> >> seeing a dictionary get created. Is it possible it's
> > >> > > > > > > >>> >> bounded by file page size or some other layout
> > >> > > > > > > >>> >> option that I need to bump as well?
> > >> > > > > > > >>> >
> > >> > > > > > > >>> > Sorry I'm less familiar with parquet-mr, hopefully
> > >> > > > > > > >>> > someone else will chime in.  If I had to guess, maybe
> > >> > > > > > > >>> > somehow the config value isn't making it to the
> > >> > > > > > > >>> > writer (but there could also be something else at
> > >> > > > > > > >>> > play).
> > >> > > > > > > >>> >
> > >> > > > > > > >>> > On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty <
> > >> > > > > > > >>> claire.d.mcginty@gmail.com>
> > >> > > > > > > >>> > wrote:
> > >> > > > > > > >>> >
> > >> > > > > > > >>> >> Thanks so much, Micah!
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >> > I think you are using the right setting, but maybe
> > >> > > > > > > >>> >> > it is possible the strings are still exceeding the
> > >> > > > > > > >>> >> > threshold (perhaps increasing it by 50x or more to
> > >> > > > > > > >>> >> > verify)
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >> I'm glad I was looking at the right setting for
> > >> > > > > > > >>> >> dictionary size. I just tried it out with 10x, 50x,
> > >> > > > > > > >>> >> and even total file size, though, and still am not
> > >> > > > > > > >>> >> seeing a dictionary get created. Is it possible it's
> > >> > > > > > > >>> >> bounded by file page size or some other layout
> > >> > > > > > > >>> >> option that I need to bump as well?
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >> > I haven't seen this discussed during my time in
> > >> > > > > > > >>> >> > the community but maybe it was discussed in the
> > >> > > > > > > >>> >> > past.  I think the main challenge here is that
> > >> > > > > > > >>> >> > pages are either dictionary encoded or not.  I'd
> > >> > > > > > > >>> >> > guess to make this practical there would need to
> > >> > > > > > > >>> >> > be a new hybrid page type, which I think might be
> > >> > > > > > > >>> >> > an interesting idea but quite a bit of work.
> > >> > > > > > > >>> >> > Additionally, one would likely need heuristics for
> > >> > > > > > > >>> >> > when to potentially use the new mode versus a
> > >> > > > > > > >>> >> > complete fallback.
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >> Got it, thanks for the explanation! It does seem
> > >> > > > > > > >>> >> like a huge amount of work.
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >> Best,
> > >> > > > > > > >>> >> Claire
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >> On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <
> > >> > > > > > > >>> emkornfield@gmail.com>
> > >> > > > > > > >>> >> wrote:
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> > > - What's the heuristic for Parquet dictionary
> > >> > > > > > > >>> >> > > writing to succeed for a given column?
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >> > https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >> > > - Is that heuristic configurable at all?
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >> > I think you are using the right setting, but maybe
> > >> > > > > > > >>> >> > it is possible the strings are still exceeding the
> > >> > > > > > > >>> >> > threshold (perhaps increasing it by 50x or more to
> > >> > > > > > > >>> >> > verify)
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >> > > - For high-cardinality datasets, has the idea of
> > >> > > > > > > >>> >> > > a frequency-based dictionary encoding been
> > >> > > > > > > >>> >> > > explored? Say, if the data follows a certain
> > >> > > > > > > >>> >> > > statistical distribution, we can create a
> > >> > > > > > > >>> >> > > dictionary of the most frequent values only?
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >> > I haven't seen this discussed during my time in
> > >> > > > > > > >>> >> > the community but maybe it was discussed in the
> > >> > > > > > > >>> >> > past.  I think the main challenge here is that
> > >> > > > > > > >>> >> > pages are either dictionary encoded or not.  I'd
> > >> > > > > > > >>> >> > guess to make this practical there would need to
> > >> > > > > > > >>> >> > be a new hybrid page type, which I think might be
> > >> > > > > > > >>> >> > an interesting idea but quite a bit of work.
> > >> > > > > > > >>> >> > Additionally, one would likely need heuristics for
> > >> > > > > > > >>> >> > when to potentially use the new mode versus a
> > >> > > > > > > >>> >> > complete fallback.
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >> > Cheers,
> > >> > > > > > > >>> >> > Micah

Re: Parquet dictionary size limits?

Posted by Gang Wu <us...@gmail.com>.
Thanks for the thorough benchmark. These findings are pretty interesting!


Re: Parquet dictionary size limits?

Posted by Claire McGinty <cl...@gmail.com>.
Hi all,

Just to follow up: I ran some benchmarks with an added Configuration option
to set a desired "compression ratio" param, which you can see here
<https://github.com/clairemcginty/parquet-benchmarks/blob/main/write_results.md#max-dictionary-compression-ratio-option>,
covering a variety of data layouts (distribution, sorting, cardinality,
etc.). I also have a table of comparisons
<https://github.com/clairemcginty/parquet-benchmarks/blob/main/write_results.md#overall-comparison>
using the latest 0.14.0-SNAPSHOT as a baseline. These are my takeaways:

   - The compression ratio param doesn't benefit data that's in sorted
   order (IMO, because even without the addition of this param, sorted
   columns are more likely to produce efficient dict encodings).
   - On shuffled (non-sorted) data, setting the compression ratio param
   produces a much better *uncompressed* file result (in one case, 39MB vs.
   the baseline's 102MB). However, after applying a file-level compression
   algorithm such as ZSTD, the baseline and ratio-param results turn out
   roughly equal (within a 5% margin). I think this makes sense: since in
   Parquet 1.0 dictionaries are not encoded, ZSTD must be better at
   compressing many repeated column values across a file than it is at
   compressing dictionaries across all pages.
   - The compression ratio param works best with larger page sizes (10MB or
   50MB) combined with large-ish dictionary page sizes (10MB) -- see the
   writer sketch below.
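
For context, here's roughly how the writers in these benchmarks are
configured (a simplified sketch using parquet-avro; the output path is a
placeholder, and the page/dictionary sizes match the last takeaway):

  import org.apache.avro.Schema;
  import org.apache.avro.SchemaBuilder;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.avro.AvroParquetWriter;
  import org.apache.parquet.hadoop.ParquetWriter;
  import org.apache.parquet.hadoop.metadata.CompressionCodecName;

  // same shape as the test schema: testdata.TestRecord { stringField }
  Schema schema = SchemaBuilder.record("TestRecord").namespace("testdata")
      .fields().requiredString("stringField").endRecord();

  ParquetWriter<GenericRecord> writer = AvroParquetWriter
      .<GenericRecord>builder(new Path("testdata.parquet")) // placeholder path
      .withSchema(schema)
      .withDictionaryEncoding(true)
      .withPageSize(10 * 1024 * 1024)           // 10MB data pages
      .withDictionaryPageSize(10 * 1024 * 1024) // 10MB dictionary pages
      .withCompressionCodec(CompressionCodecName.ZSTD) // UNCOMPRESSED for the raw-size runs
      .build();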

Overall, I think the equalizing behavior of file-level compression (ZSTD)
means it's not worth adding a configuration option for dictionary
compression :) Thanks for all of your input on this -- if nothing else, the
benchmarks are a really interesting look at how important data layout is to
overall file size!

Best,
Claire

On Thu, Sep 21, 2023 at 2:12 PM Claire McGinty <cl...@gmail.com>
wrote:

> [quoted thread trimmed]

Re: Parquet dictionary size limits?

Posted by Claire McGinty <cl...@gmail.com>.
Hmm, like a flag to basically turn off the isCompressionSatisfying check
per-column? That might be simplest!

So to summarize, a column will not end up with a dictionary encoding when
any of the following holds:

(1) `parquet.enable.dictionary` is set to false
(2) # of distinct values in a column chunk exceeds
`DictionaryValuesWriter#MAX_DICTIONARY_VALUES` (currently set to
Integer.MAX_VALUE)
(3) Total encoded bytes in a dictionary exceed the value of
`parquet.dictionary.page.size`
(4) The desired compression ratio (as a measure of # distinct values :
total # values) is not achieved -- see the sketch below
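
To make (4) concrete, the check that triggers the fallback today is (from
DictionaryValuesWriter):

  @Override
  public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
    return (encodedSize + dictionaryByteSize) < rawSize;
  }

and a configurable version might look something like this (just a sketch --
`ratioThreshold` would be a new field read from a hypothetical config key
like `parquet.dictionary.min-compression-ratio`, where 1.0 reproduces
today's behavior):

  @Override
  public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
    if (rawSize == 0) {
      return true; // nothing written yet; don't fall back on an empty page
    }
    // fall back only if dictionary encoding fails to beat the raw size by
    // the configured factor
    return (encodedSize + dictionaryByteSize) < rawSize * ratioThreshold;
  }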

I might try out various options for making (4) configurable, starting with
your suggestion, and testing them out on more realistic data distributions.
Will try to return to this thread with my results in a few days :)

Best,
Claire


On Thu, Sep 21, 2023 at 8:48 AM Gang Wu <us...@gmail.com> wrote:

> [quoted thread trimmed]

Re: Parquet dictionary size limits?

Posted by Gang Wu <us...@gmail.com>.
The current implementation only checks the first page, which is
vulnerable in many cases. I think your suggestion makes sense.
However, there is no one-size-fits-all solution. How about simply
adding a flag to enforce dictionary encoding for a specific column?
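
Something along these lines, perhaps (purely a sketch -- the
withForcedDictionaryEncoding method is hypothetical and does not exist in
parquet-mr today; the per-column withDictionaryEncoding(path, enable)
overload does exist, but it only enables the dictionary writer and does not
bypass the fallback checks):

  import org.apache.parquet.column.ParquetProperties;

  ParquetProperties props = ParquetProperties.builder()
      // existing: opt this column into dictionary encoding (still subject
      // to the fallback heuristics)
      .withDictionaryEncoding("stringField", true)
      // hypothetical: never fall back to PLAIN for this column
      .withForcedDictionaryEncoding("stringField")
      .build();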


On Thu, Sep 21, 2023 at 1:08 AM Claire McGinty <cl...@gmail.com>
wrote:

> [quoted thread trimmed]
> > > > > > >>> >>
> > > > > > >>> >> I haven't seen my discussion during my time in the
> community
> > > but
> > > > > > >>> maybe it
> > > > > > >>> >> > was discussed in the past.  I think the main challenge
> > here
> > > is
> > > > > > that
> > > > > > >>> >> pages
> > > > > > >>> >> > are either dictionary encoded or not.  I'd guess to make
> > > this
> > > > > > >>> practical
> > > > > > >>> >> > there would need to be a new hybrid page type, which I
> > think
> > > > it
> > > > > > >>> might
> > > > > > >>> >> be an
> > > > > > >>> >> > interesting idea but quite a bit of work.  Additionally,
> > one
> > > > > would
> > > > > > >>> >> likely
> > > > > > >>> >> > need heuristics for when to potentially use the new mode
> > > > versus
> > > > > a
> > > > > > >>> >> complete
> > > > > > >>> >> > fallback.
> > > > > > >>> >> >
> > > > > > >>> >>
> > > > > > >>> >> Got it, thanks for the explanation! It does seem like a
> huge
> > > > > amount
> > > > > > of
> > > > > > >>> >> work
> > > > > > >>> >>
> > > > > > >>> >>
> > > > > > >>> >> Best,
> > > > > > >>> >> Claire
> > > > > > >>> >>
> > > > > > >>> >>
> > > > > > >>> >>
> > > > > > >>> >> On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <
> > > > > > >>> emkornfield@gmail.com>
> > > > > > >>> >> wrote:
> > > > > > >>> >>
> > > > > > >>> >> > >
> > > > > > >>> >> > > - What's the heuristic for Parquet dictionary writing
> to
> > > > > succeed
> > > > > > >>> for a
> > > > > > >>> >> > > given column?
> > > > > > >>> >> >
> > > > > > >>> >> >
> > > > > > >>> >> >
> > > > > > >>> >> >
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
> > > > > > >>> >> >
> > > > > > >>> >> >
> > > > > > >>> >> > > - Is that heuristic configurable at all?
> > > > > > >>> >> >
> > > > > > >>> >> >
> > > > > > >>> >> > I think you are using the right setting, but maybe it is
> > > > > possible
> > > > > > >>> the
> > > > > > >>> >> > strings are still exceeding the threshold (perhaps
> > > increasing
> > > > it
> > > > > > by
> > > > > > >>> 50x
> > > > > > >>> >> or
> > > > > > >>> >> > more to verify)
> > > > > > >>> >> >
> > > > > > >>> >> >
> > > > > > >>> >> > > - For high-cardinality datasets, has the idea of a
> > > > > > frequency-based
> > > > > > >>> >> > > dictionary encoding been explored? Say, if the data
> > > follows
> > > > a
> > > > > > >>> certain
> > > > > > >>> >> > > statistical distribution, we can create a dictionary
> of
> > > the
> > > > > most
> > > > > > >>> >> frequent
> > > > > > >>> >> > > values only?
> > > > > > >>> >> >
> > > > > > >>> >> > I haven't seen my discussion during my time in the
> > community
> > > > but
> > > > > > >>> maybe
> > > > > > >>> >> it
> > > > > > >>> >> > was discussed in the past.  I think the main challenge
> > here
> > > is
> > > > > > that
> > > > > > >>> >> pages
> > > > > > >>> >> > are either dictionary encoded or not.  I'd guess to make
> > > this
> > > > > > >>> practical
> > > > > > >>> >> > there would need to be a new hybrid page type, which I
> > think
> > > > it
> > > > > > >>> might
> > > > > > >>> >> be an
> > > > > > >>> >> > interesting idea but quite a bit of work.  Additionally,
> > one
> > > > > would
> > > > > > >>> >> likely
> > > > > > >>> >> > need heuristics for when to potentially use the new mode
> > > > versus
> > > > > a
> > > > > > >>> >> complete
> > > > > > >>> >> > fallback.
> > > > > > >>> >> >
> > > > > > >>> >> > Cheers,
> > > > > > >>> >> > Micah
> > > > > > >>> >> >
> > > > > > >>> >> > On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <
> > > > > > >>> >> > claire.d.mcginty@gmail.com>
> > > > > > >>> >> > wrote:
> > > > > > >>> >> >
> > > > > > >>> >> > > Hi dev@,
> > > > > > >>> >> > >
> > > > > > >>> >> > > I'm running some benchmarking on Parquet read/write
> > > > > performance
> > > > > > >>> and
> > > > > > >>> >> have
> > > > > > >>> >> > a
> > > > > > >>> >> > > few questions about how dictionary encoding works
> under
> > > the
> > > > > > hood.
> > > > > > >>> Let
> > > > > > >>> >> me
> > > > > > >>> >> > > know if there's a better channel for this :)
> > > > > > >>> >> > >
> > > > > > >>> >> > > My test case uses parquet-avro, where I'm writing a
> > single
> > > > > file
> > > > > > >>> >> > containing
> > > > > > >>> >> > > 5 million records. Each record has a single column, an
> > > Avro
> > > > > > String
> > > > > > >>> >> field
> > > > > > >>> >> > > (Parquet binary field). I ran two configurations of
> base
> > > > > setup:
> > > > > > >>> in the
> > > > > > >>> >> > > first case, the string field has 5,000 possible unique
> > > > values.
> > > > > > In
> > > > > > >>> the
> > > > > > >>> >> > > second case, it has 50,000 unique values.
> > > > > > >>> >> > >
> > > > > > >>> >> > > In the first case (5k unique values), I used
> > parquet-tools
> > > > to
> > > > > > >>> inspect
> > > > > > >>> >> the
> > > > > > >>> >> > > file metadata and found that a dictionary had been
> > > written:
> > > > > > >>> >> > >
> > > > > > >>> >> > > % parquet-tools meta testdata-case1.parquet
> > > > > > >>> >> > > > file schema:  testdata.TestRecord
> > > > > > >>> >> > > >
> > > > > > >>> >> > > >
> > > > > > >>> >> > >
> > > > > > >>> >> >
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> --------------------------------------------------------------------------------
> > > > > > >>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
> > > > > > >>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
> > > > > > >>> >> > > >
> > > > > > >>> >> > > >
> > > > > > >>> >> > >
> > > > > > >>> >> >
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> --------------------------------------------------------------------------------
> > > > > > >>> >> > > > stringField:   BINARY UNCOMPRESSED DO:4 FPO:38918
> > > > > > >>> >> > SZ:8181452/8181452/1.00
> > > > > > >>> >> > > > VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min:
> 0,
> > > > max:
> > > > > > 999,
> > > > > > >>> >> > > num_nulls:
> > > > > > >>> >> > > > 0]
> > > > > > >>> >> > >
> > > > > > >>> >> > >
> > > > > > >>> >> > > But in the second case (50k unique values),
> > parquet-tools
> > > > > shows
> > > > > > >>> that
> > > > > > >>> >> no
> > > > > > >>> >> > > dictionary gets created, and the file size is *much*
> > > bigger:
> > > > > > >>> >> > >
> > > > > > >>> >> > > % parquet-tools meta testdata-case2.parquet
> > > > > > >>> >> > > > file schema:  testdata.TestRecord
> > > > > > >>> >> > > >
> > > > > > >>> >> > > >
> > > > > > >>> >> > >
> > > > > > >>> >> >
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> --------------------------------------------------------------------------------
> > > > > > >>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
> > > > > > >>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
> > > > > > >>> >> > > >
> > > > > > >>> >> > > >
> > > > > > >>> >> > >
> > > > > > >>> >> >
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> --------------------------------------------------------------------------------
> > > > > > >>> >> > > > stringField:  BINARY UNCOMPRESSED DO:0 FPO:4
> > > > > > >>> >> SZ:43896278/43896278/1.00
> > > > > > >>> >> > > > VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max:
> 9999,
> > > > > > >>> num_nulls: 0]
> > > > > > >>> >> > >
> > > > > > >>> >> > >
> > > > > > >>> >> > > (I created a gist of my test reproduction here
> > > > > > >>> >> > > <
> > > > > > >>> >>
> > > > > > >>>
> > > > >
> > https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806
> > > > > > >>> >> > >.)
> > > > > > >>> >> > >
> > > > > > >>> >> > > Based on this, I'm guessing there's some tip-over
> point
> > > > after
> > > > > > >>> which
> > > > > > >>> >> > Parquet
> > > > > > >>> >> > > will give up on writing a dictionary for a given
> column?
> > > > After
> > > > > > >>> reading
> > > > > > >>> >> > > the Configuration
> > > > > > >>> >> > > docs
> > > > > > >>> >> > > <
> > > > > > >>> >> >
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
> > > > > > >>> >> > > >,
> > > > > > >>> >> > > I tried increasing the dictionary page size
> > configuration
> > > > 5x,
> > > > > > >>> with the
> > > > > > >>> >> > same
> > > > > > >>> >> > > result (no dictionary created).
> > > > > > >>> >> > >
> > > > > > >>> >> > > So to summarize, my questions are:
> > > > > > >>> >> > >
> > > > > > >>> >> > > - What's the heuristic for Parquet dictionary writing
> to
> > > > > succeed
> > > > > > >>> for a
> > > > > > >>> >> > > given column?
> > > > > > >>> >> > > - Is that heuristic configurable at all?
> > > > > > >>> >> > > - For high-cardinality datasets, has the idea of a
> > > > > > frequency-based
> > > > > > >>> >> > > dictionary encoding been explored? Say, if the data
> > > follows
> > > > a
> > > > > > >>> certain
> > > > > > >>> >> > > statistical distribution, we can create a dictionary
> of
> > > the
> > > > > most
> > > > > > >>> >> frequent
> > > > > > >>> >> > > values only?
> > > > > > >>> >> > >
> > > > > > >>> >> > > Thanks for your time!
> > > > > > >>> >> > > - Claire
> > > > > > >>> >> > >
> > > > > > >>> >> >
> > > > > > >>> >>
> > > > > > >>> >
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Aaron Niskode-Dossett, Data Engineering -- Etsy
> > > > >
> > > >
> > >
> >
>

Re: Parquet dictionary size limits?

Posted by Claire McGinty <cl...@gmail.com>.
I think I figured it out! The dictionaryByteSize == 0 was a red herring; I
was looking at an IntegerDictionaryValuesWriter for an empty column rather
than my high-cardinality column. Your analysis of the situation was
right -- in the first page there simply weren't enough repeated values to
pass the check.
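
To make that concrete, here is a standalone sketch of the arithmetic (the
sizes are invented for illustration; only the inequality mirrors the
isCompressionSatisfying check from DictionaryValuesWriter, quoted further
down this thread):

  // A first page where every value is distinct (hypothetical sizes):
  long rawSize = 40_000;            // e.g. 10,000 distinct 4-byte ints
  long dictionaryByteSize = 40_000; // the dictionary holds every distinct value
  long encodedSize = 17_500;        // dictionary ids written for the page
  // (encodedSize + dictionaryByteSize) < rawSize -> 57,500 < 40,000 -> false
  boolean satisfying = (encodedSize + dictionaryByteSize) < rawSize;
  // false, so FallbackValuesWriter abandons the dictionary writer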

I wonder if we could make this heuristic configurable per column? Either:

- A desired ratio of distinct values / total values, on a scale of 0.0-1.0
  (a rough sketch of this option follows the list)
- Number of pages to check for compression before falling back
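
Something like this, for illustration (the method and threshold are
hypothetical -- nothing like this exists in parquet-mr today):

  // Sketch of option 1: fall back only once the observed distinct-value
  // ratio for a column exceeds a configurable per-column threshold.
  static boolean shouldFallBack(long distinctValues, long totalValues,
                                double maxDistinctRatio) {
    if (totalValues == 0) {
      return false; // nothing buffered yet; keep the dictionary writer
    }
    double distinctRatio = (double) distinctValues / totalValues;
    return distinctRatio > maxDistinctRatio; // e.g. 0.8 from the writer config
  }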

Let me know what you think!

Best,
Claire

On Wed, Sep 20, 2023 at 9:37 AM Gang Wu <us...@gmail.com> wrote:

> I don't understand why you get encodedSize == 1, dictionaryByteSize == 0
> and rawSize == 0 in the first page check. It seems that the page does not
> have any meaningful values. Could you please check how many values are
> written before the page check?
>
> [earlier quoted messages trimmed]

Re: Parquet dictionary size limits?

Posted by Gang Wu <us...@gmail.com>.
I don't understand why you get encodedSize == 1, dictionaryByteSize == 0
and rawSize == 0 in the first page check. It seems that the page does not
have any meaningful values. Could you please check how many values are
written before the page check?

On Thu, Sep 21, 2023 at 12:12 AM Claire McGinty <cl...@gmail.com>
wrote:

> Hey Gang,
>
> Thanks for the follow-up! I see what you're saying about it sometimes just
> being bad luck with what ends up in the first page. Intuitively, a larger
> page size should produce a better encoding in this case... I updated
> my branch
> <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
> to
> add a test with a page size/dict page size of 10MB and am seeing the same
> failure, though.
>
> Something seems kind of odd actually -- when I stepped through the test I
> added w/ debugger, it falls back after invoking isCompressionSatisfying
> with encodedSize == 1, dictionaryByteSize == 0 and rawSize == 0; 1 + 0 < 1
> returns true. (You can also see this in the System.out logs I added, in the
> branch's GHA run logs). This doesn't seem right to me -- does
> isCompressionSatisfying need an extra check to make sure the
> dictionary isn't empty?
>
> Also, thanks, Aaron! I got into this while running some micro-benchmarks on
> Parquet reads when various dictionary/bloom filter/encoding options are
> configured. Happy to share out when I'm done.
>
> Best,
> Claire
>
> [earlier quoted messages trimmed]

Re: Parquet dictionary size limits?

Posted by Claire McGinty <cl...@gmail.com>.
(Sorry, mistyped :) 1 + 0 < 1 returns false, of course, causing the check
<https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75>
"!initialWriter.isCompressionSatisfying(rawDataByteSize, bytes.size())"
to return true and the fallback to occur.)
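
In other words, with the values from my debugger run (a standalone recap;
the expressions mirror the code linked above):

  boolean satisfying = (1 + 0) < 1;     // encodedSize + dictionaryByteSize < rawSize -> false
  boolean shouldFallBack = !satisfying; // true, so fallBack() runs for the column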

- Claire

On Wed, Sep 20, 2023 at 8:28 AM Claire McGinty <cl...@gmail.com>
wrote:

> Hey Gang,
>
> Thanks for the follow-up! I see what you're saying about it sometimes just
> being bad luck with what ends up in the first page. Intuitively, a larger
> page size should produce a better encoding in this case... I
> updated my branch
> <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1> to
> add a test with a page size/dict page size of 10MB and am seeing the same
> failure, though.
>
> Something seems kind of odd actually -- when I stepped through the test I
> added w/ debugger, it falls back after invoking isCompressionSatisfying
> with encodedSize == 1, dictionaryByteSize == 0 and rawSize == 0; 1 + 0 < 1
> returns true. (You can also see this in the System.out logs I added, in the
> branch's GHA run logs). This doesn't seem right to me -- does
> isCompressionSatisfying need an extra check to make sure the
> dictionary isn't empty?
>
> Also, thanks, Aaron! I got into this while running some micro-benchmarks
> on Parquet reads when various dictionary/bloom filter/encoding options are
> configured. Happy to share out when I'm done.
>
> Best,
> Claire
>
> On Tue, Sep 19, 2023 at 9:06 PM Gang Wu <us...@gmail.com> wrote:
>
>> Thanks for the investigation!
>>
>> I think the check below makes sense for a single page:
>>   @Override
>>   public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
>>     return (encodedSize + dictionaryByteSize) < rawSize;
>>   }
>>
>> The problem is that the fallback check is only performed on the first
>> page. In the first page, all values in that page may be distinct, so it is
>> unlikely to pass the isCompressionSatisfying check.
>>
>> Best,
>> Gang
>>
>> [remainder of quoted thread trimmed]
>> > file
>> > > >>> >> > containing
>> > > >>> >> > > 5 million records. Each record has a single column, an Avro
>> > > String
>> > > >>> >> field
>> > > >>> >> > > (Parquet binary field). I ran two configurations of base
>> > setup:
>> > > >>> in the
>> > > >>> >> > > first case, the string field has 5,000 possible unique
>> values.
>> > > In
>> > > >>> the
>> > > >>> >> > > second case, it has 50,000 unique values.
>> > > >>> >> > >
>> > > >>> >> > > In the first case (5k unique values), I used parquet-tools
>> to
>> > > >>> inspect
>> > > >>> >> the
>> > > >>> >> > > file metadata and found that a dictionary had been written:
>> > > >>> >> > >
>> > > >>> >> > > % parquet-tools meta testdata-case1.parquet
>> > > >>> >> > > > file schema:  testdata.TestRecord
>> > > >>> >> > > >
>> > > >>> >> > > >
>> > > >>> >> > >
>> > > >>> >> >
>> > > >>> >>
>> > > >>>
>> > >
>> >
>> --------------------------------------------------------------------------------
>> > > >>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
>> > > >>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
>> > > >>> >> > > >
>> > > >>> >> > > >
>> > > >>> >> > >
>> > > >>> >> >
>> > > >>> >>
>> > > >>>
>> > >
>> >
>> --------------------------------------------------------------------------------
>> > > >>> >> > > > stringField:   BINARY UNCOMPRESSED DO:4 FPO:38918
>> > > >>> >> > SZ:8181452/8181452/1.00
>> > > >>> >> > > > VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0,
>> max:
>> > > 999,
>> > > >>> >> > > num_nulls:
>> > > >>> >> > > > 0]
>> > > >>> >> > >
>> > > >>> >> > >
>> > > >>> >> > > But in the second case (50k unique values), parquet-tools
>> > shows
>> > > >>> that
>> > > >>> >> no
>> > > >>> >> > > dictionary gets created, and the file size is *much*
>> bigger:
>> > > >>> >> > >
>> > > >>> >> > > % parquet-tools meta testdata-case2.parquet
>> > > >>> >> > > > file schema:  testdata.TestRecord
>> > > >>> >> > > >
>> > > >>> >> > > >
>> > > >>> >> > >
>> > > >>> >> >
>> > > >>> >>
>> > > >>>
>> > >
>> >
>> --------------------------------------------------------------------------------
>> > > >>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
>> > > >>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
>> > > >>> >> > > >
>> > > >>> >> > > >
>> > > >>> >> > >
>> > > >>> >> >
>> > > >>> >>
>> > > >>>
>> > >
>> >
>> --------------------------------------------------------------------------------
>> > > >>> >> > > > stringField:  BINARY UNCOMPRESSED DO:0 FPO:4
>> > > >>> >> SZ:43896278/43896278/1.00
>> > > >>> >> > > > VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999,
>> > > >>> num_nulls: 0]
>> > > >>> >> > >
>> > > >>> >> > >
>> > > >>> >> > > (I created a gist of my test reproduction here
>> > > >>> >> > > <
>> > > >>> >>
>> > > >>>
>> > https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806
>> > > >>> >> > >.)
>> > > >>> >> > >
>> > > >>> >> > > Based on this, I'm guessing there's some tip-over point
>> after
>> > > >>> which
>> > > >>> >> > Parquet
>> > > >>> >> > > will give up on writing a dictionary for a given column?
>> After
>> > > >>> reading
>> > > >>> >> > > the Configuration
>> > > >>> >> > > docs
>> > > >>> >> > > <
>> > > >>> >> >
>> > > >>> >>
>> > > >>>
>> > >
>> >
>> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
>> > > >>> >> > > >,
>> > > >>> >> > > I tried increasing the dictionary page size configuration
>> 5x,
>> > > >>> with the
>> > > >>> >> > same
>> > > >>> >> > > result (no dictionary created).
>> > > >>> >> > >
>> > > >>> >> > > So to summarize, my questions are:
>> > > >>> >> > >
>> > > >>> >> > > - What's the heuristic for Parquet dictionary writing to
>> > succeed
>> > > >>> for a
>> > > >>> >> > > given column?
>> > > >>> >> > > - Is that heuristic configurable at all?
>> > > >>> >> > > - For high-cardinality datasets, has the idea of a
>> > > frequency-based
>> > > >>> >> > > dictionary encoding been explored? Say, if the data
>> follows a
>> > > >>> certain
>> > > >>> >> > > statistical distribution, we can create a dictionary of the
>> > most
>> > > >>> >> frequent
>> > > >>> >> > > values only?
>> > > >>> >> > >
>> > > >>> >> > > Thanks for your time!
>> > > >>> >> > > - Claire
>> > > >>> >> > >
>> > > >>> >> >
>> > > >>> >>
>> > > >>> >
>> > > >>>
>> > > >>
>> > >
>> >
>> >
>> > --
>> > Aaron Niskode-Dossett, Data Engineering -- Etsy
>> >
>>
>

Re: Parquet dictionary size limits?

Posted by Claire McGinty <cl...@gmail.com>.
Hey Gang,

Thanks for the followup! I see what you're saying -- it can be just bad
luck with what ends up in the first page. Intuitively, though, a larger
page size should produce a better encoding in this case... I updated my
branch
<https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
to add a test with a page size/dictionary page size of 10MB and am seeing
the same failure.

Something seems kind of odd, actually -- when I stepped through the test I
added with a debugger, it falls back after invoking isCompressionSatisfying
with encodedSize == 1, dictionaryByteSize == 0 and rawSize == 0: the check
(1 + 0) < 0 evaluates to false, so the writer gives up on the dictionary.
(You can also see this in the System.out logs I added, in the branch's GHA
run logs.) This doesn't seem right to me -- does isCompressionSatisfying
need an extra check to make sure the dictionary isn't empty?
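
For concreteness, the kind of guard I have in mind would look something
like this (an untested sketch against DictionaryValuesWriter -- the
existing one-line check plus a hypothetical early return, not a real
patch):

  @Override
  public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
    // Hypothetical guard: with no raw bytes observed and an empty
    // dictionary, the size-ratio comparison below is meaningless, so
    // don't let it trigger a fallback yet.
    if (rawSize == 0 && dictionaryByteSize == 0) {
      return true;
    }
    // Existing heuristic: dictionary encoding must beat plain encoding.
    return (encodedSize + dictionaryByteSize) < rawSize;
  }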

Also, thanks, Aaron! I got into this while running some micro-benchmarks on
Parquet reads when various dictionary/bloom filter/encoding options are
configured. Happy to share out when I'm done.

Best,
Claire

Re: Parquet dictionary size limits?

Posted by Gang Wu <us...@gmail.com>.
Thanks for the investigation!

I think the check below makes sense for a single page:
  @Override
  public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
    return (encodedSize + dictionaryByteSize) < rawSize;
  }

The problem is that the fallback check is only performed on the first page.
In the first page, all of the values may still be distinct, so it is
unlikely to pass the isCompressionSatisfying check.
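
To make that concrete with made-up numbers: if the first page holds
100,000 values that are all distinct 8-byte strings, then rawSize is
800,000 and dictionaryByteSize is also about 800,000, since every value
lands in the dictionary. encodedSize (the dictionary ids) only adds to
that, so (encodedSize + dictionaryByteSize) can never drop below rawSize
and the check fails -- even though later pages could have reused those
dictionary entries almost for free.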

Best,
Gang

Re: Parquet dictionary size limits?

Posted by Aaron Niskode-Dossett <an...@etsy.com.INVALID>.
Claire, thank you for your research and examples on this topic, I've
learned a lot.  My hunch is that your change would be a good one, but I'm
not an expert (and more to the point, not a committer).  I'm looking
forward to learning more as this discussion continues.

Thank you again, Aaron


-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy

Re: Parquet dictionary size limits?

Posted by Claire McGinty <cl...@gmail.com>.
I created a quick branch
<https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
to reproduce what I'm seeing -- the test shows that an int column with
cardinality 100 successfully results in a dictionary encoding, but an int
column with cardinality 10,000 falls back and never gets one. This seems
like a low threshold given the 1MB dictionary page size, so I just wanted
to check whether this is expected or not :)
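
In case it's easier than clicking through the diff, the test is shaped
roughly like this (a condensed sketch, not the exact code in the branch --
names and sizes simplified):

  import org.apache.avro.Schema;
  import org.apache.avro.SchemaBuilder;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.generic.GenericRecordBuilder;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.avro.AvroParquetWriter;
  import org.apache.parquet.hadoop.ParquetFileReader;
  import org.apache.parquet.hadoop.ParquetWriter;
  import org.apache.parquet.hadoop.util.HadoopInputFile;

  public class DictFallbackRepro {
    public static void main(String[] args) throws Exception {
      Schema schema = SchemaBuilder.record("TestRecord").fields()
          .requiredInt("intField").endRecord();
      Path path = new Path("/tmp/dict-fallback-repro.parquet");
      int cardinality = 10_000; // flip to 100 and the dictionary survives

      // Write 5M records whose int column cycles through `cardinality`
      // distinct values.
      try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
          .<GenericRecord>builder(path)
          .withSchema(schema)
          .withDictionaryEncoding(true)
          .build()) {
        for (int i = 0; i < 5_000_000; i++) {
          writer.write(new GenericRecordBuilder(schema)
              .set("intField", i % cardinality).build());
        }
      }

      // Print the column's encodings from the footer; with cardinality
      // 100 they include a dictionary encoding, with 10,000 they don't.
      try (ParquetFileReader reader = ParquetFileReader.open(
          HadoopInputFile.fromPath(path, new Configuration()))) {
        System.out.println(reader.getFooter().getBlocks().get(0)
            .getColumns().get(0).getEncodings());
      }
    }
  }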

Best,
Claire


Re: Parquet dictionary size limits?

Posted by Claire McGinty <cl...@gmail.com>.
Hi, just wanted to follow up on this!

I ran a debugger to find out why my column wasn't ending up with a
dictionary encoding. It turns out that even though
DictionaryValuesWriter#shouldFallBack()
<https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117>
always returned false (dictionaryByteSize stayed below my configured
dictionary page size), DictionaryValuesWriter#isCompressionSatisfying
<https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125>
was what caused Parquet to switch
<https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75>
back to the fallback, non-dictionary writer.

From what I can tell, this check compares the total byte size of *every*
element written against the combined byte size of the *distinct* elements,
as a proxy for encoding efficiency. However, it seems strange that this
check can make the writer fall back even when the total encoded dictionary
size is far below the configured dictionary page size. Out of curiosity, I
modified DictionaryValuesWriter#isCompressionSatisfying
<https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125>
to also check whether the dictionary byte size was under the configured
maximum, re-ran my Parquet write with a local snapshot, and my file size
dropped 50%.
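
To make the interplay concrete, here's a simplified sketch of the two
checks plus the change I tried (method names and parameters approximate
DictionaryValuesWriter -- the real code tracks these sizes as writer state
rather than taking them as arguments):

// Simplified sketch of parquet-mr's dictionary fallback heuristic; this
// approximates DictionaryValuesWriter, it is not the exact source.
class DictionaryFallbackSketch {

  // Check 1: fall back once the dictionary outgrows the configured cap
  // (parquet.dictionary.page.size / ParquetWriter.Builder#withDictionaryPageSize).
  static boolean shouldFallBack(long dictionaryByteSize, long maxDictionaryByteSize) {
    return dictionaryByteSize > maxDictionaryByteSize;
  }

  // Check 2: dictionary encoding must beat plain encoding to be "satisfying".
  // rawSize: plain-encoded byte size of every value written so far.
  // encodedSize: byte size of the dictionary-index data written so far.
  static boolean isCompressionSatisfying(long rawSize, long encodedSize,
                                         long dictionaryByteSize) {
    return (encodedSize + dictionaryByteSize) < rawSize;
  }

  // The local change I tried: also keep the dictionary whenever it still
  // fits under the configured cap, even when the ratio check above fails.
  static boolean isCompressionSatisfyingPatched(long rawSize, long encodedSize,
                                                long dictionaryByteSize,
                                                long maxDictionaryByteSize) {
    return (encodedSize + dictionaryByteSize) < rawSize
        || dictionaryByteSize < maxDictionaryByteSize;
  }
}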

Best,
Claire


Re: Parquet dictionary size limits?

Posted by Claire McGinty <cl...@gmail.com>.
Oh, interesting! I'm setting it via the
ParquetWriter.Builder#withDictionaryPageSize method, and I do see the
overall file size increase when I bump the value. I'll look into it a bit
more -- it would be helpful for cases where the number of unique values in
a column puts the dictionary just over the size limit.
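
For reference, here's roughly how I'm applying the setting (a minimal
sketch assuming parquet-avro's builder API; the schema and output path are
placeholders for my actual test setup):

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class DictPageSizeExample {
  public static void main(String[] args) throws IOException {
    // Placeholder schema mirroring the single-string-column test record.
    Schema schema = SchemaBuilder.record("TestRecord").fields()
        .requiredString("stringField").endRecord();

    // Bump the dictionary page size to 5x the 1 MB default.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("/tmp/testdata.parquet"))
        .withSchema(schema)
        .withDictionaryEncoding(true)
        .withDictionaryPageSize(5 * ParquetWriter.DEFAULT_PAGE_SIZE)
        .build()) {
      // ... write GenericRecords here ...
    }
  }
}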

- Claire


Re: Parquet dictionary size limits?

Posted by Micah Kornfield <em...@gmail.com>.
I'll note there is also a check for encoding effectiveness [1] that could
come into play, but I'd guess that isn't the case here.

[1]
https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124


Re: Parquet dictionary size limits?

Posted by Micah Kornfield <em...@gmail.com>.
>
> I'm glad I was looking at the right setting for dictionary size. I just
> tried increasing it 10x, 50x, and even up to the total file size, though,
> and still am not seeing a dictionary get created. Is it possible it's
> bounded by the file page size or some other layout option that I need to
> bump as well?


Sorry, I'm less familiar with parquet-mr; hopefully someone else can chime
in.  If I had to guess, maybe somehow the config value isn't making it to
the writer (but there could also be something else at play).
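
One way to sanity-check the result from the Java side (a sketch using the
footer metadata API; the path is a placeholder) is to dump each column
chunk's encodings and dictionary page offset, which mirrors what
parquet-tools prints:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class DumpEncodings {
  public static void main(String[] args) throws Exception {
    // Placeholder path; point this at the test file.
    Path path = new Path("/tmp/testdata.parquet");
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
      for (ColumnChunkMetaData col :
          reader.getFooter().getBlocks().get(0).getColumns()) {
        // A dictionary-encoded chunk reports a *_DICTIONARY encoding and a
        // non-zero dictionary page offset (the DO field in parquet-tools).
        System.out.println(col.getPath() + " encodings=" + col.getEncodings()
            + " dictionaryPageOffset=" + col.getDictionaryPageOffset());
      }
    }
  }
}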


Re: Parquet dictionary size limits?

Posted by Claire McGinty <cl...@gmail.com>.
Thanks so much, Micah!

I think you are using the right setting, but maybe it is possible the
> strings are still exceeding the threshold (perhaps increasing it by 50x or
> more to verify).


I'm glad I was looking at the right setting for dictionary size. I just
tried increasing it 10x, 50x, and even up to the total file size, though,
and still am not seeing a dictionary get created. Is it possible it's
bounded by the file page size or some other layout option that I need to
bump as well?

I haven't seen this discussed during my time in the community, but maybe it
> was discussed in the past.  I think the main challenge here is that pages
> are either dictionary encoded or not.  I'd guess to make this practical
> there would need to be a new hybrid page type, which I think might be an
> interesting idea but quite a bit of work.  Additionally, one would likely
> need heuristics for when to potentially use the new mode versus a complete
> fallback.
>

Got it, thanks for the explanation! It does seem like a huge amount of work.


Best,
Claire




Re: Parquet dictionary size limits?

Posted by Micah Kornfield <em...@gmail.com>.
>
> - What's the heuristic for Parquet dictionary writing to succeed for a
> given column?


https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117


> - Is that heuristic configurable at all?


I think you are using the right setting, but maybe it is possible the
strings are still exceeding the threshold (perhaps increasing it by 50x or
more to verify).


> - For high-cardinality datasets, has the idea of a frequency-based
> dictionary encoding been explored? Say, if the data follows a certain
> statistical distribution, we can create a dictionary of the most frequent
> values only?

I haven't seen this discussed during my time in the community, but maybe it
was discussed in the past.  I think the main challenge here is that pages
are either dictionary encoded or not.  I'd guess to make this practical
there would need to be a new hybrid page type, which I think might be an
interesting idea but quite a bit of work.  Additionally, one would likely
need heuristics for when to potentially use the new mode versus a complete
fallback.
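
Just to illustrate the idea (nothing like this exists in parquet-mr today),
the selection side might look like a greedy frequency cutoff, with every
value outside the returned set landing in the plain-encoded tail that such
a hybrid page would have to carry:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Purely illustrative sketch of picking a frequency-based partial
// dictionary; not part of parquet-mr.
class FrequencyDictionarySketch {
  static Set<String> selectDictionary(Map<String, Long> valueCounts, long maxDictBytes) {
    // Sort distinct values by descending occurrence count.
    List<Map.Entry<String, Long>> byFrequency = new ArrayList<>(valueCounts.entrySet());
    byFrequency.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

    Set<String> dict = new LinkedHashSet<>();
    long used = 0;
    for (Map.Entry<String, Long> e : byFrequency) {
      // 4 bytes approximates the per-entry length prefix of a PLAIN dictionary page.
      long entryBytes = e.getKey().getBytes(StandardCharsets.UTF_8).length + 4;
      if (used + entryBytes > maxDictBytes) {
        break; // rarer values fall into the plain-encoded tail
      }
      dict.add(e.getKey());
      used += entryBytes;
    }
    return dict;
  }
}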

Cheers,
Micah
