Posted to dev@parquet.apache.org by Kirill Safonov <ki...@gmail.com> on 2016/03/06 16:37:52 UTC

achieving better compression with Parquet

Hi guys,

We’re evaluating Parquet as a high-compression format for our logs. We took ~850 GB of TSV data (some columns are JSON), and Parquet (CompressionCodec.GZIP) gave us 6.8x compression, whereas plain GZip (with Deflater.BEST_COMPRESSION) gave 4.9x (about 1.4 times less) on the same data.

So the questions are:

1) Is this a reasonable compression ratio to expect compared to plain GZip?
2) Since we specifically crafted the Parquet schema with maps and lists for certain fields, is there any tool that shows the sizes of individual Parquet columns so we can find the biggest ones?

Thanks in advance,
 Kirill
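
For reference, a minimal sketch (plain JDK, no Parquet involved) of how the GZip baseline above could be measured; the class and file names are illustrative, and only Deflater.BEST_COMPRESSION and the ratio calculation come from the description above.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;

// GZIPOutputStream uses the default deflate level; this subclass raises it to
// Deflater.BEST_COMPRESSION, matching the baseline described above.
class MaxGzipOutputStream extends GZIPOutputStream {
    MaxGzipOutputStream(OutputStream out) throws IOException {
        super(out);
        def.setLevel(Deflater.BEST_COMPRESSION);
    }
}

public class GzipBaseline {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);                 // e.g. one TSV log file
        Path out = Paths.get(args[0] + ".gz");
        try (InputStream is = Files.newInputStream(in);
             OutputStream os = new MaxGzipOutputStream(Files.newOutputStream(out))) {
            byte[] buf = new byte[1 << 16];
            for (int n; (n = is.read(buf)) > 0; ) {
                os.write(buf, 0, n);
            }
        }
        double ratio = (double) Files.size(in) / Files.size(out);
        System.out.printf("gzip BEST_COMPRESSION ratio: %.1fx%n", ratio);
    }
}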

Re: achieving better compression with Parquet

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Right now, the spec supports storing columns in separate files, but I don't
think the implementation does. It wouldn't be too hard to make that work,
but it isn't supported today.

For predicate push-down in Spark, I've gotten it working and will be
getting the patches into upstream Spark. It mostly works now with a few
settings, but string/binary stats filtering is disabled because of
PARQUET-251. I am also trying to get a few important patches in to help
when writing Parquet files and to avoid OOMs.
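
As a rough illustration (not part of the patches mentioned above), enabling the push-down from the Spark side usually comes down to one setting plus an ordinary filter. The sketch below assumes the Spark 1.6-era Java API; the table path and column name are placeholders, while spark.sql.parquet.filterPushdown is a standard Spark SQL setting.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

import static org.apache.spark.sql.functions.col;

public class PushdownSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("pushdown-sketch"));
        SQLContext sqlContext = new SQLContext(sc);

        // Ask Spark SQL to hand supported filters down to the Parquet readers,
        // so row groups can be skipped based on their statistics.
        sqlContext.setConf("spark.sql.parquet.filterPushdown", "true");

        // Placeholder path and column name.
        DataFrame events = sqlContext.read().parquet("hdfs:///logs/events.parquet");
        long n = events.filter(
                col("timestamp").geq(1458086400000L)               // 2016-03-16 00:00 UTC, millis
                        .and(col("timestamp").lt(1458172800000L))) // 2016-03-17 00:00 UTC
                .count();
        System.out.println(n);

        sc.stop();
    }
}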

rb

On Sat, May 7, 2016 at 2:39 PM, Kirill Safonov <ki...@gmail.com>
wrote:

> Hi Ryan, guys,
>
> Let me please follow up on your last answer. Parquet file can be physically
> stored as a single file (written via WriteSupport) or as a folder with a
> collection of "parallel" files (generated by map-reduce or Spark
> via ParquetOutputFormat).
>
> Will a Spark task processing Parquet input benefit equally from min/max
> stats for both cases (single file vs folder)?
>
> Thanks,
>  Kirill
>
> On Wed, Mar 16, 2016 at 8:30 PM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> > Kirill,
> >
> > Yes, sorting data by the columns you intend to filter by will definitely
> > help query performance because we keep min/max stats for each column
> chunk
> > and page that are used to eliminate row groups when you are passing
> filters
> > into Parquet.
> >
> > rb
> >
> > On Wed, Mar 16, 2016 at 1:07 AM, Kirill Safonov <
> kirill.safonov@gmail.com>
> > wrote:
> >
> > > Antwins,
> > >
> > > Typical query for us is something like ‘Select events where [here come
> > > attributes constraints] and timestamp > 2016-03-16 and timestamp <
> > > 2016-03-17’, that’s why I’m asking if this query can benefit from
> > timestamp
> > > ordering.
> > >
> > > > On 16 Mar 2016, at 03:03, Antwnis <an...@gmail.com> wrote:
> > > >
> > > > Kirill,
> > > >
> > > > I would think that if such a capability is introduced it should be
> > > > `optional` as depending on your query patterns it might make more
> sense
> > > to
> > > > sort on another column.
> > > >
> > > > On Tue, Mar 15, 2016 at 10:18 PM, Kirill Safonov <
> > > kirill.safonov@gmail.com>
> > > > wrote:
> > > >
> > > >> Thanks Ryan,
> > > >>
> > > >> One more question please: as we’re going to store timestamped events
> > in
> > > >> Parquet, would it be beneficial to write the files chronologically
> > > sorted?
> > > >> Namely, will the query for the certain time range over the
> time-sorted
> > > >> Parquet file be optimised so that irrelevant portion of data is
> > skipped
> > > and
> > > >> no "full scan" is done?
> > > >>
> > > >> Kirill
> > > >>
> > > >>> On 14 Mar 2016, at 22:00, Ryan Blue <rb...@netflix.com.INVALID>
> > wrote:
> > > >>>
> > > >>> Adding int64-delta should be weeks. We should also open a bug
> report
> > > for
> > > >>> that line in Spark. It should not fail if an annotation is
> > unsupported.
> > > >> It
> > > >>> should ignore it.
> > > >>>
> > > >>> On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <
> > > >> kirill.safonov@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>>> Thanks for reply Ryan,
> > > >>>>
> > > >>>>> For 2, PLAIN/gzip is the best option for timestamps right now.
> The
> > > >> format
> > > >>>>> 2.0 encodings include a delta-integer encoding that we expect to
> > work
> > > >>>> really well for timestamps, but that hasn't been committed for
> int64
> > > >> yet.
> > > >>>>
> > > >>>> Is there any ETA on when it can appear? Just the order e.g. weeks
> or
> > > >>>> months?
> > > >>>>
> > > >>>>> Also, it should be safe to store timestamps as int64 using the
> > > >>>> TIMESTAMP_MILLIS annotation.
> > > >>>>
> > > >>>> Unfortunately this is not the case for us as the Parquet complains
> > > with
> > > >>>> "Parquet type not yet supported" [1].
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Kirill
> > > >>>>
> > > >>>> [1]:
> > > >>>>
> > > >>>>
> > > >>
> > >
> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
> > > >>>>
> > > >>>> -----Original Message-----
> > > >>>> From: Ryan Blue [mailto:rblue@netflix.com.INVALID]
> > > >>>> Sent: Monday, March 14, 2016 7:44 PM
> > > >>>> To: Parquet Dev
> > > >>>> Subject: Re: achieving better compression with Parquet
> > > >>>>
> > > >>>> Kirill,
> > > >>>>
> > > >>>> For 1, the reported size is just the data size. That doesn't
> include
> > > >> page
> > > >>>> headers, statistics, or dictionary pages. You can see the size of
> > the
> > > >>>> dictionary pages in the dump output, which I would expect is where
> > the
> > > >>>> majority of the difference is.
> > > >>>>
> > > >>>> For 2, PLAIN/gzip is the best option for timestamps right now. The
> > > >> format
> > > >>>> 2.0 encodings include a delta-integer encoding that we expect to
> > work
> > > >>>> really well for timestamps, but that hasn't been committed for
> int64
> > > >> yet.
> > > >>>>
> > > >>>> Also, it should be safe to store timestamps as int64 using the
> > > >>>> TIMESTAMP_MILLIS annotation. That's just a way to keep track of
> what
> > > the
> > > >>>> values you write represent. When there isn't specific support for
> > it,
> > > >> you
> > > >>>> should just get an int64. Using that annotation should give you
> the
> > > >> exact
> > > >>>> same behavior as not using it right now, but when you update to a
> > > >> version
> > > >>>> of Spark that supports it you should be able to get timestamps out
> > of
> > > >> your
> > > >>>> existing data.
> > > >>>>
> > > >>>> rb
> > > >>>>
> > > >>>> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <
> > > >> kirill.safonov@gmail.com>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Thanks for the hint Ryan!
> > > >>>>>
> > > >>>>> I applied the tool to the file and I’ve got some more questions
> if
> > > you
> > > >>>>> don’t mind :-)
> > > >>>>>
> > > >>>>> 1) We’re using 64Mb page (row group) size so I would expect the
> sum
> > > of
> > > >>>>> all the values in “compressed size” field (which is {x} in
> > > >>>>> SZ:{x}/{y}/{z}
> > > >>>>> notation) to be around 64 Mb, but it’s near 48 Mb. Is this
> > expected?
> > > >>>>> 2) One of the largest field is Unix timestamp (we may have lots
> of
> > > >>>>> timestamps for a single data record) which is written as plain
> > int64
> > > >>>>> (we refrained from using OriginalType.TIMESTAMP_MILLIS as it
> seems
> > to
> > > >>>>> be not yet supported by Spark). The tool says that this column is
> > > >>>>> stored with “ENC:PLAIN” encoding (which I suppose is GZipped
> > > >>>>> afterwards). Is this the most compact way to store timestamps or
> > e.g.
> > > >>>>> giving a "OriginalType.TIMESTAMP_MILLIS” or other hint will make
> an
> > > >>>> improvement?
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>> Kirill
> > > >>>>>
> > > >>>>>> On 07 Mar 2016, at 00:26, Ryan Blue <rb...@netflix.com.INVALID>
> > > >> wrote:
> > > >>>>>>
> > > >>>>>> Hi Kirill,
> > > >>>>>>
> > > >>>>>> It's hard to say what the expected compression rate should be
> > since
> > > >>>>> that's
> > > >>>>>> heavily data-dependent. Sounds like Parquet isn't doing too bad,
> > > >>>> though.
> > > >>>>>>
> > > >>>>>> For inspecting the files, check out parquet-tools [1]. That can
> > dump
> > > >>>>>> the metadata from a file all the way down to the page level. The
> > > >> "meta"
> > > >>>>> command
> > > >>>>>> will print out each row group and column within those row
> groups,
> > > >>>>>> which should give you the info you're looking for.
> > > >>>>>>
> > > >>>>>> rb
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> [1]:
> > > >>>>>>
> > > >>>>>
> > > http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparque
> > > >>>>> t-tools%7C1.8.1%7Cjar
> > > >>>>>>
> > > >>>>>> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov
> > > >>>>>> <kirill.safonov@gmail.com
> > > >>>>>>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Hi guys,
> > > >>>>>>>
> > > >>>>>>> We’re evaluating Parquet as the high compression format for our
> > > >>>>>>> logs. We took some ~850Gb of TSV data (some columns are JSON)
> and
> > > >>>>>>> Parquet
> > > >>>>>>> (CompressionCodec.GZIP) gave us 6.8x compression whereas plain
> > GZip
> > > >>>>> (with
> > > >>>>>>> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the
> > same
> > > >>>> data.
> > > >>>>>>>
> > > >>>>>>> So the questions are:
> > > >>>>>>>
> > > >>>>>>> 1) is this somewhat expected compression rate (compared to
> GZip)?
> > > >>>>>>> 2) As we specially crafted Parquet schema with maps and lists
> for
> > > >>>>> certain
> > > >>>>>>> fields, is there any tool to show the sizes of individual
> Parquet
> > > >>>>> columns
> > > >>>>>>> so we can find the biggest ones?
> > > >>>>>>>
> > > >>>>>>> Thanks in advance,
> > > >>>>>>> Kirill
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> --
> > > >>>>>> Ryan Blue
> > > >>>>>> Software Engineer
> > > >>>>>> Netflix
> > > >>>>>
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>> --
> > > >>>> Ryan Blue
> > > >>>> Software Engineer
> > > >>>> Netflix
> > > >>>>
> > > >>>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Ryan Blue
> > > >>> Software Engineer
> > > >>> Netflix
> > > >>
> > > >>
> > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>
>
>
> --
>  kir
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: achieving better compression with Parquet

Posted by Kirill Safonov <ki...@gmail.com>.
Hi Ryan, guys,

Let me follow up on your last answer, please. A Parquet dataset can be
physically stored either as a single file (written via WriteSupport) or as a
folder containing a collection of "parallel" files (generated by MapReduce
or Spark via ParquetOutputFormat).

Will a Spark task processing Parquet input benefit equally from min/max
stats in both cases (single file vs. folder)?

Thanks,
 Kirill

On Wed, Mar 16, 2016 at 8:30 PM, Ryan Blue <rb...@netflix.com.invalid>
wrote:

> Kirill,
>
> Yes, sorting data by the columns you intend to filter by will definitely
> help query performance because we keep min/max stats for each column chunk
> and page that are used to eliminate row groups when you are passing filters
> into Parquet.
>
> rb
>
> On Wed, Mar 16, 2016 at 1:07 AM, Kirill Safonov <ki...@gmail.com>
> wrote:
>
> > Antwins,
> >
> > Typical query for us is something like ‘Select events where [here come
> > attributes constraints] and timestamp > 2016-03-16 and timestamp <
> > 2016-03-17’, that’s why I’m asking if this query can benefit from
> timestamp
> > ordering.
> >
> > > On 16 Mar 2016, at 03:03, Antwnis <an...@gmail.com> wrote:
> > >
> > > Kirill,
> > >
> > > I would think that if such a capability is introduced it should be
> > > `optional` as depending on your query patterns it might make more sense
> > to
> > > sort on another column.
> > >
> > > On Tue, Mar 15, 2016 at 10:18 PM, Kirill Safonov <
> > kirill.safonov@gmail.com>
> > > wrote:
> > >
> > >> Thanks Ryan,
> > >>
> > >> One more question please: as we’re going to store timestamped events
> in
> > >> Parquet, would it be beneficial to write the files chronologically
> > sorted?
> > >> Namely, will the query for the certain time range over the time-sorted
> > >> Parquet file be optimised so that irrelevant portion of data is
> skipped
> > and
> > >> no "full scan" is done?
> > >>
> > >> Kirill
> > >>
> > >>> On 14 Mar 2016, at 22:00, Ryan Blue <rb...@netflix.com.INVALID>
> wrote:
> > >>>
> > >>> Adding int64-delta should be weeks. We should also open a bug report
> > for
> > >>> that line in Spark. It should not fail if an annotation is
> unsupported.
> > >> It
> > >>> should ignore it.
> > >>>
> > >>> On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <
> > >> kirill.safonov@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Thanks for reply Ryan,
> > >>>>
> > >>>>> For 2, PLAIN/gzip is the best option for timestamps right now. The
> > >> format
> > >>>>> 2.0 encodings include a delta-integer encoding that we expect to
> work
> > >>>> really well for timestamps, but that hasn't been committed for int64
> > >> yet.
> > >>>>
> > >>>> Is there any ETA on when it can appear? Just the order e.g. weeks or
> > >>>> months?
> > >>>>
> > >>>>> Also, it should be safe to store timestamps as int64 using the
> > >>>> TIMESTAMP_MILLIS annotation.
> > >>>>
> > >>>> Unfortunately this is not the case for us as the Parquet complains
> > with
> > >>>> "Parquet type not yet supported" [1].
> > >>>>
> > >>>> Thanks,
> > >>>> Kirill
> > >>>>
> > >>>> [1]:
> > >>>>
> > >>>>
> > >>
> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
> > >>>>
> > >>>> -----Original Message-----
> > >>>> From: Ryan Blue [mailto:rblue@netflix.com.INVALID]
> > >>>> Sent: Monday, March 14, 2016 7:44 PM
> > >>>> To: Parquet Dev
> > >>>> Subject: Re: achieving better compression with Parquet
> > >>>>
> > >>>> Kirill,
> > >>>>
> > >>>> For 1, the reported size is just the data size. That doesn't include
> > >> page
> > >>>> headers, statistics, or dictionary pages. You can see the size of
> the
> > >>>> dictionary pages in the dump output, which I would expect is where
> the
> > >>>> majority of the difference is.
> > >>>>
> > >>>> For 2, PLAIN/gzip is the best option for timestamps right now. The
> > >> format
> > >>>> 2.0 encodings include a delta-integer encoding that we expect to
> work
> > >>>> really well for timestamps, but that hasn't been committed for int64
> > >> yet.
> > >>>>
> > >>>> Also, it should be safe to store timestamps as int64 using the
> > >>>> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what
> > the
> > >>>> values you write represent. When there isn't specific support for
> it,
> > >> you
> > >>>> should just get an int64. Using that annotation should give you the
> > >> exact
> > >>>> same behavior as not using it right now, but when you update to a
> > >> version
> > >>>> of Spark that supports it you should be able to get timestamps out
> of
> > >> your
> > >>>> existing data.
> > >>>>
> > >>>> rb
> > >>>>
> > >>>> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <
> > >> kirill.safonov@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Thanks for the hint Ryan!
> > >>>>>
> > >>>>> I applied the tool to the file and I’ve got some more questions if
> > you
> > >>>>> don’t mind :-)
> > >>>>>
> > >>>>> 1) We’re using 64Mb page (row group) size so I would expect the sum
> > of
> > >>>>> all the values in “compressed size” field (which is {x} in
> > >>>>> SZ:{x}/{y}/{z}
> > >>>>> notation) to be around 64 Mb, but it’s near 48 Mb. Is this
> expected?
> > >>>>> 2) One of the largest field is Unix timestamp (we may have lots of
> > >>>>> timestamps for a single data record) which is written as plain
> int64
> > >>>>> (we refrained from using OriginalType.TIMESTAMP_MILLIS as it seems
> to
> > >>>>> be not yet supported by Spark). The tool says that this column is
> > >>>>> stored with “ENC:PLAIN” encoding (which I suppose is GZipped
> > >>>>> afterwards). Is this the most compact way to store timestamps or
> e.g.
> > >>>>> giving a "OriginalType.TIMESTAMP_MILLIS” or other hint will make an
> > >>>> improvement?
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Kirill
> > >>>>>
> > >>>>>> On 07 Mar 2016, at 00:26, Ryan Blue <rb...@netflix.com.INVALID>
> > >> wrote:
> > >>>>>>
> > >>>>>> Hi Kirill,
> > >>>>>>
> > >>>>>> It's hard to say what the expected compression rate should be
> since
> > >>>>> that's
> > >>>>>> heavily data-dependent. Sounds like Parquet isn't doing too bad,
> > >>>> though.
> > >>>>>>
> > >>>>>> For inspecting the files, check out parquet-tools [1]. That can
> dump
> > >>>>>> the metadata from a file all the way down to the page level. The
> > >> "meta"
> > >>>>> command
> > >>>>>> will print out each row group and column within those row groups,
> > >>>>>> which should give you the info you're looking for.
> > >>>>>>
> > >>>>>> rb
> > >>>>>>
> > >>>>>>
> > >>>>>> [1]:
> > >>>>>>
> > >>>>>
> > http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparque
> > >>>>> t-tools%7C1.8.1%7Cjar
> > >>>>>>
> > >>>>>> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov
> > >>>>>> <kirill.safonov@gmail.com
> > >>>>>>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hi guys,
> > >>>>>>>
> > >>>>>>> We’re evaluating Parquet as the high compression format for our
> > >>>>>>> logs. We took some ~850Gb of TSV data (some columns are JSON) and
> > >>>>>>> Parquet
> > >>>>>>> (CompressionCodec.GZIP) gave us 6.8x compression whereas plain
> GZip
> > >>>>> (with
> > >>>>>>> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the
> same
> > >>>> data.
> > >>>>>>>
> > >>>>>>> So the questions are:
> > >>>>>>>
> > >>>>>>> 1) is this somewhat expected compression rate (compared to GZip)?
> > >>>>>>> 2) As we specially crafted Parquet schema with maps and lists for
> > >>>>> certain
> > >>>>>>> fields, is there any tool to show the sizes of individual Parquet
> > >>>>> columns
> > >>>>>>> so we can find the biggest ones?
> > >>>>>>>
> > >>>>>>> Thanks in advance,
> > >>>>>>> Kirill
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> Ryan Blue
> > >>>>>> Software Engineer
> > >>>>>> Netflix
> > >>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Ryan Blue
> > >>>> Software Engineer
> > >>>> Netflix
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Ryan Blue
> > >>> Software Engineer
> > >>> Netflix
> > >>
> > >>
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
 kir

Re: achieving better compression with Parquet

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Kirill,

Yes, sorting data by the columns you intend to filter by will definitely
help query performance: we keep min/max stats for each column chunk and
page, and those stats are used to eliminate row groups when you pass
filters into Parquet.
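
For anyone driving parquet-mr directly rather than through Spark, a minimal sketch of "passing filters into Parquet" with the filter2 API; the column name, time range, and job setup are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class RowGroupFilterSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder column name and time range (millis since the epoch).
        FilterPredicate timeRange = FilterApi.and(
                FilterApi.gtEq(FilterApi.longColumn("timestamp"), 1458086400000L),
                FilterApi.lt(FilterApi.longColumn("timestamp"), 1458172800000L));

        Job job = Job.getInstance(new Configuration());
        // Row groups (and pages) whose min/max stats fall entirely outside the
        // range can be skipped by the reader without being decompressed.
        ParquetInputFormat.setFilterPredicate(job.getConfiguration(), timeRange);
    }
}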

rb

On Wed, Mar 16, 2016 at 1:07 AM, Kirill Safonov <ki...@gmail.com>
wrote:

> Antwins,
>
> Typical query for us is something like ‘Select events where [here come
> attributes constraints] and timestamp > 2016-03-16 and timestamp <
> 2016-03-17’, that’s why I’m asking if this query can benefit from timestamp
> ordering.
>
> > On 16 Mar 2016, at 03:03, Antwnis <an...@gmail.com> wrote:
> >
> > Kirill,
> >
> > I would think that if such a capability is introduced it should be
> > `optional` as depending on your query patterns it might make more sense
> to
> > sort on another column.
> >
> > On Tue, Mar 15, 2016 at 10:18 PM, Kirill Safonov <
> kirill.safonov@gmail.com>
> > wrote:
> >
> >> Thanks Ryan,
> >>
> >> One more question please: as we’re going to store timestamped events in
> >> Parquet, would it be beneficial to write the files chronologically
> sorted?
> >> Namely, will the query for the certain time range over the time-sorted
> >> Parquet file be optimised so that irrelevant portion of data is skipped
> and
> >> no "full scan" is done?
> >>
> >> Kirill
> >>
> >>> On 14 Mar 2016, at 22:00, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >>>
> >>> Adding int64-delta should be weeks. We should also open a bug report
> for
> >>> that line in Spark. It should not fail if an annotation is unsupported.
> >> It
> >>> should ignore it.
> >>>
> >>> On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <
> >> kirill.safonov@gmail.com>
> >>> wrote:
> >>>
> >>>> Thanks for reply Ryan,
> >>>>
> >>>>> For 2, PLAIN/gzip is the best option for timestamps right now. The
> >> format
> >>>>> 2.0 encodings include a delta-integer encoding that we expect to work
> >>>> really well for timestamps, but that hasn't been committed for int64
> >> yet.
> >>>>
> >>>> Is there any ETA on when it can appear? Just the order e.g. weeks or
> >>>> months?
> >>>>
> >>>>> Also, it should be safe to store timestamps as int64 using the
> >>>> TIMESTAMP_MILLIS annotation.
> >>>>
> >>>> Unfortunately this is not the case for us as the Parquet complains
> with
> >>>> "Parquet type not yet supported" [1].
> >>>>
> >>>> Thanks,
> >>>> Kirill
> >>>>
> >>>> [1]:
> >>>>
> >>>>
> >>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
> >>>>
> >>>> -----Original Message-----
> >>>> From: Ryan Blue [mailto:rblue@netflix.com.INVALID]
> >>>> Sent: Monday, March 14, 2016 7:44 PM
> >>>> To: Parquet Dev
> >>>> Subject: Re: achieving better compression with Parquet
> >>>>
> >>>> Kirill,
> >>>>
> >>>> For 1, the reported size is just the data size. That doesn't include
> >> page
> >>>> headers, statistics, or dictionary pages. You can see the size of the
> >>>> dictionary pages in the dump output, which I would expect is where the
> >>>> majority of the difference is.
> >>>>
> >>>> For 2, PLAIN/gzip is the best option for timestamps right now. The
> >> format
> >>>> 2.0 encodings include a delta-integer encoding that we expect to work
> >>>> really well for timestamps, but that hasn't been committed for int64
> >> yet.
> >>>>
> >>>> Also, it should be safe to store timestamps as int64 using the
> >>>> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what
> the
> >>>> values you write represent. When there isn't specific support for it,
> >> you
> >>>> should just get an int64. Using that annotation should give you the
> >> exact
> >>>> same behavior as not using it right now, but when you update to a
> >> version
> >>>> of Spark that supports it you should be able to get timestamps out of
> >> your
> >>>> existing data.
> >>>>
> >>>> rb
> >>>>
> >>>> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <
> >> kirill.safonov@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Thanks for the hint Ryan!
> >>>>>
> >>>>> I applied the tool to the file and I’ve got some more questions if
> you
> >>>>> don’t mind :-)
> >>>>>
> >>>>> 1) We’re using 64Mb page (row group) size so I would expect the sum
> of
> >>>>> all the values in “compressed size” field (which is {x} in
> >>>>> SZ:{x}/{y}/{z}
> >>>>> notation) to be around 64 Mb, but it’s near 48 Mb. Is this expected?
> >>>>> 2) One of the largest field is Unix timestamp (we may have lots of
> >>>>> timestamps for a single data record) which is written as plain int64
> >>>>> (we refrained from using OriginalType.TIMESTAMP_MILLIS as it seems to
> >>>>> be not yet supported by Spark). The tool says that this column is
> >>>>> stored with “ENC:PLAIN” encoding (which I suppose is GZipped
> >>>>> afterwards). Is this the most compact way to store timestamps or e.g.
> >>>>> giving a "OriginalType.TIMESTAMP_MILLIS” or other hint will make an
> >>>> improvement?
> >>>>>
> >>>>> Thanks,
> >>>>> Kirill
> >>>>>
> >>>>>> On 07 Mar 2016, at 00:26, Ryan Blue <rb...@netflix.com.INVALID>
> >> wrote:
> >>>>>>
> >>>>>> Hi Kirill,
> >>>>>>
> >>>>>> It's hard to say what the expected compression rate should be since
> >>>>> that's
> >>>>>> heavily data-dependent. Sounds like Parquet isn't doing too bad,
> >>>> though.
> >>>>>>
> >>>>>> For inspecting the files, check out parquet-tools [1]. That can dump
> >>>>>> the metadata from a file all the way down to the page level. The
> >> "meta"
> >>>>> command
> >>>>>> will print out each row group and column within those row groups,
> >>>>>> which should give you the info you're looking for.
> >>>>>>
> >>>>>> rb
> >>>>>>
> >>>>>>
> >>>>>> [1]:
> >>>>>>
> >>>>>
> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparque
> >>>>> t-tools%7C1.8.1%7Cjar
> >>>>>>
> >>>>>> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov
> >>>>>> <kirill.safonov@gmail.com
> >>>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi guys,
> >>>>>>>
> >>>>>>> We’re evaluating Parquet as the high compression format for our
> >>>>>>> logs. We took some ~850Gb of TSV data (some columns are JSON) and
> >>>>>>> Parquet
> >>>>>>> (CompressionCodec.GZIP) gave us 6.8x compression whereas plain GZip
> >>>>> (with
> >>>>>>> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the same
> >>>> data.
> >>>>>>>
> >>>>>>> So the questions are:
> >>>>>>>
> >>>>>>> 1) is this somewhat expected compression rate (compared to GZip)?
> >>>>>>> 2) As we specially crafted Parquet schema with maps and lists for
> >>>>> certain
> >>>>>>> fields, is there any tool to show the sizes of individual Parquet
> >>>>> columns
> >>>>>>> so we can find the biggest ones?
> >>>>>>>
> >>>>>>> Thanks in advance,
> >>>>>>> Kirill
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Ryan Blue
> >>>>>> Software Engineer
> >>>>>> Netflix
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>> Software Engineer
> >>>> Netflix
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Ryan Blue
> >>> Software Engineer
> >>> Netflix
> >>
> >>
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: achieving better compression with Parquet

Posted by Kirill Safonov <ki...@gmail.com>.
Antwnis,

A typical query for us is something like ‘Select events where [attribute constraints go here] and timestamp > 2016-03-16 and timestamp < 2016-03-17’; that’s why I’m asking whether such a query can benefit from timestamp ordering.

> On 16 Mar 2016, at 03:03, Antwnis <an...@gmail.com> wrote:
> 
> Kirill,
> 
> I would think that if such a capability is introduced it should be
> `optional` as depending on your query patterns it might make more sense to
> sort on another column.
> 
> On Tue, Mar 15, 2016 at 10:18 PM, Kirill Safonov <ki...@gmail.com>
> wrote:
> 
>> Thanks Ryan,
>> 
>> One more question please: as we’re going to store timestamped events in
>> Parquet, would it be beneficial to write the files chronologically sorted?
>> Namely, will the query for the certain time range over the time-sorted
>> Parquet file be optimised so that irrelevant portion of data is skipped and
>> no "full scan" is done?
>> 
>> Kirill
>> 
>>> On 14 Mar 2016, at 22:00, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>> 
>>> Adding int64-delta should be weeks. We should also open a bug report for
>>> that line in Spark. It should not fail if an annotation is unsupported.
>> It
>>> should ignore it.
>>> 
>>> On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <
>> kirill.safonov@gmail.com>
>>> wrote:
>>> 
>>>> Thanks for reply Ryan,
>>>> 
>>>>> For 2, PLAIN/gzip is the best option for timestamps right now. The
>> format
>>>>> 2.0 encodings include a delta-integer encoding that we expect to work
>>>> really well for timestamps, but that hasn't been committed for int64
>> yet.
>>>> 
>>>> Is there any ETA on when it can appear? Just the order e.g. weeks or
>>>> months?
>>>> 
>>>>> Also, it should be safe to store timestamps as int64 using the
>>>> TIMESTAMP_MILLIS annotation.
>>>> 
>>>> Unfortunately this is not the case for us as the Parquet complains with
>>>> "Parquet type not yet supported" [1].
>>>> 
>>>> Thanks,
>>>> Kirill
>>>> 
>>>> [1]:
>>>> 
>>>> 
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
>>>> 
>>>> -----Original Message-----
>>>> From: Ryan Blue [mailto:rblue@netflix.com.INVALID]
>>>> Sent: Monday, March 14, 2016 7:44 PM
>>>> To: Parquet Dev
>>>> Subject: Re: achieving better compression with Parquet
>>>> 
>>>> Kirill,
>>>> 
>>>> For 1, the reported size is just the data size. That doesn't include
>> page
>>>> headers, statistics, or dictionary pages. You can see the size of the
>>>> dictionary pages in the dump output, which I would expect is where the
>>>> majority of the difference is.
>>>> 
>>>> For 2, PLAIN/gzip is the best option for timestamps right now. The
>> format
>>>> 2.0 encodings include a delta-integer encoding that we expect to work
>>>> really well for timestamps, but that hasn't been committed for int64
>> yet.
>>>> 
>>>> Also, it should be safe to store timestamps as int64 using the
>>>> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what the
>>>> values you write represent. When there isn't specific support for it,
>> you
>>>> should just get an int64. Using that annotation should give you the
>> exact
>>>> same behavior as not using it right now, but when you update to a
>> version
>>>> of Spark that supports it you should be able to get timestamps out of
>> your
>>>> existing data.
>>>> 
>>>> rb
>>>> 
>>>> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <
>> kirill.safonov@gmail.com>
>>>> wrote:
>>>> 
>>>>> Thanks for the hint Ryan!
>>>>> 
>>>>> I applied the tool to the file and I’ve got some more questions if you
>>>>> don’t mind :-)
>>>>> 
>>>>> 1) We’re using 64Mb page (row group) size so I would expect the sum of
>>>>> all the values in “compressed size” field (which is {x} in
>>>>> SZ:{x}/{y}/{z}
>>>>> notation) to be around 64 Mb, but it’s near 48 Mb. Is this expected?
>>>>> 2) One of the largest field is Unix timestamp (we may have lots of
>>>>> timestamps for a single data record) which is written as plain int64
>>>>> (we refrained from using OriginalType.TIMESTAMP_MILLIS as it seems to
>>>>> be not yet supported by Spark). The tool says that this column is
>>>>> stored with “ENC:PLAIN” encoding (which I suppose is GZipped
>>>>> afterwards). Is this the most compact way to store timestamps or e.g.
>>>>> giving a "OriginalType.TIMESTAMP_MILLIS” or other hint will make an
>>>> improvement?
>>>>> 
>>>>> Thanks,
>>>>> Kirill
>>>>> 
>>>>>> On 07 Mar 2016, at 00:26, Ryan Blue <rb...@netflix.com.INVALID>
>> wrote:
>>>>>> 
>>>>>> Hi Kirill,
>>>>>> 
>>>>>> It's hard to say what the expected compression rate should be since
>>>>> that's
>>>>>> heavily data-dependent. Sounds like Parquet isn't doing too bad,
>>>> though.
>>>>>> 
>>>>>> For inspecting the files, check out parquet-tools [1]. That can dump
>>>>>> the metadata from a file all the way down to the page level. The
>> "meta"
>>>>> command
>>>>>> will print out each row group and column within those row groups,
>>>>>> which should give you the info you're looking for.
>>>>>> 
>>>>>> rb
>>>>>> 
>>>>>> 
>>>>>> [1]:
>>>>>> 
>>>>> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparque
>>>>> t-tools%7C1.8.1%7Cjar
>>>>>> 
>>>>>> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov
>>>>>> <kirill.safonov@gmail.com
>>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi guys,
>>>>>>> 
>>>>>>> We’re evaluating Parquet as the high compression format for our
>>>>>>> logs. We took some ~850Gb of TSV data (some columns are JSON) and
>>>>>>> Parquet
>>>>>>> (CompressionCodec.GZIP) gave us 6.8x compression whereas plain GZip
>>>>> (with
>>>>>>> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the same
>>>> data.
>>>>>>> 
>>>>>>> So the questions are:
>>>>>>> 
>>>>>>> 1) is this somewhat expected compression rate (compared to GZip)?
>>>>>>> 2) As we specially crafted Parquet schema with maps and lists for
>>>>> certain
>>>>>>> fields, is there any tool to show the sizes of individual Parquet
>>>>> columns
>>>>>>> so we can find the biggest ones?
>>>>>>> 
>>>>>>> Thanks in advance,
>>>>>>> Kirill
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>> 
>> 


Re: achieving better compression with Parquet

Posted by Antwnis <an...@gmail.com>.
Kirill,

I would think that if such a capability is introduced it should be
`optional`, since depending on your query patterns it might make more sense
to sort on another column.

On Tue, Mar 15, 2016 at 10:18 PM, Kirill Safonov <ki...@gmail.com>
wrote:

> Thanks Ryan,
>
> One more question please: as we’re going to store timestamped events in
> Parquet, would it be beneficial to write the files chronologically sorted?
> Namely, will the query for the certain time range over the time-sorted
> Parquet file be optimised so that irrelevant portion of data is skipped and
> no "full scan" is done?
>
> Kirill
>
> > On 14 Mar 2016, at 22:00, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >
> > Adding int64-delta should be weeks. We should also open a bug report for
> > that line in Spark. It should not fail if an annotation is unsupported.
> It
> > should ignore it.
> >
> > On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <
> kirill.safonov@gmail.com>
> > wrote:
> >
> >> Thanks for reply Ryan,
> >>
> >>> For 2, PLAIN/gzip is the best option for timestamps right now. The
> format
> >>> 2.0 encodings include a delta-integer encoding that we expect to work
> >> really well for timestamps, but that hasn't been committed for int64
> yet.
> >>
> >> Is there any ETA on when it can appear? Just the order e.g. weeks or
> >> months?
> >>
> >>> Also, it should be safe to store timestamps as int64 using the
> >> TIMESTAMP_MILLIS annotation.
> >>
> >> Unfortunately this is not the case for us as the Parquet complains with
> >> "Parquet type not yet supported" [1].
> >>
> >> Thanks,
> >> Kirill
> >>
> >> [1]:
> >>
> >>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
> >>
> >> -----Original Message-----
> >> From: Ryan Blue [mailto:rblue@netflix.com.INVALID]
> >> Sent: Monday, March 14, 2016 7:44 PM
> >> To: Parquet Dev
> >> Subject: Re: achieving better compression with Parquet
> >>
> >> Kirill,
> >>
> >> For 1, the reported size is just the data size. That doesn't include
> page
> >> headers, statistics, or dictionary pages. You can see the size of the
> >> dictionary pages in the dump output, which I would expect is where the
> >> majority of the difference is.
> >>
> >> For 2, PLAIN/gzip is the best option for timestamps right now. The
> format
> >> 2.0 encodings include a delta-integer encoding that we expect to work
> >> really well for timestamps, but that hasn't been committed for int64
> yet.
> >>
> >> Also, it should be safe to store timestamps as int64 using the
> >> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what the
> >> values you write represent. When there isn't specific support for it,
> you
> >> should just get an int64. Using that annotation should give you the
> exact
> >> same behavior as not using it right now, but when you update to a
> version
> >> of Spark that supports it you should be able to get timestamps out of
> your
> >> existing data.
> >>
> >> rb
> >>
> >> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <
> kirill.safonov@gmail.com>
> >> wrote:
> >>
> >>> Thanks for the hint Ryan!
> >>>
> >>> I applied the tool to the file and I’ve got some more questions if you
> >>> don’t mind :-)
> >>>
> >>> 1) We’re using 64Mb page (row group) size so I would expect the sum of
> >>> all the values in “compressed size” field (which is {x} in
> >>> SZ:{x}/{y}/{z}
> >>> notation) to be around 64 Mb, but it’s near 48 Mb. Is this expected?
> >>> 2) One of the largest field is Unix timestamp (we may have lots of
> >>> timestamps for a single data record) which is written as plain int64
> >>> (we refrained from using OriginalType.TIMESTAMP_MILLIS as it seems to
> >>> be not yet supported by Spark). The tool says that this column is
> >>> stored with “ENC:PLAIN” encoding (which I suppose is GZipped
> >>> afterwards). Is this the most compact way to store timestamps or e.g.
> >>> giving a "OriginalType.TIMESTAMP_MILLIS” or other hint will make an
> >> improvement?
> >>>
> >>> Thanks,
> >>> Kirill
> >>>
> >>>> On 07 Mar 2016, at 00:26, Ryan Blue <rb...@netflix.com.INVALID>
> wrote:
> >>>>
> >>>> Hi Kirill,
> >>>>
> >>>> It's hard to say what the expected compression rate should be since
> >>> that's
> >>>> heavily data-dependent. Sounds like Parquet isn't doing too bad,
> >> though.
> >>>>
> >>>> For inspecting the files, check out parquet-tools [1]. That can dump
> >>>> the metadata from a file all the way down to the page level. The
> "meta"
> >>> command
> >>>> will print out each row group and column within those row groups,
> >>>> which should give you the info you're looking for.
> >>>>
> >>>> rb
> >>>>
> >>>>
> >>>> [1]:
> >>>>
> >>> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparque
> >>> t-tools%7C1.8.1%7Cjar
> >>>>
> >>>> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov
> >>>> <kirill.safonov@gmail.com
> >>>>
> >>>> wrote:
> >>>>
> >>>>> Hi guys,
> >>>>>
> >>>>> We’re evaluating Parquet as the high compression format for our
> >>>>> logs. We took some ~850Gb of TSV data (some columns are JSON) and
> >>>>> Parquet
> >>>>> (CompressionCodec.GZIP) gave us 6.8x compression whereas plain GZip
> >>> (with
> >>>>> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the same
> >> data.
> >>>>>
> >>>>> So the questions are:
> >>>>>
> >>>>> 1) is this somewhat expected compression rate (compared to GZip)?
> >>>>> 2) As we specially crafted Parquet schema with maps and lists for
> >>> certain
> >>>>> fields, is there any tool to show the sizes of individual Parquet
> >>> columns
> >>>>> so we can find the biggest ones?
> >>>>>
> >>>>> Thanks in advance,
> >>>>> Kirill
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>> Software Engineer
> >>>> Netflix
> >>>
> >>>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >>
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>

Re: achieving better compression with Parquet

Posted by Kirill Safonov <ki...@gmail.com>.
Thanks Ryan,

One more question, please: since we’re going to store timestamped events in Parquet, would it be beneficial to write the files chronologically sorted? Namely, will a query for a certain time range over a time-sorted Parquet file be optimised so that the irrelevant portion of the data is skipped and no "full scan" is done?
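
To make the question concrete, a hedged sketch of what "writing the files chronologically sorted" could look like from Spark; the column name, paths, and source format are placeholders, and this only illustrates the intent, it is not a recommendation from the list.

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SortedWriteSketch {
    // Writes events ordered by "timestamp" so each output file, and each row
    // group inside it, covers a narrow contiguous time range; that keeps the
    // per-column min/max stats selective for time-range queries.
    static void writeSorted(SQLContext sqlContext, String inputPath, String outputPath) {
        DataFrame events = sqlContext.read().json(inputPath);   // placeholder source
        events.sort("timestamp")
              .write()
              .parquet(outputPath);
    }
}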

Kirill

> On 14 Mar 2016, at 22:00, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> Adding int64-delta should be weeks. We should also open a bug report for
> that line in Spark. It should not fail if an annotation is unsupported. It
> should ignore it.
> 
> On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <ki...@gmail.com>
> wrote:
> 
>> Thanks for reply Ryan,
>> 
>>> For 2, PLAIN/gzip is the best option for timestamps right now. The format
>>> 2.0 encodings include a delta-integer encoding that we expect to work
>> really well for timestamps, but that hasn't been committed for int64 yet.
>> 
>> Is there any ETA on when it can appear? Just the order e.g. weeks or
>> months?
>> 
>>> Also, it should be safe to store timestamps as int64 using the
>> TIMESTAMP_MILLIS annotation.
>> 
>> Unfortunately this is not the case for us as the Parquet complains with
>> "Parquet type not yet supported" [1].
>> 
>> Thanks,
>> Kirill
>> 
>> [1]:
>> 
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
>> 
>> -----Original Message-----
>> From: Ryan Blue [mailto:rblue@netflix.com.INVALID]
>> Sent: Monday, March 14, 2016 7:44 PM
>> To: Parquet Dev
>> Subject: Re: achieving better compression with Parquet
>> 
>> Kirill,
>> 
>> For 1, the reported size is just the data size. That doesn't include page
>> headers, statistics, or dictionary pages. You can see the size of the
>> dictionary pages in the dump output, which I would expect is where the
>> majority of the difference is.
>> 
>> For 2, PLAIN/gzip is the best option for timestamps right now. The format
>> 2.0 encodings include a delta-integer encoding that we expect to work
>> really well for timestamps, but that hasn't been committed for int64 yet.
>> 
>> Also, it should be safe to store timestamps as int64 using the
>> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what the
>> values you write represent. When there isn't specific support for it, you
>> should just get an int64. Using that annotation should give you the exact
>> same behavior as not using it right now, but when you update to a version
>> of Spark that supports it you should be able to get timestamps out of your
>> existing data.
>> 
>> rb
>> 
>> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <ki...@gmail.com>
>> wrote:
>> 
>>> Thanks for the hint Ryan!
>>> 
>>> I applied the tool to the file and I’ve got some more questions if you
>>> don’t mind :-)
>>> 
>>> 1) We’re using 64Mb page (row group) size so I would expect the sum of
>>> all the values in “compressed size” field (which is {x} in
>>> SZ:{x}/{y}/{z}
>>> notation) to be around 64 Mb, but it’s near 48 Mb. Is this expected?
>>> 2) One of the largest field is Unix timestamp (we may have lots of
>>> timestamps for a single data record) which is written as plain int64
>>> (we refrained from using OriginalType.TIMESTAMP_MILLIS as it seems to
>>> be not yet supported by Spark). The tool says that this column is
>>> stored with “ENC:PLAIN” encoding (which I suppose is GZipped
>>> afterwards). Is this the most compact way to store timestamps or e.g.
>>> giving a "OriginalType.TIMESTAMP_MILLIS” or other hint will make an
>> improvement?
>>> 
>>> Thanks,
>>> Kirill
>>> 
>>>> On 07 Mar 2016, at 00:26, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>>> 
>>>> Hi Kirill,
>>>> 
>>>> It's hard to say what the expected compression rate should be since
>>> that's
>>>> heavily data-dependent. Sounds like Parquet isn't doing too bad,
>> though.
>>>> 
>>>> For inspecting the files, check out parquet-tools [1]. That can dump
>>>> the metadata from a file all the way down to the page level. The "meta"
>>> command
>>>> will print out each row group and column within those row groups,
>>>> which should give you the info you're looking for.
>>>> 
>>>> rb
>>>> 
>>>> 
>>>> [1]:
>>>> 
>>> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparque
>>> t-tools%7C1.8.1%7Cjar
>>>> 
>>>> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov
>>>> <kirill.safonov@gmail.com
>>>> 
>>>> wrote:
>>>> 
>>>>> Hi guys,
>>>>> 
>>>>> We’re evaluating Parquet as the high compression format for our
>>>>> logs. We took some ~850Gb of TSV data (some columns are JSON) and
>>>>> Parquet
>>>>> (CompressionCodec.GZIP) gave us 6.8x compression whereas plain GZip
>>> (with
>>>>> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the same
>> data.
>>>>> 
>>>>> So the questions are:
>>>>> 
>>>>> 1) is this somewhat expected compression rate (compared to GZip)?
>>>>> 2) As we specially crafted Parquet schema with maps and lists for
>>> certain
>>>>> fields, is there any tool to show the sizes of individual Parquet
>>> columns
>>>>> so we can find the biggest ones?
>>>>> 
>>>>> Thanks in advance,
>>>>> Kirill
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>> 
>>> 
>> 
>> 
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>> 
>> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: achieving better compression with Parquet

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Adding int64-delta should be a matter of weeks. We should also open a bug
report for that line in Spark: it should not fail when an annotation is
unsupported, it should ignore it.

On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <ki...@gmail.com>
wrote:

> Thanks for reply Ryan,
>
> > For 2, PLAIN/gzip is the best option for timestamps right now. The format
> > 2.0 encodings include a delta-integer encoding that we expect to work
> really well for timestamps, but that hasn't been committed for int64 yet.
>
> Is there any ETA on when it can appear? Just the order e.g. weeks or
> months?
>
> > Also, it should be safe to store timestamps as int64 using the
> TIMESTAMP_MILLIS annotation.
>
> Unfortunately this is not the case for us as the Parquet complains with
> "Parquet type not yet supported" [1].
>
> Thanks,
>  Kirill
>
> [1]:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
>
> -----Original Message-----
> From: Ryan Blue [mailto:rblue@netflix.com.INVALID]
> Sent: Monday, March 14, 2016 7:44 PM
> To: Parquet Dev
> Subject: Re: achieving better compression with Parquet
>
> Kirill,
>
> For 1, the reported size is just the data size. That doesn't include page
> headers, statistics, or dictionary pages. You can see the size of the
> dictionary pages in the dump output, which I would expect is where the
> majority of the difference is.
>
> For 2, PLAIN/gzip is the best option for timestamps right now. The format
> 2.0 encodings include a delta-integer encoding that we expect to work
> really well for timestamps, but that hasn't been committed for int64 yet.
>
> Also, it should be safe to store timestamps as int64 using the
> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what the
> values you write represent. When there isn't specific support for it, you
> should just get an int64. Using that annotation should give you the exact
> same behavior as not using it right now, but when you update to a version
> of Spark that supports it you should be able to get timestamps out of your
> existing data.
>
> rb
>
> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <ki...@gmail.com>
> wrote:
>
> > Thanks for the hint Ryan!
> >
> > I applied the tool to the file and I’ve got some more questions if you
> > don’t mind :-)
> >
> > 1) We’re using 64Mb page (row group) size so I would expect the sum of
> > all the values in “compressed size” field (which is {x} in
> > SZ:{x}/{y}/{z}
> > notation) to be around 64 Mb, but it’s near 48 Mb. Is this expected?
> > 2) One of the largest field is Unix timestamp (we may have lots of
> > timestamps for a single data record) which is written as plain int64
> > (we refrained from using OriginalType.TIMESTAMP_MILLIS as it seems to
> > be not yet supported by Spark). The tool says that this column is
> > stored with “ENC:PLAIN” encoding (which I suppose is GZipped
> > afterwards). Is this the most compact way to store timestamps or e.g.
> > giving a "OriginalType.TIMESTAMP_MILLIS” or other hint will make an
> improvement?
> >
> > Thanks,
> >  Kirill
> >
> > > On 07 Mar 2016, at 00:26, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> > >
> > > Hi Kirill,
> > >
> > > It's hard to say what the expected compression rate should be since
> > that's
> > > heavily data-dependent. Sounds like Parquet isn't doing too bad,
> though.
> > >
> > > For inspecting the files, check out parquet-tools [1]. That can dump
> > > the metadata from a file all the way down to the page level. The "meta"
> > command
> > > will print out each row group and column within those row groups,
> > > which should give you the info you're looking for.
> > >
> > > rb
> > >
> > >
> > > [1]:
> > >
> > http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparque
> > t-tools%7C1.8.1%7Cjar
> > >
> > > On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov
> > > <kirill.safonov@gmail.com
> > >
> > > wrote:
> > >
> > >> Hi guys,
> > >>
> > >> We’re evaluating Parquet as the high compression format for our
> > >> logs. We took some ~850Gb of TSV data (some columns are JSON) and
> > >> Parquet
> > >> (CompressionCodec.GZIP) gave us 6.8x compression whereas plain GZip
> > (with
> > >> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the same
> data.
> > >>
> > >> So the questions are:
> > >>
> > >> 1) is this somewhat expected compression rate (compared to GZip)?
> > >> 2) As we specially crafted Parquet schema with maps and lists for
> > certain
> > >> fields, is there any tool to show the sizes of individual Parquet
> > columns
> > >> so we can find the biggest ones?
> > >>
> > >> Thanks in advance,
> > >> Kirill
> > >
> > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>


-- 
Ryan Blue
Software Engineer
Netflix

RE: achieving better compression with Parquet

Posted by Kirill Safonov <ki...@gmail.com>.
Thanks for the reply, Ryan,

> For 2, PLAIN/gzip is the best option for timestamps right now. The format
> 2.0 encodings include a delta-integer encoding that we expect to work really well for timestamps, but that hasn't been committed for int64 yet.

Is there any ETA on when it might appear? Just the order of magnitude, e.g. weeks or months?

> Also, it should be safe to store timestamps as int64 using the TIMESTAMP_MILLIS annotation.

Unfortunately this is not the case for us, as Spark's Parquet support complains with "Parquet type not yet supported" [1].

Thanks,
 Kirill

[1]:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161

-----Original Message-----
From: Ryan Blue [mailto:rblue@netflix.com.INVALID] 
Sent: Monday, March 14, 2016 7:44 PM
To: Parquet Dev
Subject: Re: achieving better compression with Parquet

Kirill,

For 1, the reported size is just the data size. That doesn't include page headers, statistics, or dictionary pages. You can see the size of the dictionary pages in the dump output, which I would expect is where the majority of the difference is.

For 2, PLAIN/gzip is the best option for timestamps right now. The format
2.0 encodings include a delta-integer encoding that we expect to work really well for timestamps, but that hasn't been committed for int64 yet.

Also, it should be safe to store timestamps as int64 using the TIMESTAMP_MILLIS annotation. That's just a way to keep track of what the values you write represent. When there isn't specific support for it, you should just get an int64. Using that annotation should give you the exact same behavior as not using it right now, but when you update to a version of Spark that supports it you should be able to get timestamps out of your existing data.

rb

On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <ki...@gmail.com>
wrote:

> Thanks for the hint Ryan!
>
> I applied the tool to the file and I’ve got some more questions if you 
> don’t mind :-)
>
> 1) We’re using 64Mb page (row group) size so I would expect the sum of 
> all the values in “compressed size” field (which is {x} in 
> SZ:{x}/{y}/{z}
> notation) to be around 64 Mb, but it’s near 48 Mb. Is this expected?
> 2) One of the largest field is Unix timestamp (we may have lots of 
> timestamps for a single data record) which is written as plain int64 
> (we refrained from using OriginalType.TIMESTAMP_MILLIS as it seems to 
> be not yet supported by Spark). The tool says that this column is 
> stored with “ENC:PLAIN” encoding (which I suppose is GZipped 
> afterwards). Is this the most compact way to store timestamps or e.g. 
> giving a "OriginalType.TIMESTAMP_MILLIS” or other hint will make an improvement?
>
> Thanks,
>  Kirill
>
> > On 07 Mar 2016, at 00:26, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >
> > Hi Kirill,
> >
> > It's hard to say what the expected compression rate should be since
> that's
> > heavily data-dependent. Sounds like Parquet isn't doing too bad, though.
> >
> > For inspecting the files, check out parquet-tools [1]. That can dump 
> > the metadata from a file all the way down to the page level. The "meta"
> command
> > will print out each row group and column within those row groups, 
> > which should give you the info you're looking for.
> >
> > rb
> >
> >
> > [1]:
> >
> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparque
> t-tools%7C1.8.1%7Cjar
> >
> > On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov 
> > <kirill.safonov@gmail.com
> >
> > wrote:
> >
> >> Hi guys,
> >>
> >> We’re evaluating Parquet as the high compression format for our 
> >> logs. We took some ~850Gb of TSV data (some columns are JSON) and 
> >> Parquet
> >> (CompressionCodec.GZIP) gave us 6.8x compression whereas plain GZip
> (with
> >> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the same data.
> >>
> >> So the questions are:
> >>
> >> 1) is this somewhat expected compression rate (compared to GZip)?
> >> 2) As we specially crafted Parquet schema with maps and lists for
> certain
> >> fields, is there any tool to show the sizes of individual Parquet
> columns
> >> so we can find the biggest ones?
> >>
> >> Thanks in advance,
> >> Kirill
> >
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>


--
Ryan Blue
Software Engineer
Netflix


Re: achieving better compression with Parquet

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Kirill,

For 1, the reported size is just the data size. That doesn't include page
headers, statistics, or dictionary pages. You can see the size of the
dictionary pages in the dump output, which I would expect is where the
majority of the difference is.

For 2, PLAIN/gzip is the best option for timestamps right now. The format
2.0 encodings include a delta-integer encoding that we expect to work
really well for timestamps, but that hasn't been committed for int64 yet.

Also, it should be safe to store timestamps as int64 using the
TIMESTAMP_MILLIS annotation. That's just a way to keep track of what the
values you write represent. When there isn't specific support for it, you
should just get an int64. Using that annotation should give you the exact
same behavior as not using it right now, but when you update to a version
of Spark that supports it you should be able to get timestamps out of your
existing data.
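
A minimal sketch of what the annotation looks like in a parquet-mr schema; the message and field names are placeholders.

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class TimestampSchemaSketch {
    public static void main(String[] args) {
        // The physical type stays int64; TIMESTAMP_MILLIS only records what the
        // values mean, so a reader without specific support still sees plain longs.
        MessageType schema = MessageTypeParser.parseMessageType(
                "message event {\n"
              + "  required int64 event_time (TIMESTAMP_MILLIS);\n"
              + "}");
        System.out.println(schema);
    }
}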

rb

On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <ki...@gmail.com>
wrote:

> Thanks for the hint Ryan!
>
> I applied the tool to the file and I’ve got some more questions if you
> don’t mind :-)
>
> 1) We’re using 64Mb page (row group) size so I would expect the sum of all
> the values in “compressed size” field (which is {x} in SZ:{x}/{y}/{z}
> notation) to be around 64 Mb, but it’s near 48 Mb. Is this expected?
> 2) One of the largest field is Unix timestamp (we may have lots of
> timestamps for a single data record) which is written as plain int64 (we
> refrained from using OriginalType.TIMESTAMP_MILLIS as it seems to be not
> yet supported by Spark). The tool says that this column is stored with
> “ENC:PLAIN” encoding (which I suppose is GZipped afterwards). Is this the
> most compact way to store timestamps or e.g. giving a
> "OriginalType.TIMESTAMP_MILLIS” or other hint will make an improvement?
>
> Thanks,
>  Kirill
>
> > On 07 Mar 2016, at 00:26, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >
> > Hi Kirill,
> >
> > It's hard to say what the expected compression rate should be since
> that's
> > heavily data-dependent. Sounds like Parquet isn't doing too bad, though.
> >
> > For inspecting the files, check out parquet-tools [1]. That can dump the
> > metadata from a file all the way down to the page level. The "meta"
> command
> > will print out each row group and column within those row groups, which
> > should give you the info you're looking for.
> >
> > rb
> >
> >
> > [1]:
> >
> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar
> >
> > On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <kirill.safonov@gmail.com
> >
> > wrote:
> >
> >> Hi guys,
> >>
> >> We’re evaluating Parquet as the high compression format for our logs. We
> >> took some ~850Gb of TSV data (some columns are JSON) and Parquet
> >> (CompressionCodec.GZIP) gave us 6.8x compression whereas plain GZip
> (with
> >> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the same data.
> >>
> >> So the questions are:
> >>
> >> 1) is this somewhat expected compression rate (compared to GZip)?
> >> 2) As we specially crafted Parquet schema with maps and lists for
> certain
> >> fields, is there any tool to show the sizes of individual Parquet
> columns
> >> so we can find the biggest ones?
> >>
> >> Thanks in advance,
> >> Kirill
> >
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: achieving better compression with Parquet

Posted by Kirill Safonov <ki...@gmail.com>.
Thanks for the hint Ryan!

I applied the tool to the file, and I’ve got some more questions if you don’t mind :-)

1) We’re using a 64 MB page (row group) size, so I would expect the sum of all the values in the “compressed size” field (the {x} in SZ:{x}/{y}/{z} notation) to be around 64 MB, but it’s closer to 48 MB. Is this expected? (The writer setup is sketched below.)
2) One of the largest fields is a Unix timestamp (we may have lots of timestamps for a single data record), which is written as a plain int64 (we refrained from using OriginalType.TIMESTAMP_MILLIS as it seems to be not yet supported by Spark). The tool says that this column is stored with the “ENC:PLAIN” encoding (which I suppose is GZipped afterwards). Is this the most compact way to store timestamps, or would giving an OriginalType.TIMESTAMP_MILLIS or other hint make an improvement?
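
For reference, a hedged sketch of the kind of writer setup being described (64 MB row groups, GZIP codec), using the classic parquet-mr ParquetWriter constructor; the WriteSupport implementation is left to the caller and the rest is a placeholder.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class WriterSetupSketch {
    // Opens a GZIP-compressed writer with 64 MB row groups; the caller supplies
    // its own WriteSupport implementation for its record type T.
    static <T> ParquetWriter<T> openWriter(Path file, WriteSupport<T> writeSupport)
            throws IOException {
        int rowGroupSize = 64 * 1024 * 1024;               // 64 MB row groups, as above
        int pageSize = ParquetWriter.DEFAULT_PAGE_SIZE;    // default data page size
        return new ParquetWriter<T>(file, writeSupport,
                CompressionCodecName.GZIP, rowGroupSize, pageSize);
    }
}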

Thanks,
 Kirill

> On 07 Mar 2016, at 00:26, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> Hi Kirill,
> 
> It's hard to say what the expected compression rate should be since that's
> heavily data-dependent. Sounds like Parquet isn't doing too bad, though.
> 
> For inspecting the files, check out parquet-tools [1]. That can dump the
> metadata from a file all the way down to the page level. The "meta" command
> will print out each row group and column within those row groups, which
> should give you the info you're looking for.
> 
> rb
> 
> 
> [1]:
> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar
> 
> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <ki...@gmail.com>
> wrote:
> 
>> Hi guys,
>> 
>> We’re evaluating Parquet as the high compression format for our logs. We
>> took some ~850Gb of TSV data (some columns are JSON) and Parquet
>> (CompressionCodec.GZIP) gave us 6.8x compression whereas plain GZip (with
>> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the same data.
>> 
>> So the questions are:
>> 
>> 1) is this somewhat expected compression rate (compared to GZip)?
>> 2) As we specially crafted Parquet schema with maps and lists for certain
>> fields, is there any tool to show the sizes of individual Parquet columns
>> so we can find the biggest ones?
>> 
>> Thanks in advance,
>> Kirill
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: achieving better compression with Parquet

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Kirill,

It's hard to say what the expected compression rate should be since that's
heavily data-dependent. Sounds like Parquet isn't doing too bad, though.

For inspecting the files, check out parquet-tools [1]. That can dump the
metadata from a file all the way down to the page level. The "meta" command
will print out each row group and column within those row groups, which
should give you the info you're looking for.
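
The same per-column size information can also be pulled out programmatically from the file footer with parquet-mr; a hedged sketch, with the file path taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class ColumnSizesSketch {
    public static void main(String[] args) throws Exception {
        // Only the footer is read, so this stays cheap even for large files.
        ParquetMetadata footer =
                ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));
        for (BlockMetaData rowGroup : footer.getBlocks()) {
            for (ColumnChunkMetaData column : rowGroup.getColumns()) {
                System.out.printf("%s compressed=%d uncompressed=%d codec=%s%n",
                        column.getPath(),
                        column.getTotalSize(),
                        column.getTotalUncompressedSize(),
                        column.getCodec());
            }
        }
    }
}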

rb


[1]:
http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar

On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <ki...@gmail.com>
wrote:

> Hi guys,
>
> We’re evaluating Parquet as the high compression format for our logs. We
> took some ~850Gb of TSV data (some columns are JSON) and Parquet
> (CompressionCodec.GZIP) gave us 6.8x compression whereas plain GZip (with
> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the same data.
>
> So the questions are:
>
> 1) is this somewhat expected compression rate (compared to GZip)?
> 2) As we specially crafted Parquet schema with maps and lists for certain
> fields, is there any tool to show the sizes of individual Parquet columns
> so we can find the biggest ones?
>
> Thanks in advance,
>  Kirill




-- 
Ryan Blue
Software Engineer
Netflix