Posted to dev@parquet.apache.org by Arun Manivannan <ar...@arunma.com> on 2019/02/26 14:14:13 UTC

Parquet binary file size

Hi,

Apologies in advance if this is a silly question.

I was comparing various data formats as an exercise and noticed that the
size of a Parquet output file is huge (2 KB) compared to Thrift (79 bytes),
Protobuf (45 bytes) or Avro (333 bytes). I first thought the issue was with
Spark, so I used a ThriftParquetWriter to write a single record
(AvroParquetWriter yielded the same result too; the tiny difference in bytes
is because of the schema part). The trouble is that I see 5 instances of the
data in the binary (hence the 2 KB size).

Could someone explain or point me to a link that explains why this is the
case?

This is parquet 1.8.2 and I haven't tried any lower version.

*Binary : *
[image: image.png]

*Code (Thrift): *

import java.io.File
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.parquet.thrift.ThriftParquetWriter

def serialize(t: TweetThrift, file: File): Unit = {
  val dataFileWriter = new ThriftParquetWriter[TweetThrift](
    new Path(file.getAbsolutePath),
    classOf[TweetThrift],
    CompressionCodecName.UNCOMPRESSED)
  dataFileWriter.write(t)
  dataFileWriter.close()
}

val tweet = new TweetThrift(1, 123, "Saturday 8th, June", "arunma")
tweet.setText("First tweet")
ParquetSerDe.serialize(tweet, file)


*Code (Avro): *

val file = new File("serialized_parquet_file.parquet")
val tweet = TweetAvro
  .newBuilder
  .setTarget(1)
  .setId(123)
  .setDate("Saturday 8th, June")
  .setUser("arunma")
  .setText("Parquet tweet")
  .build()

ParquetSerDe.serialize(tweet, TweetAvro.SCHEMA$, file)
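
ParquetSerDe.serialize itself isn't shown here; a minimal sketch of what the
Avro variant could look like (assuming parquet-avro 1.8.x; this is a guess
at the helper, not the actual code):

// Hypothetical sketch of an Avro-side ParquetSerDe helper.
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.generic.IndexedRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

object ParquetSerDe {
  def serialize[T <: IndexedRecord](record: T, schema: Schema, file: File): Unit = {
    // Even for a single record, the writer still emits row group, page and
    // footer metadata, which is where the extra bytes come from.
    val writer = AvroParquetWriter
      .builder[T](new Path(file.getAbsolutePath))
      .withSchema(schema)
      .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
      .build()
    writer.write(record)
    writer.close()
  }
}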



Cheers,

Arun

Re: Parquet binary file size

Posted by Wes McKinney <we...@gmail.com>.
hi Arun,

Parquet isn't designed for efficient transport of small bits of data, e.g.
1 record at a time. It's designed to compactly store large analytics
datasets, where dictionary encoding, run-length encoding, and compression
are effective at reducing space.

In your example, there are several additional pieces of data stored:

* File footer metadata
* Row group and column chunk metadata, including "column" statistics (min
and max values)
* Data page headers, one for each "column"

If you compared Avro- or Protobuf-based storage of, say, a 1-million-record
dataset written as a single file, I would bet that Parquet would be smaller
(possibly significantly smaller, 5-10x or more) in typical use cases.
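
You can see those extra pieces for yourself by reading the footer back; a
minimal sketch, assuming parquet-hadoop 1.8.x on the classpath and the file
name from the earlier example:

// Inspect the footer metadata that accounts for most of the ~2 KB.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("serialized_parquet_file.parquet"))
for (block <- footer.getBlocks.asScala) {
  println(s"row group: rows=${block.getRowCount}, data bytes=${block.getTotalByteSize}")
  for (col <- block.getColumns.asScala) {
    // Each column chunk carries its own encodings, min/max statistics and page headers.
    println(s"  ${col.getPath}: size=${col.getTotalSize}, stats=${col.getStatistics}")
  }
}

For a single record this fixed metadata dominates the file size; it
amortises once a file holds many rows.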

- Wes

Re: Parquet binary file size

Posted by Arun Manivannan <ar...@arunma.com>.
Thanks a lot, Wes. That makes it very clear.

Just to convince myself, I also ran parquet-tools dump for a single column,
"user", and could cross-check the six instances I see in the binary against
the dump. The first pair is the row-group-level stats (min/max), the second
pair is the page-level stats, and finally there is the pair of actual values.

The data used was:

val tweet1 = TweetAvro
  .newBuilder
  .setTarget(1)
  .setId(123)
  .setDate("Saturday 8th, June")
  .setUser("nus1")
  .setText("Parquet tweet1")
  .build()

val tweet2 = TweetAvro
  .newBuilder
  .setTarget(2)
  .setId(234)
  .setDate("Sunday 9th, June")
  .setUser("nus2")
  .setText("Parquet tweet2")
  .build()


*parquet-tools dump -c user -n serialized_parquet_file.parquet*

row group 0
--------------------------------------------------------------------------------
user:  BINARY UNCOMPRESSED DO:0 FPO:235 SZ:49/49/1.00 VC:2 ENC:PLAIN,BIT_PACKED ST:[*min: nus1, max: nus2*, num_nulls: 0]

    user TV=2 RL=0 DL=0
    ----------------------------------------------------------------------------
    page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[*min: nus1, max: nus2*, num_nulls: 0] SZ:16 VC:2

BINARY user
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:0 *V:nus1*
value 2: R:0 D:0 *V:nus2*



Thanks a ton, again.

Cheers,
Arun


