Posted to dev@parquet.apache.org by Arun Manivannan <ar...@arunma.com> on 2019/02/26 14:14:13 UTC
Parquet binary file size
Hi,
Apologies in advance if this is a silly question.
I was comparing various data formats as an exercise and noticed that the
size of a Parquet output file is huge (2 KB) compared to Thrift (79 bytes),
Protocol Buffers (45 bytes) or Avro (333 bytes). I thought the issue was
with Spark, so I used a ThriftParquetWriter to write a single record
(AvroParquetWriter yielded the same result too; the tiny difference in
bytes comes from the schema part). The trouble is that I see 5 instances
of the data in the binary (hence the 2 KB size).
Could someone explain, or point me to a link that explains, why this is
the case?
This is Parquet 1.8.2 and I haven't tried any earlier version.
*Binary:*
[image: image.png]
*Code (Thrift):*

import java.io.File
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.parquet.hadoop.thrift.ThriftParquetWriter

def serialize(t: TweetThrift, file: File): Unit = {
  // Write a single Thrift record to an uncompressed Parquet file
  val dataFileWriter = new ThriftParquetWriter[TweetThrift](
    new Path(file.getAbsolutePath),
    classOf[TweetThrift],
    CompressionCodecName.UNCOMPRESSED)
  dataFileWriter.write(t)
  dataFileWriter.close()
}

val file = new File("serialized_parquet_file.parquet") // not declared in the original snippet; name taken from the Avro example
val tweet = new TweetThrift(1, 123, "Saturday 8th, June", "arunma")
tweet.setText("First tweet")
ParquetSerDe.serialize(tweet, file)
*Code (Avro):*

val file = new File("serialized_parquet_file.parquet")
val tweet = TweetAvro.newBuilder
  .setTarget(1)
  .setId(123)
  .setDate("Saturday 8th, June")
  .setUser("arunma")
  .setText("Parquet tweet")
  .build()
ParquetSerDe.serialize(tweet, TweetAvro.SCHEMA$, file)
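(The ParquetSerDe.serialize helper is never shown in the thread; a minimal
sketch of what such a helper could look like for the Avro case, using
AvroParquetWriter's builder from parquet-avro 1.8.x. The object name and
signature here are assumptions, not code from the original post.)

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.generic.IndexedRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

object ParquetSerDe {
  // Hypothetical reconstruction: the thread never shows this helper's body.
  def serialize[T <: IndexedRecord](record: T, schema: Schema, file: File): Unit = {
    val writer = AvroParquetWriter.builder[T](new Path(file.getAbsolutePath))
      .withSchema(schema)
      .build()
    writer.write(record)
    writer.close()
  }
}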
Cheers,
Arun
Re: Parquet binary file size
Posted by Wes McKinney <we...@gmail.com>.
hi Arun,
Parquet isn't designed for efficient transport of small bits of data, e.g.
1 record at a time. It's designed to compactly store large analytics
datasets, where dictionary encoding, run-length encoding, and compression
are effective at reducing space.
In your example, there are several additional pieces of data stored:
* File footer metadata
* Row group and column chunk metadata, including per-column statistics
(min and max values)
* Data page headers, one for each column chunk

That metadata can be inspected directly, as the sketch below shows.
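(A minimal sketch of that inspection, assuming parquet-mr 1.8.x and the
file name from the original post; ParquetFileReader.readFooter is
deprecated in later releases but available here.)

import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader

// Read only the footer: schema plus row group / column chunk metadata
val footer = ParquetFileReader.readFooter(
  new Configuration(),
  new Path("serialized_parquet_file.parquet"),
  ParquetMetadataConverter.NO_FILTER)

footer.getBlocks.asScala.zipWithIndex.foreach { case (rowGroup, i) =>
  println(s"row group $i: ${rowGroup.getRowCount} rows")
  rowGroup.getColumns.asScala.foreach { col =>
    // Each column chunk carries its own encodings, sizes, and min/max stats
    println(s"  ${col.getPath}: ${col.getTotalSize} bytes, stats=${col.getStatistics}")
  }
}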
If you compared Avro- or Protobuf-based storage of, e.g., a 1 million
record dataset as a single file, I would bet that the Parquet file would
be smaller (possibly significantly smaller, 5-10x or more) in typical use
cases. A rough way to test that is sketched below.
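(A rough sketch of that comparison, assuming the generated TweetAvro class
from the original post; the record count, field values, and file names are
made up for illustration.)

import java.io.File
import org.apache.avro.file.DataFileWriter
import org.apache.avro.specific.SpecificDatumWriter
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

val n = 1000000

// Plain Avro container file
val avroFile = new File("tweets.avro")
val avroWriter = new DataFileWriter[TweetAvro](
  new SpecificDatumWriter[TweetAvro](classOf[TweetAvro]))
avroWriter.create(TweetAvro.SCHEMA$, avroFile)

// Parquet file written from the same records
val parquetFile = new File("tweets.parquet")
val parquetWriter = AvroParquetWriter.builder[TweetAvro](new Path(parquetFile.getAbsolutePath))
  .withSchema(TweetAvro.SCHEMA$)
  .build()

(1 to n).foreach { i =>
  val tweet = TweetAvro.newBuilder
    .setTarget(i % 3)              // low-cardinality: run-length friendly
    .setId(i)
    .setDate("Saturday 8th, June") // constant column compresses to almost nothing
    .setUser(s"user${i % 100}")    // repeated values benefit from dictionary encoding
    .setText(s"tweet number $i")
    .build()
  avroWriter.append(tweet)
  parquetWriter.write(tweet)
}
avroWriter.close()
parquetWriter.close()

println(s"avro:    ${avroFile.length()} bytes")
println(s"parquet: ${parquetFile.length()} bytes")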
- Wes
Re: Parquet binary file size
Posted by Arun Manivannan <ar...@arunma.com>.
Thanks a lot, Wes. That makes it very clear.
Just to convince myself, I also ran parquet-tools dump for a single
column, "user", and could cross-check the six instances between the binary
and the dump: the first pair is the row-group-level stats, the second pair
is the page-level stats, and the final pair is the actual values.
The data used was:

val tweet1 = TweetAvro.newBuilder
  .setTarget(1)
  .setId(123)
  .setDate("Saturday 8th, June")
  .setUser("nus1")
  .setText("Parquet tweet1")
  .build()

val tweet2 = TweetAvro.newBuilder
  .setTarget(2)
  .setId(234)
  .setDate("Sunday 9th, June")
  .setUser("nus2")
  .setText("Parquet tweet2")
  .build()
*parquet-tools dump -c user -n serialized_parquet_file.parquet*

row group 0
--------------------------------------------------------------------------------
user:  BINARY UNCOMPRESSED DO:0 FPO:235 SZ:49/49/1.00 VC:2 ENC:PLAIN,BIT_PACKED ST:[*min: nus1, max: nus2*, num_nulls: 0]

    user TV=2 RL=0 DL=0
    ----------------------------------------------------------------------------
    page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[*min: nus1, max: nus2,* num_nulls: 0] SZ:16 VC:2

BINARY user
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:0 *V:nus1*
value 2: R:0 D:0 *V:nus2*
Thanks a ton, again.
Cheers,
Arun