Posted to user@orc.apache.org by Ryan Schachte <co...@gmail.com> on 2021/11/09 16:34:50 UTC

Avro vs ORC in Spark

Hi everyone, I'm looking for a better understanding of ORC compared to Avro
when leveraging a big data compute engine like Spark.

If I have a 100GB dataset in Avro and the same dataset in ORC consumes
only 10GB, would the ORC dataset be more performant and consume less
memory than its Avro counterpart?

My initial assumption was no, because the data would be deserialized in
both cases and I'm consuming the entire dataset either way, but I wanted
to have the conversation to see if I'm thinking about that correctly.

Cheers,
Ryan S.

Re: Avro vs ORC in Spark

Posted by Alan Gates <al...@gmail.com>.
As Dongjoon says, it depends on the use case.  ORC is column-oriented, so
it only needs to read the columns you request.  This often saves a
significant amount of I/O.  It also uses run-length and dictionary
encoding for many column types, so again it will read less data from
storage.  However, if you're reading every column of a wide record,
row-oriented storage like Avro can be better because the cost of
stitching all the columns together to rebuild a wide row is high.

Alan.
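The layout difference Alan describes can be sketched with a small toy in
plain Python (illustrative only, not the real ORC/Avro readers or their
actual encodings): count how many bytes must be scanned to answer a
single-column query under each layout.

```python
# Toy model of row-oriented vs column-oriented storage.
# A "file" is just a list of serialized byte chunks; we count how many
# bytes a reader must scan to fetch only the 'age' column.

rows = [{"name": f"user{i}", "age": i % 90, "city": "Austin"}
        for i in range(1000)]

# Row-oriented (Avro-like): whole records are stored one after another,
# so reading one column still scans every full record.
row_chunks = [repr(r).encode() for r in rows]
row_bytes_scanned = sum(len(c) for c in row_chunks)

# Column-oriented (ORC-like): each column is stored as its own
# contiguous stream, so a reader fetches only the requested stream.
col_chunks = {
    col: repr([r[col] for r in rows]).encode()
    for col in ("name", "age", "city")
}
col_bytes_scanned = len(col_chunks["age"])  # only the 'age' stream

print(row_bytes_scanned, col_bytes_scanned)
```

The column-oriented read scans a fraction of the bytes; real ORC shrinks
the per-column streams further with the run-length and dictionary
encodings mentioned above.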

Re: Avro vs ORC in Spark

Posted by Ryan Schachte <co...@gmail.com>.
Thanks Dongjoon,
Just speaking hypothetically. I'm more curious whether there are
performance gains in reading ORC data into a DataFrame compared to Avro.
Would it operate any faster due to the compression, etc.?

Re: Avro vs ORC in Spark

Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, Ryan.

I don't think you have one 100GB Avro file in production. :)
If you have one million 1MB or one thousand 1GB Avro files, it becomes a
completely different story.

Most big data compute engines like Spark/Hive/Trino/Impala support both
formats because the use cases differ. I'd recommend simply testing both
in your use case. :)

BTW, ORC has more advanced features, like encryption and bloom filters,
that Avro doesn't.

Dongjoon.
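The bloom-filter feature Dongjoon mentions is worth a sketch: ORC can
store a bloom filter per stripe so a point lookup can skip stripes that
definitely don't contain the value. Below is a simplified pure-Python
model of that idea (ORC's real implementation uses Murmur3-based hashing
and tuned bit sizes; the class and parameters here are illustrative
only).

```python
import hashlib

class Bloom:
    """Toy bloom filter: k hash positions set in a fixed-size bitset."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.bitset = bits, hashes, 0

    def _positions(self, value):
        # Derive k positions from a keyed hash of the value.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, value):
        for p in self._positions(value):
            self.bitset |= 1 << p

    def might_contain(self, value):
        # No false negatives; rare false positives are possible.
        return all(self.bitset >> p & 1 for p in self._positions(value))

# Two "stripes" of a string column, each with its own bloom filter.
stripes = [["alice", "bob"], ["carol", "dave"]]
filters = []
for stripe in stripes:
    bf = Bloom()
    for v in stripe:
        bf.add(v)
    filters.append(bf)

# A point lookup for "dave" only reads stripes whose filter matches,
# so stripe 0 is (almost certainly) skipped.
to_read = [i for i, bf in enumerate(filters) if bf.might_contain("dave")]
print(to_read)
```

Avro has no comparable per-block index, so an equivalent lookup scans
every block.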

