Posted to user@hive.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2016/03/06 16:26:23 UTC

Parquet versus ORC

Hi.

I have been hearing a fair bit about Parquet versus ORC tables.

In a nutshell I can say that Parquet is a predecessor to ORC (both
provide columnar storage), but I notice that it is still widely used,
especially among Spark users.

It also appears that Spark users are reluctant to use ORC despite the
fact that, with its built-in storage index, it offers superior
optimisation, keeping data and statistics at file, stripe and row-group
level. Both Parquet and ORC offer SNAPPY compression; ORC uses ZLIB by
default.
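
As a minimal illustration (table and column names below are invented,
and the parquet.compression table property may depend on the Hive
version; otherwise it can be set as a session parameter), the codec can
be chosen per table in the DDL:

-- ORC table, overriding the ZLIB default with SNAPPY
CREATE TABLE sales_orc (
  id     INT,
  amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Parquet table with SNAPPY compression
CREATE TABLE sales_parquet (
  id     INT,
  amount DOUBLE
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY");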

There may be reasons other than technical ones for this adoption, for
example too much reliance on Hive, plus the claim that it is easier to
flatten Parquet than ORC (whatever that means).

I myself use either text files or ORC with Hive and Spark, and I don't
really see any reason to adopt other formats such as Avro or Parquet.

I would appreciate any verification or experience on this.

Thanks,

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Re: Parquet versus ORC

Posted by Marcin Tustin <mt...@handybook.com>.
If you google, you'll find benchmarks showing each to be faster than
the other. Insofar as there is any reality to which is faster in any
given comparison, it seems to be the result of each incorporating ideas
from the other, or at least going through development cycles to beat
the other.

ORC is very fast for working with Hive, and we use it at Handy. That
said, the broader support for Parquet might enable things like
performing your own insertions into tables by dropping new files into
the table location, or doing your own concatenation and cleanup.
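
A rough sketch of that pattern, assuming an external Parquet table over
a directory you manage yourself (names and paths below are made up):

-- external table over a self-managed HDFS directory
CREATE EXTERNAL TABLE events_parquet (
  event_id   BIGINT,
  event_type STRING
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET
LOCATION 'hdfs:///data/events_parquet';

-- after copying new partition directories/files into that location
MSCK REPAIR TABLE events_parquet;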

In summary, until you benchmark your own usage, I'd assume performance
is the same. If you're not going to benchmark, go by what's likely to
be most convenient.
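
If you do benchmark, one minimal pattern (the source table, columns and
query below are placeholders) is to materialise the same data in both
formats and time an identical query against each:

-- same source data written out in both formats
CREATE TABLE trips_orc     STORED AS ORC     AS SELECT * FROM trips_source;
CREATE TABLE trips_parquet STORED AS PARQUET AS SELECT * FROM trips_source;

-- run the same representative query against each and compare timings
SELECT vendor_id, COUNT(*) FROM trips_orc
WHERE trip_distance > 10 GROUP BY vendor_id;
SELECT vendor_id, COUNT(*) FROM trips_parquet
WHERE trip_distance > 10 GROUP BY vendor_id;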



Re: Parquet versus ORC

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

Thanks for that link.

It appears that the main advantage of Parquet is stated as follows, and
I quote:

"Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
data processing frameworks, and we are not interested in playing favorites.
We believe that an efficient, well-implemented columnar storage substrate
should be useful to all frameworks without the cost of extensive and
difficult to set up dependencies."

Fair enough, Parquet provides a columnar format and compression. As I
stated, I do not know much about it. However, my understanding of ORC
is that it provides better encoding of data, predicate push-down for
some predicates, plus support for ACID properties.
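
As a sketch of the ACID side (the table name and bucketing below are
arbitrary, and the exact settings depend on the Hive version and
transaction manager configuration), a transactional table has to be
bucketed and stored as ORC:

-- session settings typically required for ACID tables
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- transactional (ACID) table: ORC storage is required
CREATE TABLE accounts_acid (
  account_id INT,
  balance    DOUBLE
)
CLUSTERED BY (account_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- row-level operations then become available
UPDATE accounts_acid SET balance = balance + 10 WHERE account_id = 1;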

As Alan Gates stated before (Hive user forum, "Difference between ORC
and RC files", 21 Dec 15), and I quote:

"Whether ORC is the best format for what you're doing depends on the data
you're storing and how you are querying it.  If you are storing data where
you know the schema and you are doing analytic type queries it's the best
choice (in fairness, some would dispute this and choose Parquet, though
much of what I said above (about ORC vs RC applies to Parquet as well).  If
you are doing queries that select the whole row each time columnar formats
like ORC won't be your friend.  Also, if you are storing self structured
data such as JSON or Avro you may find text or Avro storage to be a better
format.

So what would be the main advantage(s) of Parquet over ORC, besides
queries that select the whole row (much like a row-based relational
database does)?


Cheers.


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Re: Parquet versus ORC

Posted by Uli Bethke <ul...@sonra.io>.
Curious why you think that Parquet does not have metadata at file, row
group or column level.
Please refer to the type of metadata that Parquet supports in the docs:
http://parquet.apache.org/documentation/latest/



-- 
___________________________
Uli Bethke
Chair Hadoop User Group Ireland
www.hugireland.org
HUG Ireland is community sponsor of Hadoop Summit Europe in Dublin
http://2016.hadoopsummit.org/dublin/