Posted to user@hive.apache.org by Ashok Kumar <as...@yahoo.com> on 2015/12/21 18:45:41 UTC

Difference between ORC and RC files

Hi Gurus,
I am trying to understand the advantages that the ORC file format offers over RC.
I have read the existing documents but I still don't seem to grasp the main differences.
Can someone explain to me, as a user, where ORC scores when compared to RC? What I would mainly like to know about is performance. I am also aware that ORC does some smart compression as well.
Finally, is the ORC file format the best choice in Hive?
Thank you


RE: Difference between ORC and RC files

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi Ashok,

I believe Alan has already explained the major differences between RC and ORC files.

The important thing to note is that the ORC file format is a successor to the RC file format, and I am not sure there is really a case for using RC files.

Much like data warehouses (DW) that use a columnar implementation of the relational model (a good example is Sybase IQ), ORC is optimized for columnar use. A product like Sybase IQ relies on column store technology that allows efficient compression (all data within each column file has the same type, making it ideal for compression) and ad-hoc analysis without additional tuning. In effect, every column in column storage technology is a potential index. To this end, ORC can use internal indexes (building an index on a column only requires reading that column’s data, not the complete table data) to speed up queries where needed. In addition, the default column group size for ORC is 250MB compared to 64MB for RC. That makes for faster sequential block reads from disk (and hence fewer files to read and less demand on the NameNode). Some built-in statistics such as min, max and count are also stored in ORC files, which gives the optimizer a head start compared to RC files.
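
To make the point about built-in statistics a bit more concrete, here is a minimal Python sketch. It is not ORC's actual on-disk format or API: it assumes a single string column split into fixed-size groups of rows, and the names GroupStats, build_stats and groups_to_read are made up for illustration. Each group keeps min, max and count, and a reader uses them to decide which groups could possibly contain a value.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class GroupStats:
    """Min/max/count kept for one group of rows of a single column (hypothetical)."""
    minimum: str
    maximum: str
    count: int


def build_stats(values: Sequence[str], group_size: int) -> List[GroupStats]:
    """Split a column into fixed-size groups and record stats for each group."""
    stats = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        stats.append(GroupStats(min(group), max(group), len(group)))
    return stats


def groups_to_read(stats: Sequence[GroupStats], wanted: str) -> List[int]:
    """Return the indexes of groups whose [min, max] range could contain `wanted`."""
    return [i for i, s in enumerate(stats) if s.minimum <= wanted <= s.maximum]


# Illustrative data only: one "state" column, three rows per group.
state = ["ak", "al", "az", "ca", "ca", "co", "ny", "tx", "wa"]
stats = build_stats(state, group_size=3)

print(groups_to_read(stats, "ca"))    # groups that may hold 'ca'
print(sum(s.count for s in stats))    # row count taken from the stats alone
```

With this sample data only the middle group has to be read to look for 'ca', and the total row count comes from the stored counts without touching the column values.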

I believe if you want to implement some form of DW schema or de-normalized DW schema then ORC files will be the choice. On the other hand, as Alan pointed out, row-type queries on ORC files are an expensive operation because they require full row reconstruction, so ORC is not suitable for that sort of operation.
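
A rough way to picture this trade-off is the Python sketch below. It uses a made-up three-column table held column by column; the names columns, sum_column and fetch_row are illustrative only and are not part of any Hive or ORC API. An aggregate over one column touches a single list, while fetching one full row has to visit every column.

```python
from typing import Dict, List

# A tiny table stored column by column: each column lives in its own list.
# The data is invented for the example.
columns: Dict[str, List] = {
    "id": [1, 2, 3],
    "state": ["ca", "ny", "tx"],
    "amount": [10.0, 25.5, 7.25],
}


def sum_column(name: str) -> float:
    """Analytic-style query: reads a single column only."""
    return sum(columns[name])


def fetch_row(index: int) -> dict:
    """Row-style query: must visit every column to rebuild one row."""
    return {name: values[index] for name, values in columns.items()}


print(sum_column("amount"))
print(fetch_row(1))
```

The wider the table, the more columns fetch_row has to visit, which is the full row construction cost mentioned above.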

HTH

Mich Talebzadeh

 


Re: Difference between ORC and RC files

Posted by Ashok Kumar <as...@yahoo.com>.
Many thanks, Sir. Very useful.
Kindly elaborate on why RC files do not have these capabilities. As I see them, they are Row Columnar files. Am I correct to assume that an ORC file is basically an RC file with more optimisation?
Are RC and ORC files designed for a columnar format, similar to the way a columnar data warehouse is built?
Regards


Re: Difference between ORC and RC files

Posted by Alan Gates <al...@gmail.com>.
ORC offers a number of features not available in RC files:
* Better encoding of data.  Integer values are run-length encoded.
Strings and dates are stored in a dictionary (and the resulting pointers
are then run-length encoded).
* Internal indexes and statistics on the data.  This allows for more
efficient reading of the data as well as skipping of sections of the
data not relevant to a given query.  These indexes can also be used by
the Hive optimizer to help plan query execution.
* Predicate pushdown for some predicates.  For example, in the query
"select * from user where state = 'ca'", ORC could look at a collection
of rows and use the indexes to see that no rows in that group have that
value, and thus skip the group altogether.
* Tight integration with Hive's vectorized execution, which produces
much faster processing of rows.
* Support for new ACID features in Hive (transactional insert, update,
and delete).
* Much faster read times than RCFile, and much more efficient
compression.
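
As a rough illustration of the encoding bullet above, here is a small Python
sketch of run-length and dictionary encoding.  It is a simplification, not
ORC's real encoder (which is more elaborate), and the function names and
sample data are made up for the example.

```python
from itertools import groupby
from typing import List, Tuple


def run_length_encode(values: List) -> List[Tuple[object, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string by its position in a sorted dictionary of distinct values."""
    dictionary = sorted(set(values))
    position = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [position[v] for v in values]


# Illustrative column: a few repeated state codes.
state = ["ca", "ca", "ca", "ny", "ny", "tx"]
dictionary, pointers = dictionary_encode(state)

print(dictionary)                    # ['ca', 'ny', 'tx']
print(pointers)                      # [0, 0, 0, 1, 1, 2]
print(run_length_encode(pointers))   # [(0, 3), (1, 2), (2, 1)]
```

Six strings end up as a three-entry dictionary plus three (pointer, run length)
pairs, which is why repetitive columns shrink so well under this kind of
encoding.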

Whether ORC is the best format for what you're doing depends on the data
you're storing and how you are querying it.  If you are storing data
where you know the schema and you are doing analytic-type queries, it's
the best choice (in fairness, some would dispute this and choose
Parquet, though much of what I said above about ORC vs RC applies to
Parquet as well).  If you are doing queries that select the whole row
each time, columnar formats like ORC won't be your friend.  Also, if you
are storing self-structured data such as JSON or Avro, you may find text
or Avro storage to be a better format.

Alan.


