Posted to user@hive.apache.org by Mich Talebzadeh <mi...@peridale.co.uk> on 2015/04/19 21:32:33 UTC

Orc file and Hive Optimiser

My understanding is that the Optimized Row Columnar (ORC) file format
provides a highly efficient way to store Hive data. 

 

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

 

 

In a nutshell, columnar storage allows quite efficient compression of
columns, on a par with what data warehouse databases such as Sybase IQ
provide. In short, if a normal Hive table is a "row-based implementation
of the relational model", then ORC is the equivalent "column-based
implementation of the relational model".
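The compression point can be demonstrated without Hive at all. Below is a toy Python/zlib sketch (my own illustration, not ORC's actual encoding): the same small table is serialized row by row and column by column, and the compressed sizes are compared.

```python
import zlib

# Toy illustration only -- not ORC's real encoding. Low-cardinality
# columns (a currency code, a city) become long runs in the columnar
# layout, which the compressor exploits far better than the row layout,
# where those values are interleaved with the changing id column.
rows = [(i, "GBP", "LONDON") for i in range(10_000)]

# Row-oriented layout: one CSV line per row.
row_wise = "\n".join(",".join(map(str, r)) for r in rows).encode()

# Column-oriented layout: one line per column, same bytes of data overall.
col_wise = "\n".join(
    ",".join(str(r[c]) for r in rows) for c in range(3)
).encode()

comp_row = len(zlib.compress(row_wise))
comp_col = len(zlib.compress(col_wise))
print(f"raw: {len(row_wise)}  row-wise: {comp_row}  column-wise: {comp_col}")
```

Running this shows the column-wise serialization compressing noticeably smaller than the row-wise one, for exactly the reason above: the constant columns collapse to almost nothing once they are stored contiguously.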

 

I find the ORC file format quite interesting, as it performs more
efficiently than other Hive file formats (try testing it). My only
question is whether Hive's Cost Based Optimiser (CBO) is aware of the
ORC storage format and treats the table accordingly?

 

Finally, this is more of a speculative question: if we have ORC files that
provide good functionality, is there any reason why one should deploy a
columnar database such as HBase or Cassandra if Hive can do the job as well?

 

Thanks,

 

 

Mich Talebzadeh

 

http://talebzadehmich.wordpress.com

 

Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4

Publications due shortly:

Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
Coherence Cache

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Ltd, its
subsidiaries or their employees, unless expressly so stated. It is the
responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.

 


Using Hive as a file comparison and grep-ping tool

Posted by Sanjay Subramanian <sa...@yahoo.com>.
hey guys
As data wranglers and programmers we often need quick tools. One such tool I need almost every day is one that greps a file based on the contents of another file. One could write this in Perl or Python, but since I am already using the Hadoop ecosystem extensively, I said why not do this in Hive?
Perhaps you guys already know this and have better solutions... nevertheless :-) here goes...

Best regards
sanjay(Hive super-fan)
I just posted this on my blog: https://bigdatalatte.wordpress.com/2015/04/20/using-hive-as-a-file-comparison-and-grep-ping-tool/
In case the blog URL does not work for any reason, here is the logic.
Using Hive as a file comparison and grep-ping tool
==================================================

1. Log on to the Linux terminal you run Hive queries from.

2. Create a database called "myutils" in Hive, then create two Hive tables,
   myutils.file1 and myutils.file2:
   - each table has a partition column called "fn" (short for "filename")
   - each table has just one data column called "ln" (short for "line")

   An easy script to do that:

   for r in 1 2 ; do hive -e "CREATE DATABASE IF NOT EXISTS myutils; USE myutils; DROP TABLE IF EXISTS file${r}; CREATE EXTERNAL TABLE IF NOT EXISTS file${r}(ln STRING) PARTITIONED BY (fn STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"; done

3. Create a permanent base location folder in HDFS:

   hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/
   hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/

USECASE 1
=========
Search whether a bunch of IP addresses exists in another file containing a (larger) bunch of IPs.

[1] The two input files:

    registeredIPs.txt
    10.456.34.90
    123.675.654.1
    21.87.657.456
    234.109.34.234

    visitorIPs.txt
    10.456.34.90
    12.367.54.23
    218.7.657.456
    23.4.109.3

[2] Goal: output which IPs in file1 are present in file2.

[3] Put each file in a separate HDFS location:

    hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/registeredIPs.txt
    hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/visitorIPs.txt
    hdfs dfs -put registeredIPs.txt /workspace/myutils/filecomparator/file1/registeredIPs.txt
    hdfs dfs -put visitorIPs.txt /workspace/myutils/filecomparator/file2/visitorIPs.txt

[4] Add a partition to each table. For simplicity, keep the partition names
    identical to the file names themselves:

    hive -e "USE myutils; ALTER TABLE file1 ADD PARTITION(fn='registeredIPs.txt') LOCATION '/workspace/myutils/filecomparator/file1/registeredIPs.txt'"
    hive -e "USE myutils; ALTER TABLE file2 ADD PARTITION(fn='visitorIPs.txt') LOCATION '/workspace/myutils/filecomparator/file2/visitorIPs.txt'"

[5] Check that the partitions can be accessed by Hive:

    # This should give you the same answer as: wc -l registeredIPs.txt
    hive -e "SELECT COUNT(*) FROM myutils.file1 WHERE fn='registeredIPs.txt'"

    # This should give you the same answer as: wc -l visitorIPs.txt
    hive -e "SELECT COUNT(*) FROM myutils.file2 WHERE fn='visitorIPs.txt'"

[6] List the IPs in registeredIPs.txt that are also in visitorIPs.txt.

    # This dumps the result to the local file system
    hive -e "SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)" > ./registered_in_visitors_list.txt

    # This dumps the result to a new Hive-managed table. Make sure you do
    # not already have a valuable table called
    # "myutils.registered_in_visitors_list"; otherwise it will be replaced
    # by the results of this query.
    hive -e "USE myutils; DROP TABLE IF EXISTS registered_in_visitors_list; CREATE TABLE IF NOT EXISTS registered_in_visitors_list AS SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)"

    # This dumps the result to a directory on HDFS. Make sure you do not
    # already have a valuable directory called
    # "registered_in_visitors_list"; otherwise that directory and all its
    # contents will be overwritten by the results of this query.
    hive -e "INSERT OVERWRITE DIRECTORY '/workspace/myutils/filecomparator/registered_in_visitors_list' SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)"
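The join in step [6] is just an equality match on trimmed lines. For sanity-checking the Hive output on small inputs, the same logic can be expressed in a few lines of plain Python (a local-filesystem sketch of my own, not part of the recipe; the function name is hypothetical):

```python
def grep_file_against_file(needles_path, haystack_path):
    """Return the trimmed lines of needles_path that also appear in
    haystack_path, mirroring ON trim(f1.ln) = trim(f2.ln) above."""
    with open(haystack_path) as f:
        haystack = {line.strip() for line in f if line.strip()}
    with open(needles_path) as f:
        return [line.strip() for line in f if line.strip() in haystack]
```

One difference worth noting: the Hive JOIN emits one output row per matching pair, so duplicate lines in the second file multiply the output, whereas the set used here collapses them.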


 
  

Re: Orc file and Hive Optimiser

Posted by Alan Gates <al...@gmail.com>.

> Mich Talebzadeh <ma...@peridale.co.uk>
> April 19, 2015 at 12:32
>
> Finally, this is more of a speculative question: if we have ORC files 
> that provide good functionality, is there any reason why one should 
> deploy a columnar database such as HBase or Cassandra if Hive can do 
> the job as well?
>
Yes, there is.  Hive is designed around the assumption that you will be 
doing scans of significant amounts of data, as are most data warehousing 
type solutions.  It doesn't have the right tools to handle efficient 
lookup of single rows or small ranges of rows.  That's what HBase is 
good at.  I don't know Cassandra as well as HBase, but my impression is 
that efficient single-row/small-range lookup is its sweet spot as well; 
it just makes a different consistency/partitioning trade-off than HBase 
does.

This means that Hive with ORC is still a bad fit for transactional or 
front end serving applications.
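The scan-versus-lookup distinction is easy to feel with ordinary in-memory structures. The toy Python comparison below is purely illustrative (neither Hive nor HBase works this way internally): a scan-oriented store must read every row to answer a single-key query, while a keyed store goes straight to the row.

```python
import timeit

# Purely illustrative: a "scan-oriented" store (a list of rows) versus a
# "keyed" store (a dict), timed on a single-row lookup. The list has to
# be walked end to end; the dict jumps straight to the key.
n = 100_000
table = [(k, f"value-{k}") for k in range(n)]   # must be scanned
index = {k: v for k, v in table}                # keyed access

scan_time = timeit.timeit(lambda: [v for k, v in table if k == n - 1],
                          number=10)
point_time = timeit.timeit(lambda: index[n - 1], number=10)
print(f"full scan: {scan_time:.4f}s   point lookup: {point_time:.6f}s")
```

The gap is several orders of magnitude even at this tiny scale, which is the shape of Alan's point: an engine built around scans (Hive/ORC) and an engine built around keyed access (HBase) are optimized for different questions.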

Alan.

RE: Orc file and Hive Optimiser

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks John for the link

 

Mich Talebzadeh

 


 

From: John Pullokkaran [mailto:jpullokkaran@hortonworks.com] 
Sent: 19 April 2015 20:53
To: user@hive.apache.org
Subject: Re: Orc file and Hive Optimiser

 

If you wish to contribute to CBO, there is a CBO branch on which current dev
work is being done.

Current dev work is captured by HIVE-9132
<https://issues.apache.org/jira/browse/HIVE-9132>.

 

Looking forward to your contributions.

 

Thanks

John

 


 


Re: Orc file and Hive Optimiser

Posted by John Pullokkaran <jp...@hortonworks.com>.
If you wish to contribute to CBO, there is a CBO branch on which current dev work is being done.
Current dev work is captured by HIVE-9132<https://issues.apache.org/jira/browse/HIVE-9132>.

Looking forward to your contributions.

Thanks
John



RE: Orc file and Hive Optimiser

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks John,

 

I have already registered my interest in development work for Hive, so
hopefully I may be able to contribute at some level.

 

Regards,

 

 

Mich Talebzadeh

 


 

From: John Pullokkaran [mailto:jpullokkaran@hortonworks.com] 
Sent: 19 April 2015 20:37
To: user@hive.apache.org
Subject: Re: Orc file and Hive Optimiser

 

ORC format is transparent to CBO.

Currently we are working on a new cost model which might reflect ORC's
performance advantages in optimization decisions.

 

Thanks

John

 


 


Re: Orc file and Hive Optimiser

Posted by John Pullokkaran <jp...@hortonworks.com>.
ORC format is transparent to CBO.
Currently we are working on a new cost model which might reflect ORC's performance advantages in optimization decisions.

Thanks
John
