Posted to hdfs-user@hadoop.apache.org by Matthieu Labour <ma...@actionx.com> on 2012/09/27 03:04:07 UTC

Advice on Migrating to hadoop + hive

Hi
I have posted in this user group before and received great help. Thank you!
I am hoping to get some advice on the following Hive/Hadoop question as well:
The way we currently process our log files is the following: we collect log
files, then run a program via a cron job that processes/consolidates them and
inserts rows into a PostgreSQL database. Analysts connect to the database,
perform SQL queries, and generate Excel reports. Our logs are growing, and the
process of getting the data into the database is becoming too slow.
We are thinking of leveraging Hadoop, and my questions are the following:
Should we use Hadoop to insert into PostgreSQL, or can we get rid of
PostgreSQL and rely on Hive only?
If we use Hive, can we persist the Hive table so we only load the data (run
the Hadoop job) once?
Can we insert into an existing Hive table and add a day of data without the
need to reprocess all previous days' files?
Are there Hive visual tools (similar to Postgres Maestro) that would make
it easier for analysts to build/run queries? (Ideally they would need to
work with Amazon AWS.)
Thank you for your help
Cheers
Matthieu

Re: Advice on Migrating to hadoop + hive

Posted by Bejoy Ks <be...@gmail.com>.
Hi Matthieu

Adding on to Michael's comments.

Hive is good for batch processing and generating reports over large data
sets. It is not meant for point queries; if you have many of those,
then Hive is not the right choice.

You can get your daily data processed in Hadoop and load it into Hive
tables. Hive has a new feature, 'INSERT INTO', for adding data to
existing tables/partitions. In your case, you can create a partitioned
table based on date and load each day's processed data into the corresponding
date partition. Partitions give you an advantage: if you issue a query on a
specific date or dates, only those partitions will be scanned rather
than the whole table.
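
As a minimal sketch of what that could look like (the table, column, and
path names here are just illustrative, not from your setup):

    -- Hypothetical date-partitioned table for the processed log data.
    CREATE TABLE daily_logs (
      user_id STRING,
      action  STRING,
      ts      BIGINT
    )
    PARTITIONED BY (log_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- Load one day's processed output into its own partition; earlier
    -- partitions stay untouched, so past days are never reprocessed.
    LOAD DATA INPATH '/processed/2012-09-26'
    INTO TABLE daily_logs
    PARTITION (log_date = '2012-09-26');

    -- Or append rows to an existing partition with INSERT INTO (Hive 0.8+),
    -- assuming a staging table of freshly processed rows.
    INSERT INTO TABLE daily_logs PARTITION (log_date = '2012-09-26')
    SELECT user_id, action, ts FROM staging_logs;

    -- A query that filters on the partition column scans only that partition.
    SELECT action, COUNT(*)
    FROM daily_logs
    WHERE log_date = '2012-09-26'
    GROUP BY action;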

Tableau, MicroStrategy, Pentaho, etc. support reporting on top of Hive
tables.
If you are looking at some static, predefined reports, you can do the
aggregation in Hive, move the final aggregated results to any RDBMS using
Sqoop, and connect any reporting tool of your choice to that.
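
For example, a Sqoop export along these lines (the connection string, table
name, and paths are illustrative only):

    # Export an aggregated Hive table back to PostgreSQL for reporting.
    # Hive's default field delimiter is Ctrl-A ('\001').
    sqoop export \
      --connect jdbc:postgresql://dbhost/reports \
      --username analyst \
      --table daily_summary \
      --export-dir /user/hive/warehouse/daily_summary \
      --input-fields-terminated-by '\001'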

Some URLs for reference:
https://cwiki.apache.org/Hive/languagemanual-ddl.html#LanguageManualDDL-Partitionedtables
https://cwiki.apache.org/Hive/languagemanual-dml.html#LanguageManualDML-Loadingfilesintotables
https://cwiki.apache.org/Hive/tutorial.html#Tutorial-PartitionBasedQuery


Regards
Bejoy KS

Re: Advice on Migrating to hadoop + hive

Posted by Michael Segel <mi...@hotmail.com>.
You can get rid of Postgres and go with Hive. 
You may want to consider setting up an external table so you can just drop your logs into place. 
(Define the table once in Hive's metastore, and then just drop data into the corresponding directories/partitions.) 
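
For example (the table name and paths are made up for illustration):

    -- Hypothetical external table over log files already sitting in HDFS (or S3).
    CREATE EXTERNAL TABLE raw_logs (
      line STRING
    )
    PARTITIONED BY (log_date STRING)
    LOCATION '/data/logs/';

    -- Register each new day's directory as a partition; because the table is
    -- EXTERNAL, dropping it later removes only the metadata, not the files.
    ALTER TABLE raw_logs ADD PARTITION (log_date = '2012-09-26')
    LOCATION '/data/logs/2012-09-26/';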

Tools? 
Karmasphere and others. 

Sorry for the terse post; this should point you in the right direction. Also check out the new Hive book, which should hit the streets in the next couple of weeks. 

