Posted to common-user@hadoop.apache.org by Raj Hadoop <ha...@yahoo.com> on 2014/02/10 03:45:29 UTC

Add a few records to a Hive table or an HDFS file on a daily basis



Hi,

My requirement is a typical data warehouse and ETL requirement. I need to accomplish:

1) Daily insert of transaction records into a Hive table or an HDFS file. This table or file is not big (approximately 10 records per day), and I don't want to partition the table/file.


I have been reading a few articles on this. They mention loading into a staging table in Hive first and then inserting like below:

insert overwrite table finaltable select * from staging;


I don't follow this logic. How should I populate the staging table daily?

Thanks,
Raj

Re: Add a few records to a Hive table or an HDFS file on a daily basis

Posted by Mirko Kämpf <mi...@gmail.com>.
Hi Raj,

there is no way to append data to an existing file in HDFS as long as the
append functionality is not available. Adding new "records" to a Hive table
therefore means creating a new file that contains those records. You do this
in the "staging" table, which can become inefficient for large data sets,
especially if you run MapReduce jobs over it: after two years you would see
more than 700 small files.
To get all records into one file, you run an aggregation step with the
select statement you mentioned. The SELECT * reads all the small files and,
depending on the number of reducers (there should be only one in this
case), writes a single file containing all the records for the "finaltable".
The same could be done with a MapReduce job that uses the identity mapper,
the identity reducer, and numReducers = 1.
Populating the staging table just means adding a new file with the new
records to the HDFS folder that contains the table's data each day.

Best wishes
Mirko
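The flow described above - small daily files accumulating in the staging folder, then a single-reducer rewrite into one file - can be sketched with a local-filesystem simulation. All directory names, file names, and sample records below are hypothetical stand-ins; on a real cluster you would use `hdfs dfs -put` and Hive's INSERT OVERWRITE instead.

```shell
set -e
# Local directories standing in for the HDFS folders of the two tables.
STAGING=$(mktemp -d)/staging   # the staging table's folder
FINAL=$(mktemp -d)/final       # the final table's folder
mkdir -p "$STAGING" "$FINAL"

# Each day, one new small file with that day's ~10 records lands in staging.
echo "2014-02-10,txn-001,42.50" > "$STAGING/day_2014-02-10.csv"
echo "2014-02-11,txn-002,17.25" > "$STAGING/day_2014-02-11.csv"

# The compaction step: like INSERT OVERWRITE running with a single reducer,
# it reads every small file and rewrites them all as one output file.
cat "$STAGING"/*.csv | sort > "$FINAL/000000_0"

wc -l < "$FINAL/000000_0"   # both daily records are now in one file
```

The point of the simulation is the shape of the data flow: appends create one file per day, and the periodic rewrite is what keeps the file count bounded.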



2014-02-10 3:45 GMT+01:00 Raj Hadoop <ha...@yahoo.com>:

>
>
> Hi,
>
> My requirement is a typical Datawarehouse and ETL requirement. I need to
> accomplish
>
> 1) Daily Insert transaction records to a Hive table or a HDFS file. This
> table or file is not a big table ( approximately 10 records per day). I
> don't want to Partition the table / file.
>
>
> I am reading a few articles on this. It was being mentioned that we need
> to load to a staging table in Hive. And then insert like the below :
>
>  insert overwrite table finaltable select * from staging;
>
>  I am not getting this logic. How should I populate the staging table
> daily.
>
>  Thanks,
>  Raj
>
>
>

Re: Add a few records to a Hive table or an HDFS file on a daily basis

Posted by pa...@gmail.com.
Why not use INSERT INTO to append the new data?

a) Load the new data into the staging table.

b) INSERT INTO the final table.
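The key difference from INSERT OVERWRITE is that in Hive, INSERT INTO appends by adding new files to the table's folder rather than replacing its contents. A local-filesystem sketch (hypothetical paths and records, standing in for the HDFS folder of the final table):

```shell
set -e
# Local directory standing in for the final table's HDFS folder.
FINAL=$(mktemp -d)/finaltable
mkdir -p "$FINAL"

# Day 1's INSERT INTO writes a new file into the final table's folder.
echo "2014-02-10,txn-001" > "$FINAL/000000_0"
# Day 2's INSERT INTO writes another file; day 1's file is left untouched.
echo "2014-02-11,txn-002" > "$FINAL/000000_0_copy_1"

ls "$FINAL" | wc -l   # two files: appends accumulate, nothing is overwritten
```

This avoids the rewrite step, at the cost of accumulating one small file per load, which is the small-files concern raised elsewhere in this thread.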

Sent from Windows Mail

From: Raj Hadoop
Sent: Monday, 10 February 2014 08:15
To: user, User

Hi,

My requirement is a typical data warehouse and ETL requirement. I need to accomplish:

1) Daily insert of transaction records into a Hive table or an HDFS file. This table or file is not big (approximately 10 records per day), and I don't want to partition the table/file.

I have been reading a few articles on this. They mention loading into a staging table in Hive first and then inserting like below:

insert overwrite table finaltable select * from staging;

I don't follow this logic. How should I populate the staging table daily?

Thanks,

Raj

Re: Add a few records to a Hive table or an HDFS file on a daily basis

Posted by Peyman Mohajerian <mo...@gmail.com>.
The staging table is typically defined as an external Hive table: the data
is loaded directly onto HDFS, the staging table reads it from there, and
your current statement then transfers it into the Hive-managed table. Of
course, there are variations on this as well.
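In this pattern an external table is just a schema over a directory, so "loading the staging table" is nothing more than dropping a file into that directory; the daily statement then copies the data into the managed table's storage. A local simulation of that split (all names hypothetical):

```shell
set -e
# Directory the external staging table would point at, and a stand-in
# for the Hive-managed warehouse directory of the final table.
STAGING_DIR=$(mktemp -d)/staging
MANAGED_DIR=$(mktemp -d)/warehouse/finaltable
mkdir -p "$STAGING_DIR" "$MANAGED_DIR"

# "Load the staging table": put today's raw file where the external
# table's directory is. No Hive command is needed for this step.
echo "2014-02-10,txn-001" > "$STAGING_DIR/today.csv"

# "insert overwrite table finaltable select * from staging;" in effect
# reads the external directory and rewrites the managed one.
cat "$STAGING_DIR"/*.csv > "$MANAGED_DIR/000000_0"

cat "$MANAGED_DIR/000000_0"   # the record now lives in managed storage
```

This answers the original question directly: populating the staging table daily is just an `hdfs dfs -put` of the day's file into the external table's directory.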


On Sun, Feb 9, 2014 at 6:45 PM, Raj Hadoop <ha...@yahoo.com> wrote:

>
>
> Hi,
>
> My requirement is a typical Datawarehouse and ETL requirement. I need to
> accomplish
>
> 1) Daily Insert transaction records to a Hive table or a HDFS file. This
> table or file is not a big table ( approximately 10 records per day). I
> don't want to Partition the table / file.
>
>
> I am reading a few articles on this. It was being mentioned that we need
> to load to a staging table in Hive. And then insert like the below :
>
> insert overwrite table finaltable select * from staging;
>
> I am not getting this logic. How should I populate the staging table daily.
>
> Thanks,
> Raj
>
>
>

Re: Add a few records to a Hive table or an HDFS file on a daily basis

Posted by pandees waran <pa...@gmail.com>.
Why not use INSERT INTO to append the new records?

a) Load the new records into a staging table.
b) INSERT INTO the final table from the staging table.
On 10-Feb-2014 8:16 am, "Raj Hadoop" <ha...@yahoo.com> wrote:

>
>
> Hi,
>
> My requirement is a typical Datawarehouse and ETL requirement. I need to
> accomplish
>
> 1) Daily Insert transaction records to a Hive table or a HDFS file. This
> table or file is not a big table ( approximately 10 records per day). I
> don't want to Partition the table / file.
>
>
> I am reading a few articles on this. It was being mentioned that we need
> to load to a staging table in Hive. And then insert like the below :
>
> insert overwrite table finaltable select * from staging;
>
> I am not getting this logic. How should I populate the staging table daily.
>
> Thanks,
> Raj
>
>
>
