Posted to user@hive.apache.org by Ke...@wellsfargo.com on 2009/11/25 00:24:10 UTC
Modifying data before importation into Hive
Hello,
I'm using Cloudera's hive-0.4.0+14.tar.gz with hadoop-0.20.1+152.tar.gz on a CentOS machine.
I've been able to load syslog files into Hive using the RegexSerDe class - this works great. But what if your log files are missing a column, or the data needs to be manipulated in some way before being put in the table? In our case, we'd like to add a YEAR column as it's not included in the log files. We'd like to avoid having to rewrite all the logs to put them in that format though.
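For reference, this is the kind of table definition meant here; the jar path, regex, and columns below are only an illustration, not our actual schema:

```sql
-- hive_contrib.jar ships the RegexSerDe; the path is illustrative.
ADD JAR /usr/lib/hive/lib/hive_contrib.jar;

CREATE TABLE syslogs (month STRING, day STRING, time STRING, host STRING, msg STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\w+)\\s+(\\d+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)"
)
STORED AS TEXTFILE;
```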
One suggestion from Ashish to a user was to do something like a left outer join with data staged in another table and to target the results into a table with the desired structure. But the lines of our log file don't have a unique key we could use to do such a join - just things like host, day, month, etc.
Is there any other way to add information in conjunction with doing LOAD DATA INPATH, given that we can't add data after it's in the table?
Thanks
Ken
RE: Modifying data before importation into Hive
Posted by Ashish Thusoo <at...@facebook.com>.
There is a way.
Create an external table on that data set and then run a query
INSERT ... yourtable
SELECT ... FROM external_table
The idea being that you put all the transformation logic in the select list of the query.
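A minimal sketch of this, with hypothetical table and column names (the year is supplied as a constant in the select list):

```sql
-- External table over the raw log files; dropping it later does not
-- delete the underlying data.
CREATE EXTERNAL TABLE raw_logs (host STRING, month STRING, day STRING, msg STRING)
LOCATION '/user/logs/raw';

-- The permanent table gains the YEAR column during the copy.
INSERT OVERWRITE TABLE logs
SELECT host, month, day, msg, '2009' FROM raw_logs;
```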
Ashish
Re: Modifying data before importation into Hive
Posted by Carl Steinbach <ca...@cloudera.com>.
> In general, though, if there’s a need to, say, transform or add to data at LOAD
> time before putting it in a Hive table, there’s as yet no way to do it, right?
>
One option is to create a temporary EXTERNAL table using your input data,
and then insert this data into a permanent table using a select/transform
statement. I think this is largely equivalent to doing a TRANSFORM at LOAD
time, the main difference being that the external table's data source is not
deleted when the table is created or dropped.
Carl
RE: Modifying data before importation into Hive
Posted by Ke...@wellsfargo.com.
Thanks Ashish.
In general, though, if there's a need to, say, transform or add to data at LOAD time before putting it in a Hive table, there's as yet no way to do it, right?
RE: Modifying data before importation into Hive
Posted by Ashish Thusoo <at...@facebook.com>.
For this particular case, the LOAD DATA approach is better: it does not do any transformation to the data, yet adds the information that you want as a partitioning column. So I would recommend that and not the SELECT TRANSFORM approach.
Regarding the SELECT TRANSFORM approach, it would lead to a full scan and transform of the data and would keep redundant values in the data itself, so it is probably overkill for your particular case.
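For comparison, the SELECT TRANSFORM variant being weighed here would look roughly like this; the script name and columns are hypothetical:

```sql
-- add_year.py would read tab-separated rows on stdin and write them
-- back with a year field appended; every row is rescanned and rewritten.
ADD FILE add_year.py;

FROM raw_logs
INSERT OVERWRITE TABLE logs
SELECT TRANSFORM (host, month, day, msg)
USING 'python add_year.py'
AS (host, month, day, msg, year);
```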
Ashish
RE: Modifying data before importation into Hive
Posted by Ke...@wellsfargo.com.
Ashish,
We just want to add it because it isn't present in the log data. It will have the static value 2009 for this year's files, and starting next year we'll put in the value 2010.
I was thinking I could write a Python script that will simply load the files one at a time, using the date as a partition, like in the README:
LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
Ultimately we want to run Hive queries that search for certain patterns occurring in the log lines between certain dates, so we could include the partition in the search to do that.
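A sketch of such a loader, assuming the partition date can be parsed out of each file name (the naming scheme, table name, and directory layout here are assumptions, not our actual setup):

```python
import os
import re
import subprocess

def build_load_statement(path, table="syslogs"):
    """Build a LOAD DATA statement for one log file, deriving the
    partition value from a date embedded in the file name
    (e.g. messages-2008-08-15.log -- a hypothetical naming scheme)."""
    m = re.search(r"(\d{4}-\d{2}-\d{2})", os.path.basename(path))
    if not m:
        raise ValueError("no date in file name: %s" % path)
    return ("LOAD DATA LOCAL INPATH '%s' "
            "INTO TABLE %s PARTITION (ds='%s');" % (path, table, m.group(1)))

def load_logs(log_dir):
    """Run one LOAD DATA statement per file via the hive CLI."""
    for name in sorted(os.listdir(log_dir)):
        stmt = build_load_statement(os.path.join(log_dir, name))
        subprocess.check_call(["hive", "-e", stmt])
```

Each call shells out to the hive CLI once per file; with many files, the generated statements could instead be written to one script and run with hive -f.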
Would using SELECT TRANSFORM with a python script be another approach?
-Ken
RE: Modifying data before importation into Hive
Posted by Ashish Thusoo <at...@facebook.com>.
How are you generating the value in the YEAR column? Is it a static value or something that gets computed from the data?
Ashish