You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Marc Canaleta <mc...@gmail.com> on 2012/11/30 09:59:51 UTC
Storing data in TSV with changing headers
Hi all!
We want to use hive to analyze our logs. Our logs will be TSV files, one
per hour, and as it is possible that we add/remove more columns in the
future, we will include headers (column names) in each file.
So it is possible that two TSV files for different days/hours have
different headers.
Is it possible to do this with Hive?
Thanks!
Re: Storing data in TSV with changing headers
Posted by Mark Grover <gr...@gmail.com>.
Hi Marc,
While what Dean said is true for different schemas in general, there is a
way to do it all in the same table if the schema changes to the TSV file
are just additions of new tab-separated columns at the very end of each row
and no existing columns are being deleted.
Let's say your TSV file looked like
a_val1 b_val1 c_val1
a_val2 b_val2 c_val2
Your table definition could look like:
CREATE TABLE t1(a STRING, b STRING, c STRING) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t';
Now say you insert more data into this table (externally through HDFS
commands) which looks like
a_val3 b_val3 c_val3 d_val3
a_val4 b_val4 c_val4 d_val4
Now, if you haven't changed the table definition just yet, it still says 3
columns, so a select * from t1 would still give you 3 columns. In order
words, the newly added 4th column doesn't get read and doesn't have any
adverse effect on the table if the table definition is still old.
a_val1 b_val1 c_val1
a_val2 b_val2 c_val2
a_val3 b_val3 c_val3
a_val4 b_val4 c_val4
Now let's go ahead and change the table definition. You could use one of
the ALTER TABLE commands for this. I personally like to create external
tables, drop the tables and recreate them on top of the same data.
Your new table create statement may look something like:
CREATE TABLE t1(a STRING, b STRING, c STRING,d STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
A select * from t1 in this case would reveal:
a_val1 b_val1 c_val1 NULL
a_val2 b_val2 c_val2 NULL
a_val3 b_val3 c_val3 d_val3
a_val4 b_val4 c_val4 d_val4
New columns show up correctly. Values from old data for which those columns
didn't exist show up as NULLs.
So, to summarize, a single table may work, if the following 3 conditions
are met.
1. New columns get added to the very end of the existing columns
2. No existing columns are deleted
3. You are ok with NULL showing up as value for old records that don't have
a given new column.
I tested the above with CSV files, TSV should be the same as well.
Mark
On Fri, Nov 30, 2012 at 5:40 AM, Dean Wampler <
dean.wampler@thinkbiganalytics.com> wrote:
> You'll have to define separate tables for the different schemas. You can
> "unify" them in a query with the union feature. You should also remove the
> header lines in the files, if you still have them, because Hive does not
> ignore them, but treats them as "data".
>
> dean
>
>
> On Fri, Nov 30, 2012 at 2:59 AM, Marc Canaleta <mc...@gmail.com>wrote:
>
>> Hi all!
>>
>> We want to use hive to analyze our logs. Our logs will be TSV files, one
>> per hour, and as it is possible that we add/remove more columns in the
>> future, we will include headers (column names) in each file.
>>
>> So it is possible that two TSV files for different days/hours have
>> different headers.
>>
>> Is it possible to do this with Hive?
>>
>> Thanks!
>>
>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>
>
Re: Storing data in TSV with changing headers
Posted by Dean Wampler <de...@thinkbiganalytics.com>.
You'll have to define separate tables for the different schemas. You can
"unify" them in a query with the union feature. You should also remove the
header lines in the files, if you still have them, because Hive does not
ignore them, but treats them as "data".
dean
On Fri, Nov 30, 2012 at 2:59 AM, Marc Canaleta <mc...@gmail.com> wrote:
> Hi all!
>
> We want to use hive to analyze our logs. Our logs will be TSV files, one
> per hour, and as it is possible that we add/remove more columns in the
> future, we will include headers (column names) in each file.
>
> So it is possible that two TSV files for different days/hours have
> different headers.
>
> Is it possible to do this with Hive?
>
> Thanks!
>
--
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330