You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Time Less <ti...@gmail.com> on 2011/08/22 23:34:34 UTC

One Schema Per Partition? (Multiple schemas per table?)

I found a set of slides from Facebook online about Hive that claims you can
have a schema per partition in the table, this is exciting to us, because we
have a table like so:

id     int
name   string
level  int
date   string

And it's broken up into partitions by date. However, on a particular date
last year, the table dramatically changed its schema to:

id       int
level    int
date     string
name_id  int

So now if I do "select * from table" in hive, the data is completely garbled
for whichever portion of data doesn't fit the Hive schema. We are
considering re-writing the datafiles so they're the same before/after that
date, but if Hive supports having two entirely different schemas depending
on the partition, that'd be really convenient, since these datafiles are
hundreds of gigabytes in size (and we do sort of like the idea of knowing
how the datafile looked back then...).

This page:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatementsdoesn't
seem to have an appropriate example, so I'm left wondering.

Has anyone done anything like this?

-- 
Tim Ellis
Data Architect, Riot Games

Re: One Schema Per Partition? (Multiple schemas per table?)

Posted by Ashutosh Chauhan <ha...@apache.org>.

How is it broken? What was the result you were expecting?

Ashutosh

On Thu, Oct 6, 2011 at 11:13, Time Less <ti...@gmail.com> wrote:

> I have finally gotten around to testing this functionality, and it would
> doesn't work. The ALTER table change columns command just changes the
> metadata for the table, not for the partition. Follows is exactly what I did
> to test this, and the (broken) result:
>
> hive> create table multischema_test (
>     >   id      int,
>     >   crdt    string,
>     >   name    string,
>     >   age     int
>     > )
>     > partitioned by (dtp string)
>     > row format delimited
>     > fields terminated by '\t'
>     > lines terminated by '\n'
>     > stored as textfile;
> OK
> Time taken: 0.345 seconds
> hive> alter table multischema_test add partition (dtp=20110101) location
> '/user/hive/warehouse/test/multischema_test/20110101';
> OK
> Time taken: 0.662 seconds
> hive> alter table multischema_test replace columns (id int, name string,
> gender string, age int, crdt string) ;
> OK
> Time taken: 0.119 seconds
> hive> alter table multischema_test add partition (dtp=20110102) location
> '/user/hive/warehouse/test/multischema_test/20110102';
> OK
> Time taken: 0.186 seconds
> hive> select * from multischema_test ;
> OK
> 1    2010-07-01    Jeff    32    NULL    20110101
> 2    2010-07-01    Lisa    33    NULL    20110101
> 3    2010-07-01    Bob    22    NULL    20110101
> 4    2010-07-01    Fred    27    NULL    20110101
> 100    Gregory    Male    45    2010-08-01    20110102
> 101    Horus    Male    14    2010-08-01    20110102
> 102    Verdann    Male    33    2010-08-01    20110102
> 103    Gennefer    Female    32    2010-08-01    20110102
> Time taken: 0.348 seconds
> hive> select name,gender from multischema_test ;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_201108291505_28322, Tracking URL =
> http://laxhadoop1-001:50030/jobdetails.jsp?jobid=job_201108291505_28322
> Kill Command = /usr/lib/hadoop/bin/hadoop job
> -Dmapred.job.tracker=laxhadoop1-001:54311 -kill job_201108291505_28322
> 2011-10-06 11:02:27,099 Stage-1 map = 0%,  reduce = 0%
> 2011-10-06 11:02:31,129 Stage-1 map = 100%,  reduce = 0%
> 2011-10-06 11:02:33,144 Stage-1 map = 100%,  reduce = 100%
> Ended Job = job_201108291505_28322
> OK
> Gregory    Male
> Horus    Male
> Verdann    Male
> Gennefer    Female
> 2010-07-01    Jeff
> 2010-07-01    Lisa
> 2010-07-01    Bob
> 2010-07-01    Fred
> Time taken: 8.977 seconds
> hive>
>
> The two text files that make up the data in the table are like this:
>
> [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110101 | hadoop
> fs -put - /user/hive/warehouse/test/multischema_test/20110101/datafile
> [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110102 | hadoop
> fs -put - /user/hive/warehouse/test/multischema_test/20110102/datafile
> [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110101
> 1    2010-07-01    Jeff    32
> 2    2010-07-01    Lisa    33
> 3    2010-07-01    Bob    22
> 4    2010-07-01    Fred    27
> [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110102
> 100    Gregory    Male    45    2010-08-01
> 101    Horus    Male    14    2010-08-01
> 102    Verdann    Male    33    2010-08-01
> 103    Gennefer    Female    32    2010-08-01
>
> Did I do something wrong?
>
>
>
> On Mon, Aug 29, 2011 at 10:46 PM, Ashutosh Chauhan <ha...@apache.org>wrote:
>
>> Hi Tim,
>>
>> I figured that both reading the code and manual. I don't think
>> its explicitly documented anywhere, so it will be great if you document
>> this. This page looks right place where this place of information can live.
>>   Thanks for the help in making Hive better.
>>
>> Ashutosh
>>
>> On Mon, Aug 29, 2011 at 15:26, Time Less <ti...@gmail.com> wrote:
>>
>>> Hello, Ashutosh,
>>>
>>> I did nothing like that... :)
>>>
>>> It seems the problem here is I didn't RTFM. Perchance, could you say
>>> where you figured this out? I am going from the Hive DDL page on
>>> confluence[1], and although it mentions partitions and it mentions the
>>> "replace columns" you've mentioned here, it doesn't mention them together
>>> that I see. I would like to document this for future generations. Is that
>>> the proper page where I'd document this?
>>>
>>> I would probably explicitly create a section titled "Different Schemas
>>> per Partition" and basically give the syntax you give (from quoted, assuming
>>> after I test it, it works).
>>>
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements
>>>
>>>
>>> On Wed, Aug 24, 2011 at 6:14 PM, Ashutosh Chauhan <ha...@apache.org>wrote:
>>>
>>>> Hey Tim,
>>>>
>>>> Hive does support different schema's for different partitions. If your
>>>> data comes out garbled, that seems to be a bug then. In your case, is the
>>>> following sequence of steps resemble what you did:
>>>>
>>>> a) create table tbl (id: int, name: string, level: int) partitioned by
>>>> date;
>>>> b) -- add partitions
>>>> c) alter table tbl replace columns (id: int, level: int, name_id: int)
>>>> d) -- add more partitions.
>>>>
>>>> If you do select * from tbl, then this should work. You need not to
>>>> rewrite any of your data. Can you provide more info about what output you
>>>> were expecting and what you got. Are there any error logs?
>>>>
>>>> Ashutosh
>>>>
>>>>
>>>> On Mon, Aug 22, 2011 at 14:34, Time Less <ti...@gmail.com>wrote:
>>>>
>>>>> I found a set of slides from Facebook online about Hive that claims you
>>>>> can have a schema per partition in the table, this is exciting to us,
>>>>> because we have a table like so:
>>>>>
>>>>> id     int
>>>>> name   string
>>>>> level  int
>>>>> date   string
>>>>>
>>>>> And it's broken up into partitions by date. However, on a particular
>>>>> date last year, the table dramatically changed its schema to:
>>>>>
>>>>> id       int
>>>>> level    int
>>>>> date     string
>>>>> name_id  int
>>>>>
>>>>> So now if I do "select * from table" in hive, the data is completely
>>>>> garbled for whichever portion of data doesn't fit the Hive schema. We are
>>>>> considering re-writing the datafiles so they're the same before/after that
>>>>> date, but if Hive supports having two entirely different schemas depending
>>>>> on the partition, that'd be really convenient, since these datafiles are
>>>>> hundreds of gigabytes in size (and we do sort of like the idea of knowing
>>>>> how the datafile looked back then...).
>>>>>
>>>>> This page:
>>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatementsdoesn't seem to have an appropriate example, so I'm left wondering.
>>>>>
>>>>> Has anyone done anything like this?
>>>>>
>>>>> --
>>>>> Tim Ellis
>>>>> Data Architect, Riot Games
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Tim
>>>
>>
>>
>
>
> --
> Tim
>

Re: One Schema Per Partition? (Multiple schemas per table?)

Posted by Time Less <ti...@gmail.com>.

I have finally gotten around to testing this functionality, and it would
doesn't work. The ALTER table change columns command just changes the
metadata for the table, not for the partition. Follows is exactly what I did
to test this, and the (broken) result:

hive> create table multischema_test (
    >   id      int,
    >   crdt    string,
    >   name    string,
    >   age     int
    > )
    > partitioned by (dtp string)
    > row format delimited
    > fields terminated by '\t'
    > lines terminated by '\n'
    > stored as textfile;
OK
Time taken: 0.345 seconds
hive> alter table multischema_test add partition (dtp=20110101) location
'/user/hive/warehouse/test/multischema_test/20110101';
OK
Time taken: 0.662 seconds
hive> alter table multischema_test replace columns (id int, name string,
gender string, age int, crdt string) ;
OK
Time taken: 0.119 seconds
hive> alter table multischema_test add partition (dtp=20110102) location
'/user/hive/warehouse/test/multischema_test/20110102';
OK
Time taken: 0.186 seconds
hive> select * from multischema_test ;
OK
1    2010-07-01    Jeff    32    NULL    20110101
2    2010-07-01    Lisa    33    NULL    20110101
3    2010-07-01    Bob    22    NULL    20110101
4    2010-07-01    Fred    27    NULL    20110101
100    Gregory    Male    45    2010-08-01    20110102
101    Horus    Male    14    2010-08-01    20110102
102    Verdann    Male    33    2010-08-01    20110102
103    Gennefer    Female    32    2010-08-01    20110102
Time taken: 0.348 seconds
hive> select name,gender from multischema_test ;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201108291505_28322, Tracking URL =
http://laxhadoop1-001:50030/jobdetails.jsp?jobid=job_201108291505_28322
Kill Command = /usr/lib/hadoop/bin/hadoop job
-Dmapred.job.tracker=laxhadoop1-001:54311 -kill job_201108291505_28322
2011-10-06 11:02:27,099 Stage-1 map = 0%,  reduce = 0%
2011-10-06 11:02:31,129 Stage-1 map = 100%,  reduce = 0%
2011-10-06 11:02:33,144 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201108291505_28322
OK
Gregory    Male
Horus    Male
Verdann    Male
Gennefer    Female
2010-07-01    Jeff
2010-07-01    Lisa
2010-07-01    Bob
2010-07-01    Fred
Time taken: 8.977 seconds
hive>

The two text files that make up the data in the table are like this:

[hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110101 | hadoop fs
-put - /user/hive/warehouse/test/multischema_test/20110101/datafile
[hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110102 | hadoop fs
-put - /user/hive/warehouse/test/multischema_test/20110102/datafile
[hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110101
1    2010-07-01    Jeff    32
2    2010-07-01    Lisa    33
3    2010-07-01    Bob    22
4    2010-07-01    Fred    27
[hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110102
100    Gregory    Male    45    2010-08-01
101    Horus    Male    14    2010-08-01
102    Verdann    Male    33    2010-08-01
103    Gennefer    Female    32    2010-08-01

Did I do something wrong?


On Mon, Aug 29, 2011 at 10:46 PM, Ashutosh Chauhan <ha...@apache.org>wrote:

> Hi Tim,
>
> I figured that both reading the code and manual. I don't think
> its explicitly documented anywhere, so it will be great if you document
> this. This page looks right place where this place of information can live.
>   Thanks for the help in making Hive better.
>
> Ashutosh
>
> On Mon, Aug 29, 2011 at 15:26, Time Less <ti...@gmail.com> wrote:
>
>> Hello, Ashutosh,
>>
>> I did nothing like that... :)
>>
>> It seems the problem here is I didn't RTFM. Perchance, could you say where
>> you figured this out? I am going from the Hive DDL page on confluence[1],
>> and although it mentions partitions and it mentions the "replace columns"
>> you've mentioned here, it doesn't mention them together that I see. I would
>> like to document this for future generations. Is that the proper page where
>> I'd document this?
>>
>> I would probably explicitly create a section titled "Different Schemas per
>> Partition" and basically give the syntax you give (from quoted, assuming
>> after I test it, it works).
>>
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements
>>
>>
>> On Wed, Aug 24, 2011 at 6:14 PM, Ashutosh Chauhan <ha...@apache.org>wrote:
>>
>>> Hey Tim,
>>>
>>> Hive does support different schema's for different partitions. If your
>>> data comes out garbled, that seems to be a bug then. In your case, is the
>>> following sequence of steps resemble what you did:
>>>
>>> a) create table tbl (id: int, name: string, level: int) partitioned by
>>> date;
>>> b) -- add partitions
>>> c) alter table tbl replace columns (id: int, level: int, name_id: int)
>>> d) -- add more partitions.
>>>
>>> If you do select * from tbl, then this should work. You need not to
>>> rewrite any of your data. Can you provide more info about what output you
>>> were expecting and what you got. Are there any error logs?
>>>
>>> Ashutosh
>>>
>>>
>>> On Mon, Aug 22, 2011 at 14:34, Time Less <ti...@gmail.com> wrote:
>>>
>>>> I found a set of slides from Facebook online about Hive that claims you
>>>> can have a schema per partition in the table, this is exciting to us,
>>>> because we have a table like so:
>>>>
>>>> id     int
>>>> name   string
>>>> level  int
>>>> date   string
>>>>
>>>> And it's broken up into partitions by date. However, on a particular
>>>> date last year, the table dramatically changed its schema to:
>>>>
>>>> id       int
>>>> level    int
>>>> date     string
>>>> name_id  int
>>>>
>>>> So now if I do "select * from table" in hive, the data is completely
>>>> garbled for whichever portion of data doesn't fit the Hive schema. We are
>>>> considering re-writing the datafiles so they're the same before/after that
>>>> date, but if Hive supports having two entirely different schemas depending
>>>> on the partition, that'd be really convenient, since these datafiles are
>>>> hundreds of gigabytes in size (and we do sort of like the idea of knowing
>>>> how the datafile looked back then...).
>>>>
>>>> This page:
>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatementsdoesn't seem to have an appropriate example, so I'm left wondering.
>>>>
>>>> Has anyone done anything like this?
>>>>
>>>> --
>>>> Tim Ellis
>>>> Data Architect, Riot Games
>>>>
>>>>
>>>
>>
>>
>> --
>> Tim
>>
>
>


-- 
Tim

Re: One Schema Per Partition? (Multiple schemas per table?)

Posted by Ashutosh Chauhan <ha...@apache.org>.

Hi Tim,

I figured that both reading the code and manual. I don't think
its explicitly documented anywhere, so it will be great if you document
this. This page looks right place where this place of information can live.
  Thanks for the help in making Hive better.

Ashutosh
On Mon, Aug 29, 2011 at 15:26, Time Less <ti...@gmail.com> wrote:

> Hello, Ashutosh,
>
> I did nothing like that... :)
>
> It seems the problem here is I didn't RTFM. Perchance, could you say where
> you figured this out? I am going from the Hive DDL page on confluence[1],
> and although it mentions partitions and it mentions the "replace columns"
> you've mentioned here, it doesn't mention them together that I see. I would
> like to document this for future generations. Is that the proper page where
> I'd document this?
>
> I would probably explicitly create a section titled "Different Schemas per
> Partition" and basically give the syntax you give (from quoted, assuming
> after I test it, it works).
>
>
> [1]
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements
>
>
> On Wed, Aug 24, 2011 at 6:14 PM, Ashutosh Chauhan <ha...@apache.org>wrote:
>
>> Hey Tim,
>>
>> Hive does support different schema's for different partitions. If your
>> data comes out garbled, that seems to be a bug then. In your case, is the
>> following sequence of steps resemble what you did:
>>
>> a) create table tbl (id: int, name: string, level: int) partitioned by
>> date;
>> b) -- add partitions
>> c) alter table tbl replace columns (id: int, level: int, name_id: int)
>> d) -- add more partitions.
>>
>> If you do select * from tbl, then this should work. You need not to
>> rewrite any of your data. Can you provide more info about what output you
>> were expecting and what you got. Are there any error logs?
>>
>> Ashutosh
>>
>>
>> On Mon, Aug 22, 2011 at 14:34, Time Less <ti...@gmail.com> wrote:
>>
>>> I found a set of slides from Facebook online about Hive that claims you
>>> can have a schema per partition in the table, this is exciting to us,
>>> because we have a table like so:
>>>
>>> id     int
>>> name   string
>>> level  int
>>> date   string
>>>
>>> And it's broken up into partitions by date. However, on a particular date
>>> last year, the table dramatically changed its schema to:
>>>
>>> id       int
>>> level    int
>>> date     string
>>> name_id  int
>>>
>>> So now if I do "select * from table" in hive, the data is completely
>>> garbled for whichever portion of data doesn't fit the Hive schema. We are
>>> considering re-writing the datafiles so they're the same before/after that
>>> date, but if Hive supports having two entirely different schemas depending
>>> on the partition, that'd be really convenient, since these datafiles are
>>> hundreds of gigabytes in size (and we do sort of like the idea of knowing
>>> how the datafile looked back then...).
>>>
>>> This page:
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatementsdoesn't seem to have an appropriate example, so I'm left wondering.
>>>
>>> Has anyone done anything like this?
>>>
>>> --
>>> Tim Ellis
>>> Data Architect, Riot Games
>>>
>>>
>>
>
>
> --
> Tim
>

Re: One Schema Per Partition? (Multiple schemas per table?)

Posted by Time Less <ti...@gmail.com>.

Hello, Ashutosh,

I did nothing like that... :)

It seems the problem here is I didn't RTFM. Perchance, could you say where
you figured this out? I am going from the Hive DDL page on confluence[1],
and although it mentions partitions and it mentions the "replace columns"
you've mentioned here, it doesn't mention them together that I see. I would
like to document this for future generations. Is that the proper page where
I'd document this?

I would probably explicitly create a section titled "Different Schemas per
Partition" and basically give the syntax you give (from quoted, assuming
after I test it, it works).


[1]
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements

On Wed, Aug 24, 2011 at 6:14 PM, Ashutosh Chauhan <ha...@apache.org>wrote:

> Hey Tim,
>
> Hive does support different schema's for different partitions. If your data
> comes out garbled, that seems to be a bug then. In your case, is the
> following sequence of steps resemble what you did:
>
> a) create table tbl (id: int, name: string, level: int) partitioned by
> date;
> b) -- add partitions
> c) alter table tbl replace columns (id: int, level: int, name_id: int)
> d) -- add more partitions.
>
> If you do select * from tbl, then this should work. You need not to rewrite
> any of your data. Can you provide more info about what output you were
> expecting and what you got. Are there any error logs?
>
> Ashutosh
>
>
> On Mon, Aug 22, 2011 at 14:34, Time Less <ti...@gmail.com> wrote:
>
>> I found a set of slides from Facebook online about Hive that claims you
>> can have a schema per partition in the table, this is exciting to us,
>> because we have a table like so:
>>
>> id     int
>> name   string
>> level  int
>> date   string
>>
>> And it's broken up into partitions by date. However, on a particular date
>> last year, the table dramatically changed its schema to:
>>
>> id       int
>> level    int
>> date     string
>> name_id  int
>>
>> So now if I do "select * from table" in hive, the data is completely
>> garbled for whichever portion of data doesn't fit the Hive schema. We are
>> considering re-writing the datafiles so they're the same before/after that
>> date, but if Hive supports having two entirely different schemas depending
>> on the partition, that'd be really convenient, since these datafiles are
>> hundreds of gigabytes in size (and we do sort of like the idea of knowing
>> how the datafile looked back then...).
>>
>> This page:
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatementsdoesn't seem to have an appropriate example, so I'm left wondering.
>>
>> Has anyone done anything like this?
>>
>> --
>> Tim Ellis
>> Data Architect, Riot Games
>>
>>
>


-- 
Tim

Re: One Schema Per Partition? (Multiple schemas per table?)

Posted by Ashutosh Chauhan <ha...@apache.org>.

Hey Tim,

Hive does support different schema's for different partitions. If your data
comes out garbled, that seems to be a bug then. In your case, is the
following sequence of steps resemble what you did:

a) create table tbl (id: int, name: string, level: int) partitioned by date;
b) -- add partitions
c) alter table tbl replace columns (id: int, level: int, name_id: int)
d) -- add more partitions.

If you do select * from tbl, then this should work. You need not to rewrite
any of your data. Can you provide more info about what output you were
expecting and what you got. Are there any error logs?

Ashutosh

On Mon, Aug 22, 2011 at 14:34, Time Less <ti...@gmail.com> wrote:

> I found a set of slides from Facebook online about Hive that claims you can
> have a schema per partition in the table, this is exciting to us, because we
> have a table like so:
>
> id     int
> name   string
> level  int
> date   string
>
> And it's broken up into partitions by date. However, on a particular date
> last year, the table dramatically changed its schema to:
>
> id       int
> level    int
> date     string
> name_id  int
>
> So now if I do "select * from table" in hive, the data is completely
> garbled for whichever portion of data doesn't fit the Hive schema. We are
> considering re-writing the datafiles so they're the same before/after that
> date, but if Hive supports having two entirely different schemas depending
> on the partition, that'd be really convenient, since these datafiles are
> hundreds of gigabytes in size (and we do sort of like the idea of knowing
> how the datafile looked back then...).
>
> This page:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatementsdoesn't seem to have an appropriate example, so I'm left wondering.
>
> Has anyone done anything like this?
>
> --
> Tim Ellis
> Data Architect, Riot Games
>
>