Posted to user@hive.apache.org by John Sichi <js...@facebook.com> on 2010/06/01 21:26:33 UTC

Re: SerDe and Rows

On May 28, 2010, at 3:49 PM, Sanjit Jhala wrote:

> John, there's some logic in the helper serialize method to serialize lists and structs. Is this used currently? I was under the impression that maps and primitives are the only types currently supported by the connector.


Yes, this logic is working.  I just now tested it interactively (see below) and will add a corresponding unit test when I work on HIVE-1245.

I'm not sure what is going on with the JSON-vs-delimited stuff; in my test it looks like it is coming out as delimited based on what I see from the HBase side.  There is a setUseJSONSerialize method but currently nothing invokes it; it would make sense to include this in the HIVE-1245 work as part of controlling how values are stored within HBase.

JVS

----

hive> CREATE TABLE complex(
    >     key string, 
    >     a array<string>, 
    >     s struct<col1 : int, col2 : int>)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES (
    > "hbase.columns.mapping" = "cf:a, cf:s"
    > );
OK
hive> 
    > INSERT OVERWRITE TABLE complex 
    > SELECT bar, array('x', 'y', 'z'), struct(100, 200)
    > FROM pokes
    > WHERE foo=497;
...
OK
hive> 
    > SELECT * FROM complex;
OK
val_497	["x","y","z"]	{"col1":100,"col2":200}

hbase(main):003:0> scan 'complex'
ROW                          COLUMN+CELL                                                                      
 val_497                     column=cf:s, timestamp=1275419258650, value=100\x02200
 val_497                     column=cf:a, timestamp=1275419258650, value=x\x02y\x02z                         
1 row(s) in 1.0250 seconds
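[Editor's note: the \x02-delimited cell values visible in the scan above can be illustrated with a short sketch. This is plain Python written for this note, not Hive's actual LazySimpleSerDe code; the field names col1/col2 come from the CREATE TABLE statement earlier in the message, and the assumption that fields at this nesting level are separated by \x02 is taken directly from the scan output.]

```python
# Decode the \x02-delimited cells seen in the HBase scan output above.
# At this nesting level Hive's delimited format separates elements/fields
# with the \x02 byte.

def decode_array(cell: bytes) -> list:
    # b"x\x02y\x02z" -> ["x", "y", "z"]
    return [f.decode("utf-8") for f in cell.split(b"\x02")]

def decode_struct(cell: bytes, field_names: list) -> dict:
    # b"100\x02200" with fields ["col1", "col2"] -> {"col1": 100, "col2": 200}
    values = [int(f) for f in cell.split(b"\x02")]
    return dict(zip(field_names, values))

print(decode_array(b"x\x02y\x02z"))                    # ['x', 'y', 'z']
print(decode_struct(b"100\x02200", ["col1", "col2"]))  # {'col1': 100, 'col2': 200}
```

This matches the SELECT output shown above, where Hive renders the same cells as ["x","y","z"] and {"col1":100,"col2":200}.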


Re: SerDe and Rows

Posted by John Sichi <js...@facebook.com>.
On Jun 3, 2010, at 12:36 PM, Sanjit Jhala wrote:

> I'm wondering why the Split class needs to extend FileSplit and also why the InputFormat needs to call FileInputFormat.getInputPaths(job) in getSplits. Is this because of legacy code that needs to be cleaned up or does it get used somewhere?


Both of these are due to legacy code needing cleanup (HIVE-1133).  Currently some of the inputformat logic is based on physical paths where it should be based on logical data sources such as partitions instead.  This is also the reason why currently we are forced to create an empty directory in the file system corresponding to the name of each non-native table (HIVE-1222).

JVS


Re: SerDe and Rows

Posted by Sanjit Jhala <sj...@gmail.com>.
I'm wondering why the Split class needs to extend FileSplit and also why the
InputFormat needs to call FileInputFormat.getInputPaths(job) in getSplits.
Is this because of legacy code that needs to be cleaned up or does it get
used somewhere?

-Sanjit

On Wed, Jun 2, 2010 at 12:59 PM, Edward Capriolo <ed...@gmail.com>wrote:

>
>
> On Wed, Jun 2, 2010 at 3:17 PM, Sanjit Jhala <sj...@gmail.com> wrote:
>
>> Thanks, that sounds great! Would love to come to the meetup. Thanks for
>> all the work on the Storage Handlers, it's really nifty stuff.
>> I'm getting close on the Hypertable storage handler and will definitely
>> send out pointers once it's ready.
>>
>> -Sanjit
>>
>>
>> On Wed, Jun 2, 2010 at 11:56 AM, John Sichi <js...@facebook.com> wrote:
>>
>>> Based on some recent offline discussions, it looks like Cloudera will be
>>> taking the lead on driving the release process for 0.6, so expect to see
>>> some initial plans on that here soon.
>>>
>>> We're thinking of classifying new features and frameworks as stable vs
>>> experimental.  For 0.6, items like storage handlers will definitely be
>>> classified as experimental, meaning they'll be there in the code, but
>>> expected to continue to evolve with breaking changes until they are declared
>>> stable in a subsequent release.
>>>
>>> We would also like to start holding monthly Hive developer meetups; it
>>> will be great if someone from hypertable can attend those--it's heartening
>>> to see so much interest in building up a storage handler ecosystem.
>>>
>>> I think the snapshot you reference is fine for trunk development work.
>>>
>>> Regarding thrift, here's info on the version currently being used:
>>>
>>> http://wiki.apache.org/hadoop/Hive/HowToContribute#Generating_Code
>>>
>>> JVS
>>>
>>> On Jun 2, 2010, at 11:19 AM, Sanjit Jhala wrote:
>>>
>>> Any idea when the next Hive release is scheduled for and whether the
>>> Storage Handler code will be included ?
>>>
>>> Also I'm currently using a snapshot from the trunk at commit:
>>>
>>> *commit bf7e3b9cc6c6ceced2dec70f0971ecc91fd0dcb3*
>>> *Author: Namit Jain <na...@apache.org>
>>> Date:   Thu May 6 19:05:52 2010 +0000
>>>
>>>     HIVE-1317. CombineHiveInputFormat throws exception when partition
>>> name contains special characte
>>>     (Ning Zhang via namit)
>>>
>>>     git-svn-id:
>>> https://svn.apache.org/repos/asf/hadoop/hive/trunk@941860 13f79535-47bb-0310-9956-ff
>>> *
>>>
>>>
>>> Is this a reasonably stable commit or would you suggest another ? Also
>>> how do I figure out the corresponding Thrift version ?
>>>
>>> -Sanjit
>>>
>>>
>>>
>>>
>>> On Tue, Jun 1, 2010 at 5:36 PM, John Sichi <js...@facebook.com> wrote:
>>>
>>>> On Jun 1, 2010, at 4:45 PM, Sanjit Jhala wrote:
>>>>
>>>> > That looks cool. On a different note, it looks like the
>>>> HiveStorageHandler is based on the old Hadoop "mapred" interface. Any idea
>>>> when you plan to migrate to the "mapreduce" interface?
>>>>
>>>>
>>>> This one would be painful to do with shims, so I think it has to wait
>>>> until we drop support entirely for pre-0.20 Hadoop versions on Hive trunk.
>>>>  For Facebook, we may be ready for that within a few months; I'm not sure
>>>> about other Hive users.
>>>>
>>>> JVS
>>>>
>>>>
>>>
>>>
>> IMHO...
>
> Trunk has a lot of features that 0.5.0 does not have. All (most) of the
> development for Hive happens on the trunk. Trunk changes 2-3 times a week, so
> it is a moving target.
>
>
>
> Hive is all userspace code; anyone who understands that can have 100
> different versions of Hive in their home directory configured to the same
> metastore and HDFS.
>
> I currently have the latest Hive 0.5.0 release installed on the system path.
> /usr/bin/hive6  -> another Hive install (trunk)
>
> This gives me the best of both worlds. Users can pick and choose the Hive
> they want to run with. I am really not caught up in releases. They are good
> things, but in general I cannot wait for them.
>
> Edward
>
>
>

Re: SerDe and Rows

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Jun 2, 2010 at 3:17 PM, Sanjit Jhala <sj...@gmail.com> wrote:

> Thanks, that sounds great! Would love to come to the meetup. Thanks for
> all the work on the Storage Handlers, it's really nifty stuff.
> I'm getting close on the Hypertable storage handler and will definitely
> send out pointers once it's ready.
>
> -Sanjit
>
>
> On Wed, Jun 2, 2010 at 11:56 AM, John Sichi <js...@facebook.com> wrote:
>
>> Based on some recent offline discussions, it looks like Cloudera will be
>> taking the lead on driving the release process for 0.6, so expect to see
>> some initial plans on that here soon.
>>
>> We're thinking of classifying new features and frameworks as stable vs
>> experimental.  For 0.6, items like storage handlers will definitely be
>> classified as experimental, meaning they'll be there in the code, but
>> expected to continue to evolve with breaking changes until they are declared
>> stable in a subsequent release.
>>
>> We would also like to start holding monthly Hive developer meetups; it
>> will be great if someone from hypertable can attend those--it's heartening
>> to see so much interest in building up a storage handler ecosystem.
>>
>> I think the snapshot you reference is fine for trunk development work.
>>
>> Regarding thrift, here's info on the version currently being used:
>>
>> http://wiki.apache.org/hadoop/Hive/HowToContribute#Generating_Code
>>
>> JVS
>>
>> On Jun 2, 2010, at 11:19 AM, Sanjit Jhala wrote:
>>
>> Any idea when the next Hive release is scheduled for and whether the
>> Storage Handler code will be included ?
>>
>> Also I'm currently using a snapshot from the trunk at commit:
>>
>> *commit bf7e3b9cc6c6ceced2dec70f0971ecc91fd0dcb3*
>> *Author: Namit Jain <na...@apache.org>
>> Date:   Thu May 6 19:05:52 2010 +0000
>>
>>     HIVE-1317. CombineHiveInputFormat throws exception when partition name
>> contains special characte
>>     (Ning Zhang via namit)
>>
>>     git-svn-id: https://svn.apache.org/repos/asf/hadoop/hive/trunk@941860 13f79535-47bb-0310-9956-ff
>> *
>>
>>
>> Is this a reasonably stable commit or would you suggest another ? Also how
>> do I figure out the corresponding Thrift version ?
>>
>> -Sanjit
>>
>>
>>
>>
>> On Tue, Jun 1, 2010 at 5:36 PM, John Sichi <js...@facebook.com> wrote:
>>
>>> On Jun 1, 2010, at 4:45 PM, Sanjit Jhala wrote:
>>>
>>> > That looks cool. On a different note, it looks like the
>>> HiveStorageHandler is based on the old Hadoop "mapred" interface. Any idea
>>> when you plan to migrate to the "mapreduce" interface?
>>>
>>>
>>> This one would be painful to do with shims, so I think it has to wait
>>> until we drop support entirely for pre-0.20 Hadoop versions on Hive trunk.
>>>  For Facebook, we may be ready for that within a few months; I'm not sure
>>> about other Hive users.
>>>
>>> JVS
>>>
>>>
>>
>>
> IMHO...

Trunk has a lot of features that 0.5.0 does not have. All (most) of the
development for Hive happens on the trunk. Trunk changes 2-3 times a week, so
it is a moving target.



Hive is all userspace code; anyone who understands that can have 100
different versions of Hive in their home directory configured to the same
metastore and HDFS.

I currently have the latest Hive 0.5.0 release installed on the system path.
/usr/bin/hive6  -> another Hive install (trunk)

This gives me the best of both worlds. Users can pick and choose the Hive
they want to run with. I am really not caught up in releases. They are good
things, but in general I cannot wait for them.

Edward

Re: SerDe and Rows

Posted by Sanjit Jhala <sj...@gmail.com>.
Thanks, that sounds great! Would love to come to the meetup. Thanks for all
the work on the Storage Handlers, it's really nifty stuff.
I'm getting close on the Hypertable storage handler and will definitely send
out pointers once it's ready.

-Sanjit

On Wed, Jun 2, 2010 at 11:56 AM, John Sichi <js...@facebook.com> wrote:

> Based on some recent offline discussions, it looks like Cloudera will be
> taking the lead on driving the release process for 0.6, so expect to see
> some initial plans on that here soon.
>
> We're thinking of classifying new features and frameworks as stable vs
> experimental.  For 0.6, items like storage handlers will definitely be
> classified as experimental, meaning they'll be there in the code, but
> expected to continue to evolve with breaking changes until they are declared
> stable in a subsequent release.
>
> We would also like to start holding monthly Hive developer meetups; it will
> be great if someone from hypertable can attend those--it's heartening to see
> so much interest in building up a storage handler ecosystem.
>
> I think the snapshot you reference is fine for trunk development work.
>
> Regarding thrift, here's info on the version currently being used:
>
> http://wiki.apache.org/hadoop/Hive/HowToContribute#Generating_Code
>
> JVS
>
> On Jun 2, 2010, at 11:19 AM, Sanjit Jhala wrote:
>
> Any idea when the next Hive release is scheduled for and whether the
> Storage Handler code will be included ?
>
> Also I'm currently using a snapshot from the trunk at commit:
>
> *commit bf7e3b9cc6c6ceced2dec70f0971ecc91fd0dcb3*
> *Author: Namit Jain <na...@apache.org>
> Date:   Thu May 6 19:05:52 2010 +0000
>
>     HIVE-1317. CombineHiveInputFormat throws exception when partition name
> contains special characte
>     (Ning Zhang via namit)
>
>     git-svn-id: https://svn.apache.org/repos/asf/hadoop/hive/trunk@941860 13f79535-47bb-0310-9956-ff
> *
>
>
> Is this a reasonably stable commit or would you suggest another ? Also how
> do I figure out the corresponding Thrift version ?
>
> -Sanjit
>
>
>
>
> On Tue, Jun 1, 2010 at 5:36 PM, John Sichi <js...@facebook.com> wrote:
>
>> On Jun 1, 2010, at 4:45 PM, Sanjit Jhala wrote:
>>
>> > That looks cool. On a different note, it looks like the
>> HiveStorageHandler is based on the old Hadoop "mapred" interface. Any idea
>> when you plan to migrate to the "mapreduce" interface?
>>
>>
>> This one would be painful to do with shims, so I think it has to wait
>> until we drop support entirely for pre-0.20 Hadoop versions on Hive trunk.
>>  For Facebook, we may be ready for that within a few months; I'm not sure
>> about other Hive users.
>>
>> JVS
>>
>>
>
>

Re: SerDe and Rows

Posted by John Sichi <js...@facebook.com>.
Based on some recent offline discussions, it looks like Cloudera will be taking the lead on driving the release process for 0.6, so expect to see some initial plans on that here soon.

We're thinking of classifying new features and frameworks as stable vs experimental.  For 0.6, items like storage handlers will definitely be classified as experimental, meaning they'll be there in the code, but expected to continue to evolve with breaking changes until they are declared stable in a subsequent release.

We would also like to start holding monthly Hive developer meetups; it will be great if someone from Hypertable can attend those--it's heartening to see so much interest in building up a storage handler ecosystem.

I think the snapshot you reference is fine for trunk development work.

Regarding thrift, here's info on the version currently being used:

http://wiki.apache.org/hadoop/Hive/HowToContribute#Generating_Code

JVS

On Jun 2, 2010, at 11:19 AM, Sanjit Jhala wrote:

Any idea when the next Hive release is scheduled for and whether the Storage Handler code will be included?

Also I'm currently using a snapshot from the trunk at commit:

commit bf7e3b9cc6c6ceced2dec70f0971ecc91fd0dcb3
Author: Namit Jain <na...@apache.org>
Date:   Thu May 6 19:05:52 2010 +0000

    HIVE-1317. CombineHiveInputFormat throws exception when partition name contains special characte
    (Ning Zhang via namit)

    git-svn-id: https://svn.apache.org/repos/asf/hadoop/hive/trunk@941860 13f79535-47bb-0310-9956-ff


Is this a reasonably stable commit or would you suggest another? Also how do I figure out the corresponding Thrift version?

-Sanjit




On Tue, Jun 1, 2010 at 5:36 PM, John Sichi <js...@facebook.com> wrote:
On Jun 1, 2010, at 4:45 PM, Sanjit Jhala wrote:

> That looks cool. On a different note, it looks like the HiveStorageHandler is based on the old Hadoop "mapred" interface. Any idea when you plan to migrate to the "mapreduce" interface?


This one would be painful to do with shims, so I think it has to wait until we drop support entirely for pre-0.20 Hadoop versions on Hive trunk.  For Facebook, we may be ready for that within a few months; I'm not sure about other Hive users.

JVS




Re: SerDe and Rows

Posted by Sanjit Jhala <sj...@gmail.com>.
Any idea when the next Hive release is scheduled for and whether the Storage
Handler code will be included?

Also I'm currently using a snapshot from the trunk at commit:

*commit bf7e3b9cc6c6ceced2dec70f0971ecc91fd0dcb3*
*Author: Namit Jain <na...@apache.org>
Date:   Thu May 6 19:05:52 2010 +0000

    HIVE-1317. CombineHiveInputFormat throws exception when partition name
contains special characte
    (Ning Zhang via namit)

    git-svn-id:
https://svn.apache.org/repos/asf/hadoop/hive/trunk@941860 13f79535-47bb-0310-9956-ff
*


Is this a reasonably stable commit or would you suggest another? Also how
do I figure out the corresponding Thrift version?

-Sanjit




On Tue, Jun 1, 2010 at 5:36 PM, John Sichi <js...@facebook.com> wrote:

> On Jun 1, 2010, at 4:45 PM, Sanjit Jhala wrote:
>
> > That looks cool. On a different note, it looks like the
> HiveStorageHandler is based on the old Hadoop "mapred" interface. Any idea
> when you plan to migrate to the "mapreduce" interface?
>
>
> This one would be painful to do with shims, so I think it has to wait until
> we drop support entirely for pre-0.20 Hadoop versions on Hive trunk.  For
> Facebook, we may be ready for that within a few months; I'm not sure about
> other Hive users.
>
> JVS
>
>

Re: SerDe and Rows

Posted by John Sichi <js...@facebook.com>.
On Jun 1, 2010, at 4:45 PM, Sanjit Jhala wrote:

> That looks cool. On a different note, it looks like the HiveStorageHandler is based on the old Hadoop "mapred" interface. Any idea when you plan to migrate to the "mapreduce" interface?


This one would be painful to do with shims, so I think it has to wait until we drop support entirely for pre-0.20 Hadoop versions on Hive trunk.  For Facebook, we may be ready for that within a few months; I'm not sure about other Hive users. 

JVS


Re: SerDe and Rows

Posted by Sanjit Jhala <sj...@gmail.com>.
That looks cool. On a different note, it looks like the HiveStorageHandler
is based on the old Hadoop "mapred" interface. Any idea when you plan to
migrate to the "mapreduce" interface?

-Sanjit

On Tue, Jun 1, 2010 at 12:26 PM, John Sichi <js...@facebook.com> wrote:

> On May 28, 2010, at 3:49 PM, Sanjit Jhala wrote:
>
> > John, there's some logic in the helper serialize method to serialize lists
> and structs. Is this used currently? I was under the impression that maps
> and primitives are the only types currently supported by the connector.
>
>
> Yes, this logic is working.  I just now tested it interactively (see below)
> and will add a corresponding unit test when I work on HIVE-1245.
>
> I'm not sure what is going on with the JSON-vs-delimited stuff; in my test
> it looks like it is coming out as delimited based on what I see from the
> HBase side.  There is a setUseJSONSerialize method but currently nothing
> invokes it; it would make sense to include this in the HIVE-1245 work as
> part of controlling how values are stored within HBase.
>
> JVS
>
> ----
>
> hive> CREATE TABLE complex(
>    >     key string,
>    >     a array<string>,
>    >     s struct<col1 : int, col2 : int>)
>    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>    > WITH SERDEPROPERTIES (
>    > "hbase.columns.mapping" = "cf:a, cf:s"
>    > );
> OK
> hive>
>    > INSERT OVERWRITE TABLE complex
>    > SELECT bar, array('x', 'y', 'z'), struct(100, 200)
>    > FROM pokes
>    > WHERE foo=497;
> ...
> OK
> hive>
>    > SELECT * FROM complex;
> OK
> val_497 ["x","y","z"]   {"col1":100,"col2":200}
>
> hbase(main):003:0> scan 'complex'
> ROW                          COLUMN+CELL
>  val_497                     column=cf:s, timestamp=1275419258650,
> value=100\x02200
>  val_497                     column=cf:a, timestamp=1275419258650,
> value=x\x02y\x02z
> 1 row(s) in 1.0250 seconds
>
>