Posted to user@hive.apache.org by Saurabh Nanda <sa...@gmail.com> on 2009/07/28 12:41:29 UTC

UPDATE statement in Hive?

Is there an UPDATE statement in Hive? If not, are there any plans for adding
support for it in the future?

This is why I ask: I want to maintain a table which, against each user ID,
stores the first visit & last visit time. This is across the entire year,
not a day -- basically to understand how many visitors we got in the last
1/3/6 months, etc.

I can add new users into a separate partition to get around the limitation
of not being able to append rows to a table. However, I don't know how to
update the last_visited_at column for each user.

Is this best achieved by storing this table outside of Hive in a traditional
RDBMS? That is, use JDBC to query Hive for a list of today's distinct visitors
and, based on that list, update the 'external' table.
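
To make the idea concrete, here is a rough sketch of the external-table update, not a tested design. It assumes the list of today's distinct visitors has already been fetched from Hive; the table name user_visits, the column first_visited_at, and the function record_visits are made up for illustration, and SQLite stands in for whatever RDBMS would actually be used.

```python
import sqlite3

# Stand-in for the external RDBMS; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_visits ("
    "user_id TEXT PRIMARY KEY, first_visited_at TEXT, last_visited_at TEXT)"
)
conn.execute("INSERT INTO user_visits VALUES ('u1', '2009-01-05', '2009-07-01')")


def record_visits(conn, visitor_ids, day):
    """Apply one day's distinct visitors (as fetched from Hive) to the table."""
    for user_id in visitor_ids:
        # New users get a row with first == last; existing rows are left alone.
        conn.execute(
            "INSERT OR IGNORE INTO user_visits VALUES (?, ?, ?)",
            (user_id, day, day),
        )
        # Existing users only have last_visited_at moved forward, never back.
        conn.execute(
            "UPDATE user_visits SET last_visited_at = ? "
            "WHERE user_id = ? AND last_visited_at < ?",
            (day, user_id, day),
        )
    conn.commit()


# Hypothetical result of a "distinct visitors today" query run against Hive.
todays_visitors = ["u1", "u2"]
record_visits(conn, todays_visitors, "2009-07-28")
rows = conn.execute("SELECT * FROM user_visits ORDER BY user_id").fetchall()
```

Dates are stored as ISO strings so that plain text comparison orders them correctly.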

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: UPDATE statement in Hive?

Posted by Saurabh Nanda <sa...@gmail.com>.

Sorry for the newbie questions here, but how is this going to work? Using
'normal' Hive queries, will I be able to read from & write to an HBase
datastore? From within the Hive CLI?

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: UPDATE statement in Hive?

Posted by Abhijit Pol <ap...@rocketfuelinc.com>.

+1, if more support is needed for this feature. I think this will be a very
powerful and useful addition to Hive.

2009/7/28 He Yongqiang <he...@software.ict.ac.cn>:
> Talked with Samuel Guo, and I am sure he will work on it soon.

Re: UPDATE statement in Hive?

Posted by He Yongqiang <he...@software.ict.ac.cn>.

Talked with Samuel Guo, and I am sure he will work on it soon.

On 09-7-29 10:15 AM, "Ashish Thusoo" <at...@facebook.com> wrote:

> That would be great, Yongqiang.
>
> Amr, we don't have that kind of support but would love to add it.
>
> Ashish

RE: UPDATE statement in Hive?

Posted by Ashish Thusoo <at...@facebook.com>.

That would be great, Yongqiang.

Amr, we don't have that kind of support but would love to add it.

Ashish

________________________________
From: He Yongqiang [mailto:heyongqiang@software.ict.ac.cn]
Sent: Tuesday, July 28, 2009 7:03 PM
To: hive-user@hadoop.apache.org
Subject: Re: UPDATE statement in Hive?

The patch contributor of https://issues.apache.org/jira/browse/PIG-6 is a student here in our institute, but in another laboratory.
If Hive is interested in this, I will get in touch with him to see if he would like to make a similar contribution to Hive.

Re: UPDATE statement in Hive?

Posted by He Yongqiang <he...@software.ict.ac.cn>.

The patch contributor of https://issues.apache.org/jira/browse/PIG-6 is a
student here in our institute, but in another laboratory.
If Hive is interested in this, I will get in touch with him to see if he
would like to make a similar contribution to Hive.

On 09-7-29 8:10 AM, "Peter Skomoroch" <pe...@gmail.com> wrote:

> +1 for Hive queries on HBase - that would be a powerful combination.

Re: UPDATE statement in Hive?

Posted by Peter Skomoroch <pe...@gmail.com>.

+1 for Hive queries on HBase - that would be a powerful combination.

On Tue, Jul 28, 2009 at 8:05 PM, Amr Awadallah <aa...@cloudera.com> wrote:

> Saurabh, I think you're better off with HBase for this kind of use, see:
>
> http://hadoop.apache.org/hbase/
>
> In a nutshell, HBase is a layer on top of HDFS which supports two things:
> (1) quick lookups based on keys (e.g. a userid), and (2) transaction
> semantics at the row-level (update/delete/insert values for a given key).
>
> Ashish, is there any way to run Hive queries on top of HBase? Pig has
> support for that via this patch:
>
> https://issues.apache.org/jira/browse/PIG-6
>
> -- amr


-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: UPDATE statement in Hive?

Posted by Amr Awadallah <aa...@cloudera.com>.

Saurabh, I think you're better off with HBase for this kind of use, see:

http://hadoop.apache.org/hbase/

In a nutshell, HBase is a layer on top of HDFS which supports two 
things: (1) quick lookups based on keys (e.g. a userid), and (2) 
transaction semantics at the row-level (update/delete/insert values for 
a given key).

Ashish, is there any way to run Hive queries on top of HBase? Pig has 
support for that via this patch:

https://issues.apache.org/jira/browse/PIG-6

-- amr

RE: UPDATE statement in Hive?

Posted by Ashish Thusoo <at...@facebook.com>.

There is no UPDATE statement at this time. Because files in Hadoop cannot be updated in place, an UPDATE in Hive, though possible, would just be syntactic sugar for merging the new values into the old data in the table and then rewriting the table with the merged output. This can be achieved by doing an INSERT OVERWRITE on the old table from the results of a merge done by a left outer join of the old table with the new data staged in another table. Also note that while you are updating the table, queries currently running on it may fail.
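
As a rough illustration of the merge-and-rewrite idea, here is a small in-memory sketch in Python rather than actual Hive. The names merge_left_outer, old_table and staged are hypothetical, and the column layout (first visit, last visit per user ID) is assumed from the original question.

```python
from typing import Dict, Optional, Tuple

# user_id -> (first_visit, last_visit); ISO date strings compare correctly as text.
Visits = Dict[str, Tuple[str, str]]


def merge_left_outer(old: Visits, staged_last_visit: Dict[str, str]) -> Visits:
    """Emulate old LEFT OUTER JOIN staged, producing the full replacement table."""
    merged: Visits = {}
    for user_id, (first_visit, last_visit) in old.items():
        # None plays the role of the NULL a left outer join yields on no match.
        new_last: Optional[str] = staged_last_visit.get(user_id)
        if new_last is not None and new_last > last_visit:
            last_visit = new_last
        merged[user_id] = (first_visit, last_visit)
    return merged


old_table = {
    "u1": ("2009-01-05", "2009-07-01"),
    "u2": ("2009-03-10", "2009-06-15"),
}
staged = {"u1": "2009-07-28", "u3": "2009-07-28"}

# "Insert overwrite": the old table is replaced wholesale by the merged result.
old_table = merge_left_outer(old_table, staged)
```

Note that a left outer join from the old table keeps every existing user but drops staged users with no existing row (u3 here), so new users would still have to be added separately, for example through their own partition.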

Another option is to change your schema so that the table actually contains the changes to each row instead of the row values themselves, and then change your queries to take the new schema into account.
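
One possible reading of this second option, sketched in Python under assumptions: the table becomes an append-only log of (user ID, visit day) rows, and the first and last visit are derived at query time with a group-by style aggregation. The function name first_and_last and the sample rows are made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def first_and_last(visits: Iterable[Tuple[str, str]]) -> Dict[str, Tuple[str, str]]:
    """Per user, take the minimum and maximum visit day, like a GROUP BY would."""
    summary: Dict[str, Tuple[str, str]] = {}
    for user_id, day in visits:
        if user_id in summary:
            first, last = summary[user_id]
            summary[user_id] = (min(first, day), max(last, day))
        else:
            summary[user_id] = (day, day)
    return summary


# Append-only log: each day only ever adds rows, nothing is rewritten.
visit_log = [
    ("u1", "2009-01-05"),
    ("u2", "2009-03-10"),
    ("u1", "2009-07-28"),
    ("u2", "2009-06-15"),
]
summary = first_and_last(visit_log)
```

The trade-off is that nothing ever needs updating, but every query has to scan and aggregate the log.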

Ashish
