
Posted to user@hive.apache.org by Jay Ramadorai <jr...@tripadvisor.com> on 2011/01/26 19:52:58 UTC

Hive Concurrency Model - does it work?

https://issues.apache.org/jira/browse/HIVE-1293 : Is this JIRA truly fixed and included in 0.7.0?
If so, can the patch be applied separately on top of 0.5.0 or 0.6.0?
Are there instructions somewhere for how to enable/integrate Zookeeper with Hive for this patch to work?
The JIRA comments indicate the patch was tested and committed. However, the wiki page that the JIRA points to, http://wiki.apache.org/hadoop/Hive/Locking, implies concurrency will not be supported. Hence the confusion.
Is there a simple way in Hive to query which tables are currently being accessed?

More detail:
What I'm trying to do is run daily Sqoop imports into Hive from an external database. There are jobs running on the Hive warehouse much of the time. I import the data into temporary tables in Hive and then want to drop the permanent tables and rename the (just-imported) temporary ones to the permanent names WITHOUT IMPACTING THE JOBS. At the moment, of course, doing an ALTER TABLE RENAME causes any running jobs accessing the table to die on the next fetch. So I thought that if the above JIRA was indeed fixed, 0.7.0 should allow the job to complete before the rename gets its X lock, or, if the rename is in progress, the job won't get its S lock until the rename is done. However, our test on 0.7.0 trunk (pulled in late September) shows that the rename happens instantly even with a query accessing the table, without waiting for any locks.

Barring this patch, are there any other ideas anyone can suggest for accomplishing what I want? Some ideas we have considered:
- Parse Hive logs/xml files looking for a tablename to determine if there is a job currently accessing the table. If not, then rename.
- Create views on temporary tables named by day. Have jobs go against the views. When we are ready to rename, replace the view so that it points to today's new table. The key question here is: is the view metadata consulted only at query startup, or is it looked at repeatedly during query execution? If only at startup, we might be able to get away with this trick until concurrency truly works.

Thanks
Jay

Re: Hive Concurrency Model - does it work?

Posted by John Sichi <js...@fb.com>.

On Jan 26, 2011, at 10:52 AM, Jay Ramadorai wrote:
> - Create views on temporary tables named by day. Have jobs go against the views. When we are ready to rename, replace the view so that it points to today's new table. The key question here is: is the view metadata consulted only at query startup, or is it looked at repeatedly during query execution? If only at startup, we might be able to get away with this trick until concurrency truly works.

View metadata is consulted only while the query is being compiled, not during execution.
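
A minimal sketch of the view swap under that assumption. All table and view names here are hypothetical, and the drop-and-recreate form is used only because it does not rely on any newer view-replacement syntax:

```sql
-- Jobs always query the view, never the dated tables directly.
-- 'orders' is the stable name; 'orders_20110125' and 'orders_20110126'
-- are hypothetical daily import tables.
CREATE VIEW orders AS SELECT * FROM orders_20110125;

-- After today's Sqoop import finishes, repoint the view.
DROP VIEW orders;
CREATE VIEW orders AS SELECT * FROM orders_20110126;
```

Queries compiled before the swap keep reading orders_20110125, so that table should only be dropped once those queries have finished. There is also a brief window between DROP VIEW and CREATE VIEW in which a newly submitted query would fail to compile.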

JVS

Re: Hive Concurrency Model - does it work?

Posted by Namit Jain <nj...@fb.com>.

The patch for HIVE-1293 has been committed.


https://issues.apache.org/jira/browse/HIVE-1865 was a follow-up patch which should help concurrency.
I have not tried backporting the patch to Hive 0.5 or Hive 0.6, but I don’t think it will work, since the code
has changed significantly and a number of bug fixes to update the inputs and outputs went in.

By default, concurrency is disabled. To enable it, set hive.support.concurrency to true.
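
As a rough sketch, this is what that could look like in a Hive session. The ZooKeeper host names are placeholders, and the assumption that locks are kept in ZooKeeper and located through hive.zookeeper.quorum comes from the HIVE-1293 design rather than from this thread:

```sql
-- Enable the lock manager for this session (it is off by default).
SET hive.support.concurrency=true;
-- Assumed: the ZooKeeper ensemble used to store locks; host names are hypothetical.
SET hive.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com;

-- List the locks currently held; 'permanent_table' is a hypothetical table name.
SHOW LOCKS;
SHOW LOCKS permanent_table;
```

These settings would more typically go in hive-site.xml so that every client picks them up. SHOW LOCKS may also help with the question of which tables are currently being accessed, provided the running queries were themselves started with concurrency enabled.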


Thanks,
-namit

