Posted to user@hive.apache.org by Eva Tse <et...@netflix.com> on 2009/09/09 07:18:17 UTC

Files does not exist error: concurrency control on hive queries...

We are planning to start enabling ad-hoc querying on our hive warehouse and
we tested some of the concurrent queries and found the following issue:

Query 1 – doing ‘insert overwrite table yyy .... partition (dateint = xxx)
select ... from yyy where dateint = xxx’. This is done to merge small files
within a partition in table yyy.
Query 2 – doing a select on the same table, joining another table.

What we found is that query 2 would fail with the following exceptions in
multiple reducers. 
java.io.FileNotFoundException: File does not exist:
hdfs://ip-10-251-98-80.ec2.internal:9000/user/hive/dataeng/warehouse/nccp_session_facts/dateint=20090908/hour=9/sessionsFacts_P20090909T021823L20090908T09-r-00006
 at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
 at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:671)
 at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
 at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
 at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
 at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:63)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:236)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:336)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)

Is this expected? If so, is there a jira, or is it planned to be addressed?
We are trying to think of a workaround, but haven’t thought of a good one, as
swapping of files would ideally be handled inside Hive.

Please let us know your feedback.

Thanks,
Eva.
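The failure pattern above can be reproduced with plain local files, outside Hadoop entirely. The sketch below (Python, with made-up file names) mimics a reader that snapshots a partition's file listing at plan time while a concurrent "insert overwrite" deletes the old small files and writes a merged one before the reader opens them:

```python
import os
import tempfile

# Simulate the reported race with plain local files (hypothetical names;
# HDFS itself is not involved).
part = tempfile.mkdtemp(prefix="dateint=20090908_")
for i in range(3):
    with open(os.path.join(part, f"part-r-0000{i}"), "w") as f:
        f.write("row\n")

planned = sorted(os.listdir(part))          # query 2: plan-time snapshot

# query 1: merge the small files, then overwrite the partition directory
merged = ""
for p in planned:
    with open(os.path.join(part, p)) as f:
        merged += f.read()
for p in planned:
    os.remove(os.path.join(part, p))
with open(os.path.join(part, "merged-r-00000"), "w") as f:
    f.write(merged)

# query 2's task now tries to read a file from its stale plan and fails,
# just like the FileNotFoundException in the stack trace above
try:
    open(os.path.join(part, planned[0]))
except FileNotFoundError as e:
    print("reader failed:", e)
```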

Re: Files does not exist error: concurrency control on hive queries...

Posted by Cliff Resnick <cr...@proclivitysystems.com>.
We found this error had to do with the Hive Query plan getting stepped
on because of some shared state in
org.apache.hadoop.hive.ql.exec.Utilities.

I attached a patch that fixed this for us to HIVE-80.

-Cliff
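The kind of bug Cliff describes can be sketched generically. The following toy (not Hive's actual code; all names are invented) shows how a process-wide static field holding "the current plan" lets two concurrent queries step on each other:

```python
import threading

# Schematic illustration of shared mutable static state: whichever query
# writes the shared field last wins for BOTH readers.
class Utilities:
    current_plan = None                     # shared across all queries

    @classmethod
    def set_plan(cls, plan):
        cls.current_plan = plan

    @classmethod
    def get_plan(cls):
        return cls.current_plan

results = {}

def run_query(name, plan, barrier):
    Utilities.set_plan(plan)
    barrier.wait()                          # both queries set before either reads
    results[name] = Utilities.get_plan()

barrier = threading.Barrier(2)
t1 = threading.Thread(target=run_query, args=("q1", "plan-for-q1", barrier))
t2 = threading.Thread(target=run_query, args=("q2", "plan-for-q2", barrier))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)   # at least one query sees the other's plan
```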

On 09/09/2009 01:29 PM, Prasad Chakka wrote:
> The first query will not return unless it copied the files to the dest
> directory and this operation is atomic (FileSystem.rename() guarantees
> that). Since second query is not executed until the first query
> returns, this problem may be due to a bug in HDFS (highly unlikely), an
> issue with HDFS configuration, or something related to EC2.
>
> The second query knows the file name
> ‘sessionsFacts_P20090909T021823L20090908T09-r-00006’ so Hive client
> was able to successfully call getFileStatus() on it, but the
> mapper (of the second query) is not able to do the same thing. So either
> this file was deleted after the Hive client accessed it but
> before the mapper accessed it, or the machine on which the mapper is
> being executed can’t see this file. Can you manually check whether the
> file exists at all after the job fails?
>
> Prasad
>
>
> ------------------------------------------------------------------------
> *From: *Eva Tse <et...@netflix.com>
> *Reply-To: *<hi...@hadoop.apache.org>
> *Date: *Wed, 9 Sep 2009 10:19:24 -0700
> *To: *<hi...@hadoop.apache.org>
> *Subject: *Re: Files does not exist error: concurrency control on hive
> queries...
>
>
> Prasad,
> We believe the problem is that one of the queries is doing an ‘insert
> overwrite ... select from’, which actually deletes and merges the
> small files. The other query somehow couldn’t find files that it
> thought it had seen before, and failed. So it looks like a concurrency
> issue.
>
> Yongqiang,
> Could you elaborate a bit on why you say this is not a bug?
>
> Thanks,
> Eva.
>
>
> On 9/9/09 9:55 AM, "Prasad Chakka" <pc...@facebook.com> wrote:
>
>     If a certain input file/dir does not exist then the job can’t be
>     submitted. Since only a few reducers are failing, the problem
>     could be something else.
>     Eva, does the same job succeed on a second try? I.e., is the
>     file/dir available eventually? What is the replication factor?
>
>     Prasad
>
>
>     ------------------------------------------------------------------------
>     *From: *Yongqiang He <he...@software.ict.ac.cn>
>     *Reply-To: *<hi...@hadoop.apache.org>
>     *Date: *Wed, 9 Sep 2009 04:07:31 -0700
>     *To: *<hi...@hadoop.apache.org>
>     *Subject: *Re: Files does not exist error: concurrency control on
>     hive queries...
>
>     Hi Eva,
>     After a close look at the code, I think this is not a bug. We need
>     to find out how to avoid it.
>
>     Thanks,
>     Yongqiang
>     On 09-9-9 下午1:31, "He Yongqiang"
>     <he...@software.ict.ac.cn> wrote:
>
>         Hi Eva,
>         Can you open a new jira for this? And let’s discuss and
>         resolve this issue.
>         I guess this is because the partition metadata is added before
>         the data is available.
>
>         Thanks
>         Yongqiang

Re: Files does not exist error: concurrency control on hive queries...

Posted by Eva Tse <et...@netflix.com>.
Doing versioning would work for this scenario. It essentially achieves the
same thing.
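For illustration, the versioning idea might look roughly like the sketch below (the directory layout and names are invented here, not what HIVE-718 specifies): each overwrite writes a fresh versioned directory and atomically repoints a "current" marker, so a reader that resolved v1 at plan time keeps reading v1 even after v2 becomes current.

```python
import os
import tempfile

root = tempfile.mkdtemp(prefix="partition_")

def publish(version, rows):
    # write a complete new version, then atomically flip the pointer
    d = os.path.join(root, version)
    os.mkdir(d)
    with open(os.path.join(d, "data"), "w") as f:
        f.write(rows)
    tmp = os.path.join(root, ".current.tmp")
    with open(tmp, "w") as f:
        f.write(version)
    os.replace(tmp, os.path.join(root, "current"))   # atomic on POSIX

def resolve():
    with open(os.path.join(root, "current")) as f:
        return f.read()

publish("v1", "small files\n")
reader_version = resolve()                # reader pins v1 at plan time
publish("v2", "merged\n")                 # concurrent overwrite
# the pinned reader still sees its snapshot; new readers see v2
with open(os.path.join(root, reader_version, "data")) as f:
    old = f.read()
with open(os.path.join(root, resolve(), "data")) as f:
    new = f.read()
```

Garbage collection of old versions (here, deleting v1 once no reader is pinned to it) is the part that still needs some form of coordination.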


On 9/11/09 2:39 AM, "Ashish Thusoo" <at...@facebook.com> wrote:

> Another option is to deal with this using versioning. Some ideas on this are
> at
> 
> https://issues.apache.org/jira/browse/HIVE-718
> 
> Ashish
> ________________________________________
> From: Eva Tse [etse@netflix.com]
> Sent: Wednesday, September 09, 2009 10:45 PM
> To: hive-user@hadoop.apache.org
> Subject: Re: Files does not exist error: concurrency control on hive
> queries...
> 
> Zookeeper sounds like a decent alternative, though it would add a new
> dependency for deployment.
> Maybe we could open a jira for it first to track this issue?
> Thanks,
> Eva.
> 
> 
> On 9/9/09 2:49 PM, "Prasad Chakka" <pc...@facebook.com> wrote:
> 
> Yeah, the metastore DB is a logical place to do locking, but there have to be
> periodic cleanups (when clients die without releasing locks), etc., which is
> hacky and so less preferable. Another option is to point a ZooKeeper cluster at
> Hive and ask Hive to use it for locking. So those who are not concerned about
> concurrency control don’t have to install ZooKeeper, but others can. ZooKeeper
> provides leases, so there won’t be any problem of hanging locks, and it will be
> easier for admins to clean up.
> 
> I suppose it depends on whoever wants to take this task up :)
> 
> Prasad
> 
> 
> ________________________________
> From: Eva Tse <et...@netflix.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 9 Sep 2009 14:32:20 -0700
> To: <hi...@hadoop.apache.org>
> Subject: Re: Files does not exist error: concurrency control on hive
> queries...
> 
> 
> Regardless of whether the user uses a HiveServer, it looks like the logical place
> to do locking or concurrency control would be the metastore DB. This is
> actually one big advantage of Hive. The r/w lock or access control can be
> achieved by a DB row with a lock count for each partition, etc. This might be
> over-simplifying it, but the metastore DB seems to be the ideal candidate.
> Thoughts?
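As a rough illustration of the DB-row idea above (the schema and logic are invented here, not an actual metastore design), a reader count plus a writer flag per partition could look like this, using SQLite as a stand-in for the metastore DB:

```python
import sqlite3

# One row per partition: readers increment a count; a writer may proceed
# only when no readers hold the row and no other writer does.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE part_locks (part TEXT PRIMARY KEY,"
           " readers INTEGER DEFAULT 0, writer INTEGER DEFAULT 0)")
db.execute("INSERT INTO part_locks (part) VALUES ('dateint=20090908')")

def acquire_read(part):
    cur = db.execute("UPDATE part_locks SET readers = readers + 1 "
                     "WHERE part = ? AND writer = 0", (part,))
    return cur.rowcount == 1            # conditional UPDATE acts as the lock

def acquire_write(part):
    cur = db.execute("UPDATE part_locks SET writer = 1 "
                     "WHERE part = ? AND readers = 0 AND writer = 0", (part,))
    return cur.rowcount == 1

def release_read(part):
    db.execute("UPDATE part_locks SET readers = readers - 1 "
               "WHERE part = ?", (part,))

p = "dateint=20090908"
got_read = acquire_read(p)          # query 2 starts reading
blocked = not acquire_write(p)      # query 1's overwrite must wait
release_read(p)
got_write = acquire_write(p)        # now the overwrite may proceed
```

The cleanup problem Prasad mentions is visible here: if a reader crashes before release_read, the count never drains and writers starve.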
> 
> 
> On 9/9/09 12:52 PM, "Prasad Chakka" <pc...@facebook.com> wrote:
> 
> I thought your script ran the two jobs sequentially. If these two queries are
> run in parallel, then the error can be expected, since Hive doesn’t try to
> acquire locks before reading or writing. I don’t think there are any plans to
> support this kind of locking (it can only be done if all queries go through
> the HiveServer; otherwise a lot of orphaned locks will bring the system to a
> halt). I think you should do some kind of locking (possibly with HDFS files)
> to prevent the queries from being executed simultaneously.
> 
> Any other ideas?
> 
> Prasad
> 
> 
> ________________________________
> From: Eva Tse <et...@netflix.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 9 Sep 2009 12:36:11 -0700
> To: <hi...@hadoop.apache.org>, Dhruba Borthakur <dh...@facebook.com>
> Subject: Re: Files does not exist error: concurrency control on hive
> queries...
> 
> Hi Prasad,
> 
> Are you implying that these queries are expected to be run
> sequentially by Hive, because one is r/w and one is read-only?
> 
> For clarification, these two queries are running concurrently in two separate
> jobs, as below.
> 
> Query 1 is run within a job that does the following essentially:
> For every hour:
>    - parse log files to generate completed sessions information
>    - load completed sessions into 48 partitions (for the prior 48 hours)
>    - merge small files using ‘insert overwrite ... select from’ on every other
> 8 partitions. Essentially, we would issue 6 separate queries to merge 6
> partitions at the same time, not sequentially. (We do this to minimize time
> required.) And this is query 1.
> 
> Query 2 is run within another job that does a select on 24 partitions (aka daily
> summary) for the previous day. This job just runs the query in a loop for
> testing purposes.
> 
> The error comes from query 2 saying ‘file not found’ for a file that we are
> merging in query 1 at that point in time.
> 
> We need to rerun the test to catch the failure at that time and see
> if the file was there at that instant. In the previous run, the merge query
> succeeded, so I would imagine the file is not there after the merge, and I am
> not sure whether the file was still there at the instant the failure happened.
> 
> Thanks for the help!
> Eva.
> 


RE: Files does not exist error: concurrency control on hive queries...

Posted by Ashish Thusoo <at...@facebook.com>.
Another option is to deal with this using versioning. Some ideas on this are at

https://issues.apache.org/jira/browse/HIVE-718

Ashish

Re: Files does not exist error: concurrency control on hive queries...

Posted by Edward Capriolo <ed...@gmail.com>.
As we all know, Hadoop is designed to be mostly lock-free for increased
parallelism. Hive follows suit and does not directly deal with locking.
What I think we are looking for here is a way to write-lock a
table/partition. Even though it COULD be a Hive-level task, it seems like
the solution we are going for is 'make it some other software's problem':
either a ZooKeeper-based 'Hive Lock Manager', or writing flag files in
HDFS and rolling your own 'Hive Lock Manager'.

I think this is also related to the couple of times we have discussed some
kind of 'Hive orchestration'. Locking, sequencing queries: all of this
goes together and is somewhat outside of Hive's (current) scope.

However, while it is outside of Hive's scope, it is critical to how people
use Hive. So we should cook up some type of best-practices / 'pet shop
logs' application.
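To make the flag-file option concrete, here is a minimal sketch of a write lock built on an atomically created flag file. This is not Hive code: the class and file names are invented, and it runs against the local filesystem purely for illustration. On HDFS the equivalent atomic primitive would be FileSystem.createNewFile().

```python
import os
import time


class FlagFileLock:
    """Sketch of a write lock based on an exclusive flag-file create.

    Illustration only (the local filesystem stands in for HDFS): the
    O_CREAT | O_EXCL open fails if the flag file already exists, so at
    most one writer can create it, which is what makes it usable as a
    crude lock.
    """

    def __init__(self, lock_path):
        self.lock_path = lock_path

    def acquire(self, timeout=10.0, poll=0.1):
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                fd = os.open(self.lock_path,
                             os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.close(fd)
                return True          # we created the flag file: lock held
            except FileExistsError:
                time.sleep(poll)     # someone else holds it; retry
        return False

    def release(self):
        os.remove(self.lock_path)
```

The obvious weakness, as noted above, is that a client that dies while holding the lock leaves the flag file behind, so some external cleanup is still needed.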

Re: Files does not exist error: concurrency control on hive queries...

Posted by Eva Tse <et...@netflix.com>.
ZooKeeper sounds like a decent alternative, though it would add a new
dependency for deployment.
Maybe we could open a JIRA for it first to track this issue?
Thanks,
Eva.


On 9/9/09 2:49 PM, "Prasad Chakka" <pc...@facebook.com> wrote:

> Yeah, the metastore DB is the logical place to do locking, but there have to
> be periodic cleanups (when clients die without releasing locks) etc., which
> is hacky and so less preferable. Another option is to point Hive to a
> ZooKeeper cluster and ask Hive to use it for locking. Those who are not
> concerned about concurrency control don’t have to install ZooKeeper, but
> others can. ZooKeeper provides leases, so there won’t be any problem of
> hanging locks, and it will be easier for admins to clean up.
> 
> I suppose it depends on whoever wants to take this task up :)
> 
> Prasad
> 
> 
> 
> From: Eva Tse <et...@netflix.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 9 Sep 2009 14:32:20 -0700
> To: <hi...@hadoop.apache.org>
> Subject: Re: Files does not exist error: concurrency control on hive
> queries...
> 
> 
> Regardless of whether the user uses a HiveServer, it looks like the logical
> place to do locking or concurrency control would be the metastore DB. This
> is actually one big advantage of Hive. The r/w lock or access control could
> be achieved with a DB row holding a lock count for each partition, etc. This
> might be over-simplifying it, but the metastore DB seems to be the ideal
> candidate. Thoughts?
> 
> 
> On 9/9/09 12:52 PM, "Prasad Chakka" <pc...@facebook.com> wrote:
> 
>> I thought your script ran the two jobs sequentially. If these two queries
>> are run in parallel then the error can be expected, since Hive doesn’t try
>> to acquire locks before reading or writing. I don’t think there are any
>> plans to support this kind of locking (it could only be done if all queries
>> went through HiveServer; otherwise a lot of orphaned locks would bring the
>> system to a halt). I think you should do some kind of locking (possibly
>> with HDFS files) to prevent these queries from being executed
>> simultaneously.
>> 
>> Any other ideas?
>> 
>> Prasad
>> 
>> 
>> 
>> From: Eva Tse <et...@netflix.com>
>> Reply-To: <hi...@hadoop.apache.org>
>> Date: Wed, 9 Sep 2009 12:36:11 -0700
>> To: <hi...@hadoop.apache.org>, Dhruba Borthakur <dh...@facebook.com>
>> Subject: Re: Files does not exist error: concurrency control on hive
>> queries...
>> 
>> Hi Prasad,
>> 
>> Are you implying that the expected behavior is for Hive to run these
>> queries sequentially, because one is r/w and one is read-only?
>> 
>> For clarification, these two queries run concurrently in two separate
>> jobs, as below.
>> 
>>> Query 1 is run within a job that does the following essentially:
>>> For every hour:
>>>    - parse log files to generate completed sessions information
>>>    - load completed sessions into 48 partitions (for the prior 48 hours)
>>>    - merge small files using ‘insert overwrite ... select from’ on every
>>> other 8 partitions. Essentially, we would issue 6 separate queries to merge
>>> 6 partitions at the same time, not sequentially. (We do this to minimize
>>> time required.) And this is query 1.
>>> 
>>> Query 2 is run within another job that does a select on 24 partitions (aka
>>> a daily summary) for the previous day. This job just runs the query in a
>>> loop for testing purposes.
>> 
>> The error comes from query 2 saying ‘file not found’ for a file that we are
>> merging in query 1 at that point in time.
>> 
>> We need to rerun the test and catch the failure at that time to see if the
>> file was there at that instant. In the previous run, the merge query
>> succeeded, so I would imagine the file was not there after the merge. And I
>> am not sure whether the file was still there at the instant the failure
>> happened.
>> 
>> Thanks for the help!
>> Eva.
>> 
>> On 9/9/09 10:29 AM, "Prasad Chakka" <pc...@facebook.com> wrote:
>> 
>>> The first query will not return until it has copied the files to the dest
>>> directory, and this operation is atomic (FileSystem.rename() guarantees
>>> that). Since the second query is not executed until the first query
>>> returns, this problem may be due to a bug in HDFS (highly unlikely), an
>>> issue with HDFS configuration, or something related to EC2.
>>> 
>>> The second query knows the file name
>>> ‘sessionsFacts_P20090909T021823L20090908T09-r-00006’, so the Hive client
>>> was able to successfully call getFileStatus() on it, but the mapper (of
>>> the second query) is not able to do the same thing. So either this file
>>> was deleted after the Hive client accessed it but before the mapper
>>> accessed it, or the machine on which the mapper is executing can’t see
>>> this file. Can you manually check whether the file exists at all after the
>>> job fails?
>>> 
>>> Prasad
>>> 


Re: Files does not exist error: concurrency control on hive queries...

Posted by Prasad Chakka <pc...@facebook.com>.
Yeah, the metastore DB is the logical place to do locking, but there have to be periodic cleanups (when clients die without releasing locks) etc., which is hacky and so less preferable. Another option is to point Hive to a ZooKeeper cluster and ask Hive to use it for locking. Those who are not concerned about concurrency control don’t have to install ZooKeeper, but others can. ZooKeeper provides leases, so there won’t be any problem of hanging locks, and it will be easier for admins to clean up.

I suppose it depends on whoever wants to take this task up :)

Prasad
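For contrast, here is a rough sketch of what the DB-backed alternative with leases and a periodic sweeper might look like. The schema and function names are invented for illustration (SQLite stands in for the metastore database); the point is only that a holder that dies without unlocking is eventually freed by the lease expiry.

```python
import sqlite3
import time


def init(conn):
    # One row per held lock; 'expires' is the lease deadline.
    conn.execute("""CREATE TABLE IF NOT EXISTS hive_locks (
        part TEXT PRIMARY KEY, holder TEXT, expires REAL)""")


def sweep(conn, now):
    # The 'hacky' periodic cleanup: drop leases whose holders vanished.
    conn.execute("DELETE FROM hive_locks WHERE expires <= ?", (now,))


def try_lock(conn, part, holder, lease=30.0, now=None):
    now = time.time() if now is None else now
    sweep(conn, now)
    try:
        conn.execute("INSERT INTO hive_locks VALUES (?, ?, ?)",
                     (part, holder, now + lease))
        return True
    except sqlite3.IntegrityError:   # row already present: lock is held
        return False


def unlock(conn, part, holder):
    conn.execute("DELETE FROM hive_locks WHERE part = ? AND holder = ?",
                 (part, holder))
```

ZooKeeper's ephemeral nodes give the same expire-on-death behavior for free, which is the appeal of delegating the lock management to it.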




Re: Files does not exist error: concurrency control on hive queries...

Posted by Eva Tse <et...@netflix.com>.
Regardless of whether the user uses a HiveServer, it looks like the logical
place to do locking or concurrency control would be the metastore DB. This
is actually one big advantage of Hive. The r/w lock or access control could
be achieved with a DB row holding a lock count for each partition, etc.
This might be over-simplifying it, but the metastore DB seems to be the
ideal candidate. Thoughts?
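As a rough illustration of the lock-count idea, the sketch below keeps a reader count and a writer flag in a DB row per partition. Everything here is invented for illustration: SQLite stands in for the metastore DB, and in a real system each check-and-update would have to run inside a single transaction to avoid races.

```python
import sqlite3


def init(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS part_locks (
        part TEXT PRIMARY KEY,
        readers INTEGER NOT NULL,
        writer INTEGER NOT NULL)""")


def _row(conn, part):
    # Ensure a row exists for this partition, then read its state.
    conn.execute("INSERT OR IGNORE INTO part_locks VALUES (?, 0, 0)", (part,))
    return conn.execute(
        "SELECT readers, writer FROM part_locks WHERE part = ?",
        (part,)).fetchone()


def acquire_read(conn, part):
    readers, writer = _row(conn, part)
    if writer:             # an 'insert overwrite' holds the partition
        return False
    conn.execute("UPDATE part_locks SET readers = readers + 1 "
                 "WHERE part = ?", (part,))
    return True


def release_read(conn, part):
    conn.execute("UPDATE part_locks SET readers = readers - 1 "
                 "WHERE part = ?", (part,))


def acquire_write(conn, part):
    readers, writer = _row(conn, part)
    if readers or writer:  # must wait for all readers to drain
        return False
    conn.execute("UPDATE part_locks SET writer = 1 WHERE part = ?", (part,))
    return True


def release_write(conn, part):
    conn.execute("UPDATE part_locks SET writer = 0 WHERE part = ?", (part,))
```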




Re: Files does not exist error: concurrency control on hive queries...

Posted by Prasad Chakka <pc...@facebook.com>.
I thought your script ran the two jobs sequentially. If these two queries are run in parallel then the error can be expected, since Hive doesn’t try to acquire locks before reading or writing. I don’t think there are any plans to support this kind of locking (it could only be done if all queries went through HiveServer; otherwise a lot of orphaned locks would bring the system to a halt). I think you should do some kind of locking (possibly with HDFS files) to prevent these queries from being executed simultaneously.

Any other ideas?

Prasad









Re: Files does not exist error: concurrency control on hive queries...

Posted by Eva Tse <et...@netflix.com>.
Hi Prasad,

Are you implying that the expected behavior is for Hive to run these queries
sequentially, because one is read/write and the other is read-only?

For clarification, these two queries run concurrently in two separate jobs,
as described below.

Query 1 is run within a job that, essentially, does the following every hour:
   - parse log files to generate completed-session information
   - load the completed sessions into 48 partitions (for the prior 48 hours)
   - merge small files using ‘insert overwrite ... select from’ on every 8th
partition; that is, we issue 6 separate queries to merge 6 partitions at the
same time, not sequentially. (We do this to minimize the time required.) This
merge is query 1.

Query 2 is run within another job that does a select over 24 partitions (aka a
daily summary) for the previous day. This job just runs that query in a loop
for testing purposes.

The error comes from query 2: ‘file not found’ for a file that query 1 is
merging at that point in time.

We need to rerun the test and catch the failure as it happens to see whether
the file still exists at that instant. In the previous run the merge query
succeeded, so I would imagine the file is gone after the merge; I am just not
sure whether it was still there at the moment the failure occurred.

Thanks for the help!
Eva.

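As a stopgap until the file swap is handled inside Hive, a client-side retry is one conceivable workaround. Below is a minimal sketch of the idea, assuming the reader can simply re-run the read; the `retryRead` helper and its parameters are hypothetical, not part of any Hive or Hadoop API:

```java
import java.io.FileNotFoundException;
import java.util.concurrent.Callable;

public class RetryRead {
    // Retry a read whose input files may be replaced by a concurrent
    // "insert overwrite". The read must re-resolve its file list on
    // every attempt so a retry picks up the newly written files.
    public static <T> T retryRead(Callable<T> read, int attempts, long backoffMs)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return read.call();
            } catch (FileNotFoundException e) {   // input replaced mid-read: retry
                if (attempt >= attempts) throw e; // give up after the last attempt
                Thread.sleep(backoffMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated read that fails twice (as if the merge deleted the
        // files it had listed) and then succeeds on a fresh file list.
        String result = retryRead(() -> {
            if (++calls[0] < 3) throw new FileNotFoundException("part-r-00006");
            return "daily summary";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

This only papers over the race for idempotent reads, of course; it does not help if the job has already consumed some of the old files before they disappear.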


Re: Files does not exist error: concurrency control on hive queries...

Posted by Prasad Chakka <pc...@facebook.com>.
The first query will not return until it has copied the files to the destination directory, and that operation is atomic (FileSystem.rename() guarantees it). Since the second query is not executed until the first query returns, this problem may be due to a bug in HDFS (highly unlikely), an issue with the HDFS configuration, or something related to EC2.

The second query knows the file name ‘sessionsFacts_P20090909T021823L20090908T09-r-00006’, so the Hive client was able to call getFileStatus() on it successfully, but the mapper (of the second query) is not able to do the same thing. So either this file was deleted after the Hive client accessed it but before the mapper accessed it, or the machine on which the mapper is executing can’t see the file. Can you manually check whether the file exists at all after the job fails?

Prasad
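The rename-based publish Prasad describes can be sketched with java.nio, using ATOMIC_MOVE as a self-contained stand-in for HDFS's FileSystem.rename() (the class, method, and file names here are made up for illustration, not anything Hive ships):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicPublish {
    // Stage data in a temporary file, then atomically rename it into
    // place. Readers listing the directory see either no file or the
    // complete file, never a half-written one. (Illustrative only:
    // Hive on HDFS uses org.apache.hadoop.fs.FileSystem.rename().)
    public static void publish(Path dest, byte[] data) throws IOException {
        Path tmp = dest.resolveSibling(dest.getFileName() + "._tmp");
        Files.write(tmp, data);                          // 1. write to staging file
        Files.move(tmp, dest,
                   StandardCopyOption.ATOMIC_MOVE);      // 2. atomic on the same filesystem
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("publish-demo");
        Path dest = dir.resolve("part-00000");
        publish(dest, "merged rows".getBytes());
        System.out.println(Files.exists(dest));          // prints true
    }
}
```

Note that ‘insert overwrite’ does more than a rename: it also deletes the previous files in the partition, which is exactly the window a concurrent reader can fall into even though each individual rename is atomic.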




Re: Files does not exist error: concurrency control on hive queries...

Posted by Eva Tse <et...@netflix.com>.
Prasad,
We believe the problem is that one of the queries is doing an ‘insert
overwrite ... select from’, which actually deletes and merges the small
files. The other query somehow couldn’t find files it thought it had seen
before, and failed. So it looks like a concurrency issue.

Yongqiang,
Could you elaborate a bit on why you say this is not a bug?

Thanks,
Eva.




Re: Files does not exist error: concurrency control on hive queries...

Posted by Prasad Chakka <pc...@facebook.com>.
If a certain input file/dir does not exist, then the job can’t be submitted. Since only a few reducers are failing, the problem could be something else.
Eva, does the same job succeed on a second try? I.e., is the file/dir available eventually? What is the replication factor?

Prasad





Re: Files does not exist error: concurrency control on hive queries...

Posted by He Yongqiang <he...@software.ict.ac.cn>.
Hi Eva,
   After a close look at the code, I think this is not a bug, but we need to
find out how to avoid it.

Thanks,
Yongqiang


Re: Files does not exist error: concurrency control on hive queries...

Posted by He Yongqiang <he...@software.ict.ac.cn>.
Hi Eva,
    Can you open a new JIRA for this? Let’s discuss and resolve the issue there.
I guess this is because the partition metadata is added before the data is
available.

Thanks,
Yongqiang
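The ordering Yongqiang guesses at (partition metadata registered before the data is in place) can be sketched as follows. This is purely illustrative, with a plain in-memory map standing in for the metastore and made-up names throughout; the safe ordering stages and moves the data first, and only then registers the metadata:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PartitionPublish {
    // A plain map standing in for the Hive metastore (illustrative only).
    static final Map<String, Path> metastore = new ConcurrentHashMap<>();

    // Safe ordering: make the data fully available first, then expose it
    // through metadata. A reader that finds the partition registered is
    // then guaranteed to find its files on disk. Registering the metadata
    // first opens exactly the window described above.
    static void addPartition(String name, Path partitionDir, byte[] data)
            throws Exception {
        // 1. Stage the data in a sibling directory on the same filesystem.
        Path staging = Files.createTempDirectory(partitionDir.getParent(), "staging");
        Files.write(staging.resolve("part-00000"), data);
        // 2. Atomically move the staged directory into its final location.
        Files.move(staging, partitionDir, StandardCopyOption.ATOMIC_MOVE);
        // 3. Only now publish the partition's metadata.
        metastore.put(name, partitionDir);
    }

    public static void main(String[] args) throws Exception {
        Path warehouse = Files.createTempDirectory("warehouse");
        Path part = warehouse.resolve("dateint=20090908");
        addPartition("dateint=20090908", part, "session rows".getBytes());
        System.out.println(Files.exists(
                metastore.get("dateint=20090908").resolve("part-00000")));
    }
}
```

Even with this ordering, an overwrite of an already-registered partition still deletes the old files while readers may hold stale file lists, so it addresses only the add-partition case, not the merge case in this thread.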