Posted to user@hbase.apache.org by Daniel Iancu <da...@1and1.ro> on 2011/04/07 18:35:17 UTC

file is already being created by NN_Recovery

Hello everybody
We've run into this now-popular error on our cluster:

2011-04-07 16:28:00,654 WARN IPC Server handler 0 on 8020 
org.apache.hadoop.hdfs.StateChange - DIR* NameSystem.startFile: failed 
to create file 
/hbase/.logs/search-hadoop-eu001.v300.gmx.net,60020,1302075782687/search-hadoop-eu001.v300.gmx.net%3A60020.1302075783467 
for 
DFSClient_hb_m_search-namenode-eu002.v300.gmx.net:60000_1302186078300 on 
client 10.1.100.32, because this file is already being created by 
NN_Recovery on 10.1.100.61

I've read a couple of threads about it, but it seems that nobody has 
pinpointed the cause. Is the only solution to delete the log file and 
lose the data?

I've seen this error on almost every cluster we've installed so far; 
deleting the logs was not a concern since they were all test clusters. 
Now we've got it on the production cluster, and strangely, this cluster 
was just installed: there are no tables, no data, and no activity there. 
So what logs is the master trying to create?

We are running the latest CDH3B4 from Cloudera.

Thanks for any hints,
Daniel

Re: file is already being created by NN_Recovery

Posted by Todd Lipcon <to...@cloudera.com>.
On Fri, Apr 8, 2011 at 9:11 AM, Daniel Iancu <da...@1and1.ro> wrote:

>  What we did was test NN recovery from the SNN on a freshly installed
> cluster. After copying the image from the SNN, the cluster was started again
> and it seemed OK. After approx. 1h it started this infinite loop and kept
> doing so for the entire night (we've checked the logs).
> Since we were not using the cluster, we didn't look at it until the next day.
> There was no data and no activity there, so any kind of normal recovery
> should have finished in all that time. Unfortunately I don't have more
> precise details than this.
>
> We'd be happy to upgrade to the final CDH3 if it becomes available next
> week. If not, we'll keep an eye on this issue and if it becomes a pain
> we'll ask for the patch.
>

Yep, it will be available 4/12.

Thanks
-Todd


>
> On 04/07/2011 08:11 PM, Stack wrote:
>
>> The RegionServer is down for sure?  Else it sounds like an issue that
>> was addressed by the addition of a new short-circuit API call added to
>> HDFS on the hadoop-0.20-append branch.  The patches that added this
>> new call went into the branch quite a while ago.   They are:
>>
>>  HDFS-1554. New semantics for recoverLease. (hairong)
>>
>>  HDFS-1555. Disallow pipeline recovery if a file is already being
>>     lease recovered. (hairong)
>>
>> These patches are not in CDH3b*.  They are in the CDH3 release which
>> is due any day now.
>>
>> HBase 0.90.2 makes use of the new API: See
>> https://issues.apache.org/jira/browse/HBASE-3285.  Attached to that
>> issue is a patch for CDH3b2, a patch we are running here at SU.  Shout
>> if you need a version of this patch for CDH3b3/4.
>>
>> St.Ack
>>
>>
>> On Thu, Apr 7, 2011 at 9:35 AM, Daniel Iancu<da...@1and1.ro>
>>  wrote:
>>
>>> Hello everybody
>>> We've run into this now-popular error on our cluster:
>>>
>>> 2011-04-07 16:28:00,654 WARN IPC Server handler 0 on 8020
>>> org.apache.hadoop.hdfs.StateChange - DIR* NameSystem.startFile: failed to
>>> create file
>>> /hbase/.logs/search-hadoop-eu001.v300.gmx.net,60020,1302075782687/
>>> search-hadoop-eu001.v300.gmx.net%3A60020.1302075783467
>>> for DFSClient_hb_m_search-namenode-eu002.v300.gmx.net:60000_1302186078300
>>> on
>>> client 10.1.100.32, because this file is already being created by
>>> NN_Recovery on 10.1.100.61
>>>
>>> I've read a couple of threads about it, but it seems that nobody has
>>> pinpointed the cause. Is the only solution to delete the log
>>> file and lose the data?
>>>
>>> I've seen this error on almost every cluster we've installed so far;
>>> deleting the logs was not a concern since they were all test clusters.
>>> Now we've got it on the production cluster, and strangely, this cluster
>>> was just installed: there are no tables, no data, and no activity there.
>>> So what logs is the master trying to create?
>>>
>>> We are running the latest CDH3B4 from Cloudera.
>>>
>>> Thanks for any hints,
>>> Daniel
>>>
>>>
> --
> Daniel Iancu
> Java Developer, Web Components Romania
> 1&1 Internet Development srl.
> 18 Mircea Eliade St
> Sect 1, Bucharest
> RO Bucharest, 012015
> www.1and1.ro
> Phone: +40-031-223-9081
> Email: daniel.iancu@1and1.ro
> IM: diancu@united.domain
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: file is already being created by NN_Recovery

Posted by Daniel Iancu <da...@1and1.ro>.
  What we did was test NN recovery from the SNN on a freshly installed 
cluster. After copying the image from the SNN, the cluster was started 
again and it seemed OK. After approx. 1h it started this infinite loop 
and kept doing so for the entire night (we've checked the logs).
Since we were not using the cluster, we didn't look at it until the next 
day. There was no data and no activity there, so any kind of normal 
recovery should have finished in all that time. Unfortunately I don't 
have more precise details than this.

We'd be happy to upgrade to the final CDH3 if it becomes available next 
week. If not, we'll keep an eye on this issue and if it becomes a pain 
we'll ask for the patch.

Regards
Daniel


On 04/07/2011 08:11 PM, Stack wrote:
> The RegionServer is down for sure?  Else it sounds like an issue that
> was addressed by the addition of a new short-circuit API call added to
> HDFS on the hadoop-0.20-append branch.  The patches that added this
> new call went into the branch quite a while ago.   They are:
>
>   HDFS-1554. New semantics for recoverLease. (hairong)
>
>   HDFS-1555. Disallow pipeline recovery if a file is already being
>      lease recovered. (hairong)
>
> These patches are not in CDH3b*.  They are in the CDH3 release which
> is due any day now.
>
> HBase 0.90.2 makes use of the new API: See
> https://issues.apache.org/jira/browse/HBASE-3285.  Attached to that
> issue is a patch for CDH3b2, a patch we are running here at SU.  Shout
> if you need a version of this patch for CDH3b3/4.
>
> St.Ack
>
>
> On Thu, Apr 7, 2011 at 9:35 AM, Daniel Iancu<da...@1and1.ro>  wrote:
>> Hello everybody
>> We've run into this now-popular error on our cluster:
>>
>> 2011-04-07 16:28:00,654 WARN IPC Server handler 0 on 8020
>> org.apache.hadoop.hdfs.StateChange - DIR* NameSystem.startFile: failed to
>> create file
>> /hbase/.logs/search-hadoop-eu001.v300.gmx.net,60020,1302075782687/search-hadoop-eu001.v300.gmx.net%3A60020.1302075783467
>> for DFSClient_hb_m_search-namenode-eu002.v300.gmx.net:60000_1302186078300 on
>> client 10.1.100.32, because this file is already being created by
>> NN_Recovery on 10.1.100.61
>>
>> I've read a couple of threads about it, but it seems that nobody has
>> pinpointed the cause. Is the only solution to delete the log
>> file and lose the data?
>>
>> I've seen this error on almost every cluster we've installed so far;
>> deleting the logs was not a concern since they were all test clusters.
>> Now we've got it on the production cluster, and strangely, this cluster
>> was just installed: there are no tables, no data, and no activity there.
>> So what logs is the master trying to create?
>>
>> We are running the latest CDH3B4 from Cloudera.
>>
>> Thanks for any hints,
>> Daniel
>>

-- 
Daniel Iancu
Java Developer, Web Components Romania
1&1 Internet Development srl.
18 Mircea Eliade St
Sect 1, Bucharest
RO Bucharest, 012015
www.1and1.ro
Phone: +40-031-223-9081
Email: daniel.iancu@1and1.ro
IM: diancu@united.domain



Re: file is already being created by NN_Recovery

Posted by Stack <st...@duboce.net>.
The RegionServer is down for sure?  Else it sounds like an issue that
was addressed by the addition of a new short-circuit API call added to
HDFS on the hadoop-0.20-append branch.  The patches that added this
new call went into the branch quite a while ago.   They are:

 HDFS-1554. New semantics for recoverLease. (hairong)

 HDFS-1555. Disallow pipeline recovery if a file is already being
    lease recovered. (hairong)

These patches are not in CDH3b*.  They are in the CDH3 release which
is due any day now.

HBase 0.90.2 makes use of the new API: See
https://issues.apache.org/jira/browse/HBASE-3285.  Attached to that
issue is a patch for CDH3b2, a patch we are running here at SU.  Shout
if you need a version of this patch for CDH3b3/4.
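
For reference, here is a minimal sketch of how a client such as the HBase
master can call recoverLease() (with the HDFS-1554 semantics above) before
opening a log for splitting. It is an illustration only, not HBase's actual
splitting code; the path, retry count, and sleep interval are made-up values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RecoverLeaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("recoverLease is only available on HDFS");
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // Hypothetical WAL path; substitute the file named in the NameNode warning.
    Path wal = new Path("/hbase/.logs/some-regionserver,60020,1234567890/some-wal-file");

    // With the HDFS-1554 semantics, recoverLease() returns true once the lease
    // has been released and the file is closed; poll until then.
    boolean recovered = dfs.recoverLease(wal);
    for (int i = 0; i < 30 && !recovered; i++) {
      Thread.sleep(1000);            // give the NameNode time to finish recovery
      recovered = dfs.recoverLease(wal);
    }
    System.out.println("Lease recovered: " + recovered);
  }
}

Once recoverLease() reports true, the file can be opened for reading and
split without tripping over the old writer's lease.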

St.Ack


On Thu, Apr 7, 2011 at 9:35 AM, Daniel Iancu <da...@1and1.ro> wrote:
> Hello everybody
> We've run into this now-popular error on our cluster:
>
> 2011-04-07 16:28:00,654 WARN IPC Server handler 0 on 8020
> org.apache.hadoop.hdfs.StateChange - DIR* NameSystem.startFile: failed to
> create file
> /hbase/.logs/search-hadoop-eu001.v300.gmx.net,60020,1302075782687/search-hadoop-eu001.v300.gmx.net%3A60020.1302075783467
> for DFSClient_hb_m_search-namenode-eu002.v300.gmx.net:60000_1302186078300 on
> client 10.1.100.32, because this file is already being created by
> NN_Recovery on 10.1.100.61
>
> I've read a couple of threads about it, but it seems that nobody has
> pinpointed the cause. Is the only solution to delete the log
> file and lose the data?
>
> I've seen this error on almost every cluster we've installed so far;
> deleting the logs was not a concern since they were all test clusters.
> Now we've got it on the production cluster, and strangely, this cluster
> was just installed: there are no tables, no data, and no activity there.
> So what logs is the master trying to create?
>
> We are running the latest CDH3B4 from Cloudera.
>
> Thanks for any hints,
> Daniel
>

Re: file is already being created by NN_Recovery

Posted by Jack Levin <ma...@gmail.com>.
I would vote for that; it will save me a number of manual steps.

-Jack

On Fri, Apr 8, 2011 at 12:48 PM, Andrew Purtell <ap...@apache.org> wrote:
> I've wondered whether the master should copy the logs to some '.'-prefixed directory, delete the originals, and then split the copies. I haven't thought through all of the consequences, though.
>
>   - Andy

Re: file is already being created by NN_Recovery

Posted by Andrew Purtell <ap...@apache.org>.
I've wondered whether the master should copy the logs to some '.'-prefixed directory, delete the originals, and then split the copies. I haven't thought through all of the consequences, though.
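
For illustration, that idea might look roughly like the following against the
FileSystem API; the '.splitting' directory and the server directory names are
invented here, and this is not what the master currently does:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyThenSplitSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical directories, purely illustrative.
    Path serverLogDir = new Path("/hbase/.logs/some-regionserver,60020,1234567890");
    Path splittingDir = new Path("/hbase/.splitting/some-regionserver,60020,1234567890");
    fs.mkdirs(splittingDir);

    for (FileStatus log : fs.listStatus(serverLogDir)) {
      Path copy = new Path(splittingDir, log.getPath().getName());
      // Copy first, then delete the original; the copy carries no open lease.
      FileUtil.copy(fs, log.getPath(), fs, copy, false, conf);
      fs.delete(log.getPath(), false);
      // ...the copies under splittingDir would then go through normal log splitting...
    }
  }
}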

   - Andy

--- On Fri, 4/8/11, Daniel Iancu <da...@1and1.ro> wrote:

> From: Daniel Iancu <da...@1and1.ro>
> Subject: Re: file is already being created by NN_Recovery
> To: "Jack Levin" <ma...@gmail.com>
> Cc: "user@hbase.apache.org" <us...@hbase.apache.org>
> Date: Friday, April 8, 2011, 8:59 AM
>   Thanks Jack, looks like this
> is the best workaround as data is not lost.
> Daniel

Re: file is already being created by NN_Recovery

Posted by Daniel Iancu <da...@1and1.ro>.
  Thanks Jack, looks like this is the best workaround as data is not lost.
Daniel

On 04/07/2011 07:41 PM, Jack Levin wrote:
> If you have dfs.socket.timeout set to 0, consider removing it; most of
> our issues like that went away after that. This problem occurs when
> you have a datanode crash and there is a conflict with the lease on the
> file (which should expire in one hour; this is an unconfigurable hard
> timeout). If you do end up in a situation like that, the only way we
> could resolve it is like this:
>
> # stop the master
> # hadoop fs -cp file new_file
> # hadoop fs -rm file
> # hadoop fs -cp new_file file
> # start master, and watch it replay the log.
>
> This appears to break the lease, as the new .log file does not have this issue.
>
> -Jack
>
> On Thu, Apr 7, 2011 at 9:35 AM, Daniel Iancu<da...@1and1.ro>  wrote:
>> Hello everybody
>> We've run into this now-popular error on our cluster:
>>
>> 2011-04-07 16:28:00,654 WARN IPC Server handler 0 on 8020
>> org.apache.hadoop.hdfs.StateChange - DIR* NameSystem.startFile: failed to
>> create file
>> /hbase/.logs/search-hadoop-eu001.v300.gmx.net,60020,1302075782687/search-hadoop-eu001.v300.gmx.net%3A60020.1302075783467
>> for DFSClient_hb_m_search-namenode-eu002.v300.gmx.net:60000_1302186078300 on
>> client 10.1.100.32, because this file is already being created by
>> NN_Recovery on 10.1.100.61
>>
>> I've read a couple of threads about it, but it seems that nobody has
>> pinpointed the cause. Is the only solution to delete the log
>> file and lose the data?
>>
>> I've seen this error on almost every cluster we've installed so far;
>> deleting the logs was not a concern since they were all test clusters.
>> Now we've got it on the production cluster, and strangely, this cluster
>> was just installed: there are no tables, no data, and no activity there.
>> So what logs is the master trying to create?
>>
>> We are running the latest CDH3B4 from Cloudera.
>>
>> Thanks for any hints,
>> Daniel
>>

-- 
Daniel Iancu
Java Developer, Web Components Romania
1&1 Internet Development srl.
18 Mircea Eliade St
Sect 1, Bucharest
RO Bucharest, 012015
www.1and1.ro
Phone: +40-031-223-9081
Email: daniel.iancu@1and1.ro
IM: diancu@united.domain



Re: file is already being created by NN_Recovery

Posted by Jack Levin <ma...@gmail.com>.
If you have dfs.socket.timeout set to 0, consider removing it; most of
our issues like that went away after that. This problem occurs when
you have a datanode crash and there is a conflict with the lease on the
file (which should expire in one hour; this is an unconfigurable hard
timeout). If you do end up in a situation like that, the only way we
could resolve it is like this:

# stop the master
# hadoop fs -cp file new_file
# hadoop fs -rm file
# hadoop fs -cp new_file file
# start master, and watch it replay the log.

This appears to break the lease, as the new .log file does not have this issue.
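
To make those steps concrete, here is a rough equivalent of the same
copy/delete/copy-back sequence using the Hadoop FileSystem API. The log path
below is a made-up placeholder for the file named in the NameNode warning,
and as above this assumes the master is stopped while it runs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class BreakWalLeaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path; substitute the stuck log from the NameNode warning.
    Path log = new Path("/hbase/.logs/some-regionserver,60020,1234567890/some-wal-file");
    Path copy = new Path(log.getParent(), log.getName() + ".copy");

    FileUtil.copy(fs, log, fs, copy, false, conf);  // hadoop fs -cp file new_file
    fs.delete(log, false);                          // hadoop fs -rm file
    FileUtil.copy(fs, copy, fs, log, true, conf);   // hadoop fs -cp new_file file,
                                                    // deleting the temporary copy
  }
}

The freshly written copy has no outstanding lease, so the master can replay
it normally when it comes back up.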

-Jack

On Thu, Apr 7, 2011 at 9:35 AM, Daniel Iancu <da...@1and1.ro> wrote:
> Hello everybody
> We've run into this now-popular error on our cluster:
>
> 2011-04-07 16:28:00,654 WARN IPC Server handler 0 on 8020
> org.apache.hadoop.hdfs.StateChange - DIR* NameSystem.startFile: failed to
> create file
> /hbase/.logs/search-hadoop-eu001.v300.gmx.net,60020,1302075782687/search-hadoop-eu001.v300.gmx.net%3A60020.1302075783467
> for DFSClient_hb_m_search-namenode-eu002.v300.gmx.net:60000_1302186078300 on
> client 10.1.100.32, because this file is already being created by
> NN_Recovery on 10.1.100.61
>
> I've read a couple of threads about it, but it seems that nobody has
> pinpointed the cause. Is the only solution to delete the log
> file and lose the data?
>
> I've seen this error on almost every cluster we've installed so far;
> deleting the logs was not a concern since they were all test clusters.
> Now we've got it on the production cluster, and strangely, this cluster
> was just installed: there are no tables, no data, and no activity there.
> So what logs is the master trying to create?
>
> We are running the latest CDH3B4 from Cloudera.
>
> Thanks for any hints,
> Daniel
>