Posted to user@hbase.apache.org by Zheng Lv <lv...@gmail.com> on 2010/03/15 11:40:00 UTC

Cannot open filename Exceptions

Hello Everyone,
    Recently we have often been getting the following in our client logs:
    org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
contact region server 172.16.1.208:60020 for region
summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
row
'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
but failed after 10 attempts.
Exceptions:
java.io.IOException: java.io.IOException: Cannot open filename
/hbase/summary/1491233486/metrics/5046821377427277894
at
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
at
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
at
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
at java.io.DataInputStream.read(DataInputStream.java:132)
at
org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
at
org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
at
org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
at
org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
at
org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
at
org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
    Is there any way to fix this problem? Or is there anything we can do,
even manually, to mitigate it?
    Any suggestions?
    Thank you.
    LvZheng
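
A quick first check (it is also what gets suggested later in this thread) is to
see whether the store file named in the exception still exists in HDFS. A
minimal sketch, assuming the commands are run from the Hadoop install directory
and the path is the one from the exception above:

  # Does the HFile still exist, and does it have a sane size?
  ./bin/hadoop fs -ls /hbase/summary/1491233486/metrics/5046821377427277894
  # Try to pull a copy locally; "No such file or directory" means the
  # region server is holding an in-memory reference to a file that is gone.
  ./bin/hadoop fs -get /hbase/summary/1491233486/metrics/5046821377427277894 /tmp/5046821377427277894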

Re: Cannot open filename Exceptions

Posted by Patrick Hunt <ph...@apache.org>.
Zheng Lv wrote:
>   I didn't change the tick value, and I will do it right now. But I wanna
> know why the timeout value can only be 20 times bigger than ticktime, can
> you tell me?

The limit is mainly there to keep users from shooting themselves in the 
foot. Typically (we've not seen any case other than HBase where this has 
been necessary) you want low timeouts, in the 5-10 second range, perhaps 
30 seconds on the outside. This results in sessions being cleaned up 
quickly, and in general allows clients to be very responsive to 
failures. Because HBase region servers are affected by limitations in the 
Sun GC, we've had to force a larger timeout than would normally be used, 
via a workaround. In ZooKeeper 3.3 we added new configuration options 
specific to this HBase use case (there is now a parameter to control the 
maximum timeout limit). In 3.3 we've also added log messages on the 
server and client that record the negotiated timeout, and the client API 
provides programmatic access to the negotiated timeout.

There are other reasons why we have min/max limits on the negotiated 
timeout, in particular to limit memory use on the server. There is state 
associated with each session, and we do not want it to grow too large; 
having a max timeout limit effectively helps to cap this.
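
For reference, the ZooKeeper 3.3 options mentioned above are the
minSessionTimeout and maxSessionTimeout server settings; a minimal zoo.cfg
sketch with purely illustrative values:

  # zoo.cfg
  tickTime=3000
  # Without these, the negotiated session timeout is clamped to the range
  # [2 * tickTime, 20 * tickTime], i.e. 6000-60000 ms with tickTime=3000.
  minSessionTimeout=6000
  # Raise the upper bound so a client such as HBase can negotiate a longer session.
  maxSessionTimeout=240000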

Patrick

> 
> 2010/3/26 Jean-Daniel Cryans <jd...@apache.org>
> 
>> 4 CPUs seems ok, unless you are running 2-3 MR tasks at the same time.
>>
>> So your value for the timeout is 240000, but did you change the tick
>> time? The GC pause you got seemed to last almost a minute which, if
>> you did not change the tick value, matches 3000*20 (disregard your
>> session timeout).
>>
>> J-D
>>
>> On Thu, Mar 25, 2010 at 1:07 AM, Zheng Lv <lv...@gmail.com>
>> wrote:
>>> Hello J-D,
>>>  Thank you for your reply first.
>>>  >How many CPUs do you have?
>>>  Every server has 2 Dual-Core cpus.
>>>  >Are you swapping?
>>>  Now I'm not sure about it with our monitor tools, but now we have
>> written
>>> a script to record vmstat log every 2 seconds. If something wrong happen
>>> again, we can take it.
>>>  >Also if the only you are using this system currently to batch load
>>>  >data or as an analytics backend, you probably want to set the timeout
>>>  >higher:
>>>  But our value of this property is already 240000.
>>>
>>>  We will try to optimize our garbage collector and we will see what will
>>> happen.
>>>  Thanks again, J-D,
>>>    LvZheng
>>>
>>> 2010/3/25 Jean-Daniel Cryans <jd...@apache.org>
>>>
>>>> 2010-03-24 11:33:52,331 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 54963ms, ten times longer than scheduled: 3000
>>>>
>>>> You had an important garbage collector pause (aka pause of the world
>>>> in java-speak) and your region server's session with zookeeper expired
>>>> (it literally stopped responding for too long, so long it was
>>>> considered dead). Are you swapping? How many CPUs do you have? If you
>>>> are slowing down the garbage collecting process, it will take more
>>>> time.
>>>>
>>>> Also if the only you are using this system currently to batch load
>>>> data or as an analytics backend, you probably want to set the timeout
>>>> higher:
>>>>
>>>>  <property>
>>>>    <name>zookeeper.session.timeout</name>
>>>>    <value>60000</value>
>>>>    <description>ZooKeeper session timeout.
>>>>      HBase passes this to the zk quorum as suggested maximum time for a
>>>>      session.  See
>>>>
>>>>
>> http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions
>>>>      "The client sends a requested timeout, the server responds with the
>>>>      timeout that it can give the client. The current implementation
>>>>      requires that the timeout be a minimum of 2 times the tickTime
>>>>      (as set in the server configuration) and a maximum of 20 times
>>>>      the tickTime." Set the zk ticktime with
>>>> hbase.zookeeper.property.tickTime.
>>>>      In milliseconds.
>>>>    </description>
>>>>  </property>
>>>>
>>>> This value can only be 20 times bigger than this:
>>>>
>>>>  <property>
>>>>    <name>hbase.zookeeper.property.tickTime</name>
>>>>    <value>3000</value>
>>>>    <description>Property from ZooKeeper's config zoo.cfg.
>>>>    The number of milliseconds of each tick.  See
>>>>    zookeeper.session.timeout description.
>>>>    </description>
>>>>  </property>
>>>>
>>>>
>>>> So you could set tick to 6000, timeout to 120000 for a 2min timeout.
>>>>
> 

Re: Cannot open filename Exceptions

Posted by Zheng Lv <lv...@gmail.com>.
Hello J-D,
  >4 CPUs seems ok, unless you are running 2-3 MR tasks at the same time.
  I don't think we have ever run 3 MR tasks at the same time on one server;
maybe sometimes 2, but not 3. And according to our monitoring tools, the
CPUs are never busy.
  I didn't change the tick value, and I will do it right now. But I would
like to know why the timeout value can only be 20 times the tickTime; can
you tell me why?
  Thank you,
  Regards,
    LvZheng


2010/3/26 Jean-Daniel Cryans <jd...@apache.org>

> 4 CPUs seems ok, unless you are running 2-3 MR tasks at the same time.
>
> So your value for the timeout is 240000, but did you change the tick
> time? The GC pause you got seemed to last almost a minute which, if
> you did not change the tick value, matches 3000*20 (disregard your
> session timeout).
>
> J-D
>
> On Thu, Mar 25, 2010 at 1:07 AM, Zheng Lv <lv...@gmail.com>
> wrote:
> > Hello J-D,
> >  Thank you for your reply first.
> >  >How many CPUs do you have?
> >  Every server has 2 Dual-Core cpus.
> >  >Are you swapping?
> >  Now I'm not sure about it with our monitor tools, but now we have
> written
> > a script to record vmstat log every 2 seconds. If something wrong happen
> > again, we can take it.
> >  >Also if the only you are using this system currently to batch load
> >  >data or as an analytics backend, you probably want to set the timeout
> >  >higher:
> >  But our value of this property is already 240000.
> >
> >  We will try to optimize our garbage collector and we will see what will
> > happen.
> >  Thanks again, J-D,
> >    LvZheng
> >
> > 2010/3/25 Jean-Daniel Cryans <jd...@apache.org>
> >
> >> 2010-03-24 11:33:52,331 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >> slept 54963ms, ten times longer than scheduled: 3000
> >>
> >> You had an important garbage collector pause (aka pause of the world
> >> in java-speak) and your region server's session with zookeeper expired
> >> (it literally stopped responding for too long, so long it was
> >> considered dead). Are you swapping? How many CPUs do you have? If you
> >> are slowing down the garbage collecting process, it will take more
> >> time.
> >>
> >> Also if the only you are using this system currently to batch load
> >> data or as an analytics backend, you probably want to set the timeout
> >> higher:
> >>
> >>  <property>
> >>    <name>zookeeper.session.timeout</name>
> >>    <value>60000</value>
> >>    <description>ZooKeeper session timeout.
> >>      HBase passes this to the zk quorum as suggested maximum time for a
> >>      session.  See
> >>
> >>
> http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions
> >>      "The client sends a requested timeout, the server responds with the
> >>      timeout that it can give the client. The current implementation
> >>      requires that the timeout be a minimum of 2 times the tickTime
> >>      (as set in the server configuration) and a maximum of 20 times
> >>      the tickTime." Set the zk ticktime with
> >> hbase.zookeeper.property.tickTime.
> >>      In milliseconds.
> >>    </description>
> >>  </property>
> >>
> >> This value can only be 20 times bigger than this:
> >>
> >>  <property>
> >>    <name>hbase.zookeeper.property.tickTime</name>
> >>    <value>3000</value>
> >>    <description>Property from ZooKeeper's config zoo.cfg.
> >>    The number of milliseconds of each tick.  See
> >>    zookeeper.session.timeout description.
> >>    </description>
> >>  </property>
> >>
> >>
> >> So you could set tick to 6000, timeout to 120000 for a 2min timeout.
> >>
>

Re: Cannot open filename Exceptions

Posted by Jean-Daniel Cryans <jd...@apache.org>.
4 CPUs seems ok, unless you are running 2-3 MR tasks at the same time.

So your value for the timeout is 240000, but did you change the tick
time? The GC pause you got seemed to last almost a minute, which, if
you did not change the tick value, matches 3000*20 = 60000 ms; your
configured session timeout is disregarded beyond that cap.

J-D

On Thu, Mar 25, 2010 at 1:07 AM, Zheng Lv <lv...@gmail.com> wrote:
> Hello J-D,
>  Thank you for your reply first.
>  >How many CPUs do you have?
>  Every server has 2 Dual-Core cpus.
>  >Are you swapping?
>  Now I'm not sure about it with our monitor tools, but now we have written
> a script to record vmstat log every 2 seconds. If something wrong happen
> again, we can take it.
>  >Also if the only you are using this system currently to batch load
>  >data or as an analytics backend, you probably want to set the timeout
>  >higher:
>  But our value of this property is already 240000.
>
>  We will try to optimize our garbage collector and we will see what will
> happen.
>  Thanks again, J-D,
>    LvZheng
>
> 2010/3/25 Jean-Daniel Cryans <jd...@apache.org>
>
>> 2010-03-24 11:33:52,331 WARN org.apache.hadoop.hbase.util.Sleeper: We
>> slept 54963ms, ten times longer than scheduled: 3000
>>
>> You had an important garbage collector pause (aka pause of the world
>> in java-speak) and your region server's session with zookeeper expired
>> (it literally stopped responding for too long, so long it was
>> considered dead). Are you swapping? How many CPUs do you have? If you
>> are slowing down the garbage collecting process, it will take more
>> time.
>>
>> Also if the only you are using this system currently to batch load
>> data or as an analytics backend, you probably want to set the timeout
>> higher:
>>
>>  <property>
>>    <name>zookeeper.session.timeout</name>
>>    <value>60000</value>
>>    <description>ZooKeeper session timeout.
>>      HBase passes this to the zk quorum as suggested maximum time for a
>>      session.  See
>>
>> http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions
>>      "The client sends a requested timeout, the server responds with the
>>      timeout that it can give the client. The current implementation
>>      requires that the timeout be a minimum of 2 times the tickTime
>>      (as set in the server configuration) and a maximum of 20 times
>>      the tickTime." Set the zk ticktime with
>> hbase.zookeeper.property.tickTime.
>>      In milliseconds.
>>    </description>
>>  </property>
>>
>> This value can only be 20 times bigger than this:
>>
>>  <property>
>>    <name>hbase.zookeeper.property.tickTime</name>
>>    <value>3000</value>
>>    <description>Property from ZooKeeper's config zoo.cfg.
>>    The number of milliseconds of each tick.  See
>>    zookeeper.session.timeout description.
>>    </description>
>>  </property>
>>
>>
>> So you could set tick to 6000, timeout to 120000 for a 2min timeout.
>>

Re: Cannot open filename Exceptions

Posted by Zheng Lv <lv...@gmail.com>.
Hello J-D,
  Thank you for your reply first.
  >How many CPUs do you have?
  Every server has two dual-core CPUs.
  >Are you swapping?
  We can't tell from our monitoring tools at the moment, but we have written
a script to record vmstat output every 2 seconds. If something goes wrong
again, we can capture it.
  >Also if the only you are using this system currently to batch load
  >data or as an analytics backend, you probably want to set the timeout
  >higher:
  But our value of this property is already 240000.

  We will try to tune our garbage collector and see what happens.
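
For the GC tuning mentioned here, a common starting point on HBase 0.20-era
region servers was to enable CMS and GC logging in hbase-env.sh; this is only
a hedged sketch with illustrative flags, since the right values depend on heap
size and workload:

  # hbase-env.sh (illustrative values, not a recommendation)
  export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
    -XX:CMSInitiatingOccupancyFraction=70 \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
    -Xloggc:/tmp/gc-hbase.log"
  # The GC log makes it possible to correlate long pauses with the
  # "We slept ...ms" warnings and with ZooKeeper session expirations.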
  Thanks again, J-D,
    LvZheng

2010/3/25 Jean-Daniel Cryans <jd...@apache.org>

> 2010-03-24 11:33:52,331 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 54963ms, ten times longer than scheduled: 3000
>
> You had an important garbage collector pause (aka pause of the world
> in java-speak) and your region server's session with zookeeper expired
> (it literally stopped responding for too long, so long it was
> considered dead). Are you swapping? How many CPUs do you have? If you
> are slowing down the garbage collecting process, it will take more
> time.
>
> Also if the only you are using this system currently to batch load
> data or as an analytics backend, you probably want to set the timeout
> higher:
>
>  <property>
>    <name>zookeeper.session.timeout</name>
>    <value>60000</value>
>    <description>ZooKeeper session timeout.
>      HBase passes this to the zk quorum as suggested maximum time for a
>      session.  See
>
> http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions
>      "The client sends a requested timeout, the server responds with the
>      timeout that it can give the client. The current implementation
>      requires that the timeout be a minimum of 2 times the tickTime
>      (as set in the server configuration) and a maximum of 20 times
>      the tickTime." Set the zk ticktime with
> hbase.zookeeper.property.tickTime.
>      In milliseconds.
>    </description>
>  </property>
>
> This value can only be 20 times bigger than this:
>
>  <property>
>    <name>hbase.zookeeper.property.tickTime</name>
>    <value>3000</value>
>    <description>Property from ZooKeeper's config zoo.cfg.
>    The number of milliseconds of each tick.  See
>    zookeeper.session.timeout description.
>    </description>
>  </property>
>
>
> So you could set tick to 6000, timeout to 120000 for a 2min timeout.
>
> J-D
>
> On Wed, Mar 24, 2010 at 8:01 PM, Zheng Lv <lv...@gmail.com>
> wrote:
> > Hello Stack,
> >  Yesterday we got another problem about "zookeeper session expired",
> > leading rs shutdown, which never happened before.
> >  I googled it, finding some docs about it, but didnot get things
> > really certain about how it happened and how to avoid it.
> >  Now I have put the corresponding logs to
> > http://rapidshare.com/files/367820690/208-0324.log.html.
> >  Look forward to your reply.
> >  Thank you.
> >    LvZheng
> >
> > 2010/3/24 Zheng Lv <lv...@gmail.com>
> >
> >> Hello Stack,
> >>   Thank you for your explainations, it's very helpful, Thank you.
> >>   If I get something new, I'll connect you.
> >>   Regards,
> >>     LvZheng
> >>
> >> 2010/3/24 Stack <st...@duboce.net>
> >>
> >>  On Tue, Mar 23, 2010 at 8:42 PM, Zheng Lv <lv...@gmail.com>
> >>> wrote:
> >>> > Hello Stack,
> >>> >  >So, for sure ugly stuff is going on.  I filed
> >>> >  >https://issues.apache.org/jira/browse/HBASE-2365.  It looks like
> >>> we're
> >>> >  >doubly assigning a region.
> >>> >  Can you tell me how this happened in detail? Thanks a lot.
> >>> >
> >>>
> >>> Yes.
> >>>
> >>> Splits are run by the regionserver.  It figures a region needs to be
> >>> split and goes ahead closing the parent and creating the daughter
> >>> regions.  It then adds edits to the meta table offlining the parent
> >>> and inserting the two new daughter regions.  Next it sends a message
> >>> to the master telling it that a region has been split.   The message
> >>> contains names of the daughter regions.  On receipt of the message,
> >>> the master adds the new daughter regions to the unassigned regions
> >>> list so they'll be passed out the next time a regionserver checks in.
> >>>
> >>> Concurrently, the master is running a scan of the meta table every
> >>> minute making sure all is in order.  One thing it does is if it finds
> >>> unassigned regions, it'll add them to the unassigned regions (this
> >>> process is what gets all regions assigned after a startup).
> >>>
> >>> In your case, whats happening is that there is a long period between
> >>> the add of the new split regions to the meta table and the report of
> >>> split to the master.  During this time, the master meta scan ran,
> >>> found one of the daughters and went and assigned it.  Then the split
> >>> message came in and the daughter was assigned again!
> >>>
> >>> There was supposed to be protection against this happening IIRC.
> >>> Looking at responsible code, we are trying to defend against this
> >>> happening in ServerManager:
> >>>
> >>>  /*
> >>>   * Assign new daughter-of-a-split UNLESS its already been assigned.
> >>>   * It could have been assigned already in rare case where there was a
> >>> large
> >>>   * gap between insertion of the daughter region into .META. by the
> >>>   * splitting regionserver and receipt of the split message in master
> (See
> >>>   * HBASE-1784).
> >>>   * @param hri Region to assign.
> >>>   */
> >>>  private void assignSplitDaughter(final HRegionInfo hri) {
> >>>    MetaRegion mr =
> >>> this.master.regionManager.getFirstMetaRegionForRegion(hri);
> >>>    Get g = new Get(hri.getRegionName());
> >>>    g.addFamily(HConstants.CATALOG_FAMILY);
> >>>    try {
> >>>      HRegionInterface server =
> >>>        master.connection.getHRegionConnection(mr.getServer());
> >>>      Result r = server.get(mr.getRegionName(), g);
> >>>      // If size > 3 -- presume regioninfo, startcode and server -- then
> >>> presume
> >>>      // that this daughter already assigned and return.
> >>>      if (r.size() >= 3) return;
> >>>    } catch (IOException e) {
> >>>      LOG.warn("Failed get on " + HConstants.CATALOG_FAMILY_STR +
> >>>        "; possible double-assignment?", e);
> >>>    }
> >>>    this.master.regionManager.setUnassigned(hri, false);
> >>>  }
> >>>
> >>> So, the above is not working in your case for some reason.   I'll take
> >>> a look but I'm not sure I can figure it w/o DEBUG (thanks for letting
> >>> me know about the out-of-sync clocks... Now I can have more faith in
> >>> what the logs are telling me).
> >>>
> >>> >
> >>> >  >With DEBUG enabled have you been able to reproduce?
> >>> >  These days the exception did not appera again, if it would, I'll
> show
> >>> you
> >>> > the logs.
> >>> >
> >>>
> >>> For sure, if you come across it again, I'm interested.
> >>>
> >>> Thanks Zheng,
> >>> St.Ack
> >>>
> >>
> >>
> >
>

Re: Cannot open filename Exceptions

Posted by Jean-Daniel Cryans <jd...@apache.org>.
2010-03-24 11:33:52,331 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 54963ms, ten times longer than scheduled: 3000

You had a long garbage collector pause (aka a stop-the-world pause in
java-speak) and your region server's session with ZooKeeper expired
(it literally stopped responding for too long, so long that it was
considered dead). Are you swapping? How many CPUs do you have? If
anything is slowing down the garbage collection process, the pause
will take even more time.

Also, if you are currently only using this system to batch-load data
or as an analytics backend, you probably want to set the timeout
higher:

  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
    <description>ZooKeeper session timeout.
      HBase passes this to the zk quorum as suggested maximum time for a
      session.  See
http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions
      "The client sends a requested timeout, the server responds with the
      timeout that it can give the client. The current implementation
      requires that the timeout be a minimum of 2 times the tickTime
      (as set in the server configuration) and a maximum of 20 times
      the tickTime." Set the zk ticktime with hbase.zookeeper.property.tickTime.
      In milliseconds.
    </description>
  </property>

This value can be at most 20 times bigger than this one:

  <property>
    <name>hbase.zookeeper.property.tickTime</name>
    <value>3000</value>
    <description>Property from ZooKeeper's config zoo.cfg.
    The number of milliseconds of each tick.  See
    zookeeper.session.timeout description.
    </description>
  </property>


So you could set the tick to 6000 and the timeout to 120000 for a 2-minute timeout.
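
Concretely, that corresponds to something like the following in hbase-site.xml;
a sketch that assumes HBase manages its own ZooKeeper quorum (otherwise the
tickTime has to be changed in the ensemble's zoo.cfg instead):

  <property>
    <name>hbase.zookeeper.property.tickTime</name>
    <value>6000</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <!-- 20 x tickTime is the most the server will grant -->
    <value>120000</value>
  </property>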

J-D

On Wed, Mar 24, 2010 at 8:01 PM, Zheng Lv <lv...@gmail.com> wrote:
> Hello Stack,
>  Yesterday we got another problem about "zookeeper session expired",
> leading rs shutdown, which never happened before.
>  I googled it, finding some docs about it, but didnot get things
> really certain about how it happened and how to avoid it.
>  Now I have put the corresponding logs to
> http://rapidshare.com/files/367820690/208-0324.log.html.
>  Look forward to your reply.
>  Thank you.
>    LvZheng
>
> 2010/3/24 Zheng Lv <lv...@gmail.com>
>
>> Hello Stack,
>>   Thank you for your explainations, it's very helpful, Thank you.
>>   If I get something new, I'll connect you.
>>   Regards,
>>     LvZheng
>>
>> 2010/3/24 Stack <st...@duboce.net>
>>
>>  On Tue, Mar 23, 2010 at 8:42 PM, Zheng Lv <lv...@gmail.com>
>>> wrote:
>>> > Hello Stack,
>>> >  >So, for sure ugly stuff is going on.  I filed
>>> >  >https://issues.apache.org/jira/browse/HBASE-2365.  It looks like
>>> we're
>>> >  >doubly assigning a region.
>>> >  Can you tell me how this happened in detail? Thanks a lot.
>>> >
>>>
>>> Yes.
>>>
>>> Splits are run by the regionserver.  It figures a region needs to be
>>> split and goes ahead closing the parent and creating the daughter
>>> regions.  It then adds edits to the meta table offlining the parent
>>> and inserting the two new daughter regions.  Next it sends a message
>>> to the master telling it that a region has been split.   The message
>>> contains names of the daughter regions.  On receipt of the message,
>>> the master adds the new daughter regions to the unassigned regions
>>> list so they'll be passed out the next time a regionserver checks in.
>>>
>>> Concurrently, the master is running a scan of the meta table every
>>> minute making sure all is in order.  One thing it does is if it finds
>>> unassigned regions, it'll add them to the unassigned regions (this
>>> process is what gets all regions assigned after a startup).
>>>
>>> In your case, whats happening is that there is a long period between
>>> the add of the new split regions to the meta table and the report of
>>> split to the master.  During this time, the master meta scan ran,
>>> found one of the daughters and went and assigned it.  Then the split
>>> message came in and the daughter was assigned again!
>>>
>>> There was supposed to be protection against this happening IIRC.
>>> Looking at responsible code, we are trying to defend against this
>>> happening in ServerManager:
>>>
>>>  /*
>>>   * Assign new daughter-of-a-split UNLESS its already been assigned.
>>>   * It could have been assigned already in rare case where there was a
>>> large
>>>   * gap between insertion of the daughter region into .META. by the
>>>   * splitting regionserver and receipt of the split message in master (See
>>>   * HBASE-1784).
>>>   * @param hri Region to assign.
>>>   */
>>>  private void assignSplitDaughter(final HRegionInfo hri) {
>>>    MetaRegion mr =
>>> this.master.regionManager.getFirstMetaRegionForRegion(hri);
>>>    Get g = new Get(hri.getRegionName());
>>>    g.addFamily(HConstants.CATALOG_FAMILY);
>>>    try {
>>>      HRegionInterface server =
>>>        master.connection.getHRegionConnection(mr.getServer());
>>>      Result r = server.get(mr.getRegionName(), g);
>>>      // If size > 3 -- presume regioninfo, startcode and server -- then
>>> presume
>>>      // that this daughter already assigned and return.
>>>      if (r.size() >= 3) return;
>>>    } catch (IOException e) {
>>>      LOG.warn("Failed get on " + HConstants.CATALOG_FAMILY_STR +
>>>        "; possible double-assignment?", e);
>>>    }
>>>    this.master.regionManager.setUnassigned(hri, false);
>>>  }
>>>
>>> So, the above is not working in your case for some reason.   I'll take
>>> a look but I'm not sure I can figure it w/o DEBUG (thanks for letting
>>> me know about the out-of-sync clocks... Now I can have more faith in
>>> what the logs are telling me).
>>>
>>> >
>>> >  >With DEBUG enabled have you been able to reproduce?
>>> >  These days the exception did not appera again, if it would, I'll show
>>> you
>>> > the logs.
>>> >
>>>
>>> For sure, if you come across it again, I'm interested.
>>>
>>> Thanks Zheng,
>>> St.Ack
>>>
>>
>>
>

Re: Cannot open filename Exceptions

Posted by Zheng Lv <lv...@gmail.com>.
Hello Stack,
  Yesterday we hit another problem, a "zookeeper session expired" error
leading to a region server shutdown, which had never happened before.
  I googled it and found some docs about it, but I am still not really
certain how it happened or how to avoid it.
  I have now put the corresponding logs at
http://rapidshare.com/files/367820690/208-0324.log.html.
  I look forward to your reply.
  Thank you.
    LvZheng

2010/3/24 Zheng Lv <lv...@gmail.com>

> Hello Stack,
>   Thank you for your explainations, it's very helpful, Thank you.
>   If I get something new, I'll connect you.
>   Regards,
>     LvZheng
>
> 2010/3/24 Stack <st...@duboce.net>
>
>  On Tue, Mar 23, 2010 at 8:42 PM, Zheng Lv <lv...@gmail.com>
>> wrote:
>> > Hello Stack,
>> >  >So, for sure ugly stuff is going on.  I filed
>> >  >https://issues.apache.org/jira/browse/HBASE-2365.  It looks like
>> we're
>> >  >doubly assigning a region.
>> >  Can you tell me how this happened in detail? Thanks a lot.
>> >
>>
>> Yes.
>>
>> Splits are run by the regionserver.  It figures a region needs to be
>> split and goes ahead closing the parent and creating the daughter
>> regions.  It then adds edits to the meta table offlining the parent
>> and inserting the two new daughter regions.  Next it sends a message
>> to the master telling it that a region has been split.   The message
>> contains names of the daughter regions.  On receipt of the message,
>> the master adds the new daughter regions to the unassigned regions
>> list so they'll be passed out the next time a regionserver checks in.
>>
>> Concurrently, the master is running a scan of the meta table every
>> minute making sure all is in order.  One thing it does is if it finds
>> unassigned regions, it'll add them to the unassigned regions (this
>> process is what gets all regions assigned after a startup).
>>
>> In your case, whats happening is that there is a long period between
>> the add of the new split regions to the meta table and the report of
>> split to the master.  During this time, the master meta scan ran,
>> found one of the daughters and went and assigned it.  Then the split
>> message came in and the daughter was assigned again!
>>
>> There was supposed to be protection against this happening IIRC.
>> Looking at responsible code, we are trying to defend against this
>> happening in ServerManager:
>>
>>  /*
>>   * Assign new daughter-of-a-split UNLESS its already been assigned.
>>   * It could have been assigned already in rare case where there was a
>> large
>>   * gap between insertion of the daughter region into .META. by the
>>   * splitting regionserver and receipt of the split message in master (See
>>   * HBASE-1784).
>>   * @param hri Region to assign.
>>   */
>>  private void assignSplitDaughter(final HRegionInfo hri) {
>>    MetaRegion mr =
>> this.master.regionManager.getFirstMetaRegionForRegion(hri);
>>    Get g = new Get(hri.getRegionName());
>>    g.addFamily(HConstants.CATALOG_FAMILY);
>>    try {
>>      HRegionInterface server =
>>        master.connection.getHRegionConnection(mr.getServer());
>>      Result r = server.get(mr.getRegionName(), g);
>>      // If size > 3 -- presume regioninfo, startcode and server -- then
>> presume
>>      // that this daughter already assigned and return.
>>      if (r.size() >= 3) return;
>>    } catch (IOException e) {
>>      LOG.warn("Failed get on " + HConstants.CATALOG_FAMILY_STR +
>>        "; possible double-assignment?", e);
>>    }
>>    this.master.regionManager.setUnassigned(hri, false);
>>  }
>>
>> So, the above is not working in your case for some reason.   I'll take
>> a look but I'm not sure I can figure it w/o DEBUG (thanks for letting
>> me know about the out-of-sync clocks... Now I can have more faith in
>> what the logs are telling me).
>>
>> >
>> >  >With DEBUG enabled have you been able to reproduce?
>> >  These days the exception did not appera again, if it would, I'll show
>> you
>> > the logs.
>> >
>>
>> For sure, if you come across it again, I'm interested.
>>
>> Thanks Zheng,
>> St.Ack
>>
>
>

Re: Cannot open filename Exceptions

Posted by Zheng Lv <lv...@gmail.com>.
Hello Stack,
  Thank you for your explanations, they are very helpful.
  If I find anything new, I'll contact you.
  Regards,
    LvZheng

2010/3/24 Stack <st...@duboce.net>

> On Tue, Mar 23, 2010 at 8:42 PM, Zheng Lv <lv...@gmail.com>
> wrote:
> > Hello Stack,
> >  >So, for sure ugly stuff is going on.  I filed
> >  >https://issues.apache.org/jira/browse/HBASE-2365.  It looks like we're
> >  >doubly assigning a region.
> >  Can you tell me how this happened in detail? Thanks a lot.
> >
>
> Yes.
>
> Splits are run by the regionserver.  It figures a region needs to be
> split and goes ahead closing the parent and creating the daughter
> regions.  It then adds edits to the meta table offlining the parent
> and inserting the two new daughter regions.  Next it sends a message
> to the master telling it that a region has been split.   The message
> contains names of the daughter regions.  On receipt of the message,
> the master adds the new daughter regions to the unassigned regions
> list so they'll be passed out the next time a regionserver checks in.
>
> Concurrently, the master is running a scan of the meta table every
> minute making sure all is in order.  One thing it does is if it finds
> unassigned regions, it'll add them to the unassigned regions (this
> process is what gets all regions assigned after a startup).
>
> In your case, whats happening is that there is a long period between
> the add of the new split regions to the meta table and the report of
> split to the master.  During this time, the master meta scan ran,
> found one of the daughters and went and assigned it.  Then the split
> message came in and the daughter was assigned again!
>
> There was supposed to be protection against this happening IIRC.
> Looking at responsible code, we are trying to defend against this
> happening in ServerManager:
>
>  /*
>   * Assign new daughter-of-a-split UNLESS its already been assigned.
>   * It could have been assigned already in rare case where there was a
> large
>   * gap between insertion of the daughter region into .META. by the
>   * splitting regionserver and receipt of the split message in master (See
>   * HBASE-1784).
>   * @param hri Region to assign.
>   */
>  private void assignSplitDaughter(final HRegionInfo hri) {
>    MetaRegion mr =
> this.master.regionManager.getFirstMetaRegionForRegion(hri);
>    Get g = new Get(hri.getRegionName());
>    g.addFamily(HConstants.CATALOG_FAMILY);
>    try {
>      HRegionInterface server =
>        master.connection.getHRegionConnection(mr.getServer());
>      Result r = server.get(mr.getRegionName(), g);
>      // If size > 3 -- presume regioninfo, startcode and server -- then
> presume
>      // that this daughter already assigned and return.
>      if (r.size() >= 3) return;
>    } catch (IOException e) {
>      LOG.warn("Failed get on " + HConstants.CATALOG_FAMILY_STR +
>        "; possible double-assignment?", e);
>    }
>    this.master.regionManager.setUnassigned(hri, false);
>  }
>
> So, the above is not working in your case for some reason.   I'll take
> a look but I'm not sure I can figure it w/o DEBUG (thanks for letting
> me know about the out-of-sync clocks... Now I can have more faith in
> what the logs are telling me).
>
> >
> >  >With DEBUG enabled have you been able to reproduce?
> >  These days the exception did not appera again, if it would, I'll show
> you
> > the logs.
> >
>
> For sure, if you come across it again, I'm interested.
>
> Thanks Zheng,
> St.Ack
>

Re: Cannot open filename Exceptions

Posted by Stack <st...@duboce.net>.
On Tue, Mar 23, 2010 at 8:42 PM, Zheng Lv <lv...@gmail.com> wrote:
> Hello Stack,
>  >So, for sure ugly stuff is going on.  I filed
>  >https://issues.apache.org/jira/browse/HBASE-2365.  It looks like we're
>  >doubly assigning a region.
>  Can you tell me how this happened in detail? Thanks a lot.
>

Yes.

Splits are run by the regionserver.  It figures a region needs to be
split and goes ahead closing the parent and creating the daughter
regions.  It then adds edits to the meta table offlining the parent
and inserting the two new daughter regions.  Next it sends a message
to the master telling it that a region has been split.   The message
contains names of the daughter regions.  On receipt of the message,
the master adds the new daughter regions to the unassigned regions
list so they'll be passed out the next time a regionserver checks in.

Concurrently, the master is running a scan of the meta table every
minute making sure all is in order.  One thing it does: if it finds
unassigned regions, it'll add them to the unassigned regions list (this
process is what gets all regions assigned after a startup).

In your case, what's happening is that there is a long period between
the add of the new split regions to the meta table and the report of
split to the master.  During this time, the master meta scan ran,
found one of the daughters and went and assigned it.  Then the split
message came in and the daughter was assigned again!

There was supposed to be protection against this happening, IIRC.
Looking at the responsible code, we try to defend against this
happening in ServerManager:

  /*
   * Assign new daughter-of-a-split UNLESS its already been assigned.
   * It could have been assigned already in rare case where there was a large
   * gap between insertion of the daughter region into .META. by the
   * splitting regionserver and receipt of the split message in master (See
   * HBASE-1784).
   * @param hri Region to assign.
   */
  private void assignSplitDaughter(final HRegionInfo hri) {
    MetaRegion mr = this.master.regionManager.getFirstMetaRegionForRegion(hri);
    Get g = new Get(hri.getRegionName());
    g.addFamily(HConstants.CATALOG_FAMILY);
    try {
      HRegionInterface server =
        master.connection.getHRegionConnection(mr.getServer());
      Result r = server.get(mr.getRegionName(), g);
      // If size > 3 -- presume regioninfo, startcode and server -- then presume
      // that this daughter already assigned and return.
      if (r.size() >= 3) return;
    } catch (IOException e) {
      LOG.warn("Failed get on " + HConstants.CATALOG_FAMILY_STR +
        "; possible double-assignment?", e);
    }
    this.master.regionManager.setUnassigned(hri, false);
  }

So, the above is not working in your case for some reason.  I'll take
a look, but I'm not sure I can figure it out w/o DEBUG logs (thanks for letting
me know about the out-of-sync clocks... Now I can have more faith in
what the logs are telling me).

>
>  >With DEBUG enabled have you been able to reproduce?
>  These days the exception did not appera again, if it would, I'll show you
> the logs.
>

For sure, if you come across it again, I'm interested.

Thanks Zheng,
St.Ack

Re: Cannot open filename Exceptions

Posted by Zheng Lv <lv...@gmail.com>.
Hello Stack,
  >So, for sure ugly stuff is going on.  I filed
  >https://issues.apache.org/jira/browse/HBASE-2365.  It looks like we're
  >doubly assigning a region.
  Can you tell me how this happened in detail? Thanks a lot.

  >Can you confirm that 209 lags behind the master (207) by about 25
  >seconds?  Are you running NTP on these machines so they sync their
  >clocks?
  Yes, 209 lags behind 207 by about 30s. But can this lead to the
exceptions?
  There was something wrong with our NTP script before; it has now been
fixed.
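
A quick way to double-check the skew between the machines is to query an NTP
server from each node without changing the clock; a small sketch, assuming
ntpdate is installed and using the cactus20x hostnames from elsewhere in this
thread:

  # Report (but do not set) the offset from a public NTP server
  ntpdate -q pool.ntp.org
  # Or compare wall clocks across the cluster in one pass
  for h in cactus207 cactus208 cactus209 cactus210; do ssh $h date; done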

  >With DEBUG enabled have you been able to reproduce?
  The exception has not appeared again these days; if it does, I'll show you
the logs.

  Thanks a lot again.
    LvZheng




2010/3/23 Stack <st...@duboce.net>

> So, for sure ugly stuff is going on.  I filed
> https://issues.apache.org/jira/browse/HBASE-2365.  It looks like we're
> doubly assigning a region.
>
> Can you confirm that 209 lags behind the master (207) by about 25
> seconds?  Are you running NTP on these machines so they sync their
> clocks?
>
> With DEBUG enabled have you been able to reproduce?
>
> That said there might be enough in these logs to go on if you can
> confirm the above.
>
> Thanks for your patience Zheng.
>
> St.Ack
>
>
>
> On Thu, Mar 18, 2010 at 11:43 PM, Zheng Lv <lv...@gmail.com>
> wrote:
> > Hello Stack,
> >  I must say thank you, for your patience too.
> >  I'm sorry for that you had tried for many times but the logs you got
> were
> > not that usful. Now I have turn the logging to debug level, so if we get
> > these exceptions again, I will send you debug logs. Anyway, I still
> upload
> > the logs you want to rapidshare, although they are not in debug level.
> The
> > urls:
> >
> >
> http://rapidshare.com/files/365292889/hadoop-root-namenode-cactus207.log.2010-03-15.html
> >
> >
> http://rapidshare.com/files/365293127/hbase-root-master-cactus207.log.2010-03-15.html
> >
> >
> http://rapidshare.com/files/365293238/hbase-root-regionserver-cactus208.log.2010-03-15.html
> >
> >
> http://rapidshare.com/files/365293391/hbase-root-regionserver-cactus209.log.2010-03-15.html
> >
> >
> http://rapidshare.com/files/365293488/hbase-root-regionserver-cactus210.log.2010-03-15.html
> >
> >  >For sure you've upped xceivers on your hdfs cluster and you've upped
> >>the file descriptors as per the 'Getting Started'? (Sorry, have to
> >>ask).
> >  Before I got your mail, we didn't set the properties you mentioned,
> > because we didn't got the "too many open files" or something which are
> > mentioned in "getting start" docs. But now I have upped these properties.
> > We'll see what will happen.
> >
> >  If you need more infomations, just tell me.
> >
> >  Thanks again,
> >  LvZheng.
> >
> >
> > 2010/3/19 Stack <st...@duboce.net>
> >
> >> Yeah, I had to retry a couple of times ("Too busy; try back later --
> >> or sign up premium service!").
> >>
> >> It would have been nice to have wider log snippets.  I'd like to have
> >> seen if the issue was double assignment.  The master log snippet only
> >> shows the split.  Regionserver 209's log is the one where the
> >> interesting stuff is going on around this time, 2010-03-15
> >> 16:06:51,150, but its not in the provided set.  Neither are you
> >> running at DEBUG level so it'd be harder to see what is up even if you
> >> provided it.
> >>
> >> Looking in 208, I see a few exceptions beyond the one you paste below.
> >>  For sure you've upped xceivers on your hdfs cluster and you've upped
> >> the file descriptors as per the 'Getting Started'? (Sorry, have to
> >> ask).
> >>
> >> Can I have more of the logs?  Can I have all of the namenode log, all
> >> of the master log and 209's log?  This rapidshare thing is fine with
> >> me.  I don't mind retrying.
> >>
> >> Sorry it took me a while to get to this.
> >> St.Ack
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Mar 17, 2010 at 8:32 PM, Zheng Lv <lv...@gmail.com>
> >> wrote:
> >> > Hello Stack,
> >> >    >Sorry. It's taken me a while.  Let try and get to this this
> evening
> >> >    Is it downloading the log files what take you a while? I m sorry, I
> >> used
> >> > to upload files to skydrive, but now we cant access the website. Is
> there
> >> > any netdisk or something you can download fast? I can upload to it.
> >> >    LvZheng
> >> > 2010/3/18 Stack <sa...@gmail.com>
> >> >
> >> >> Sorry. It's taken me a while.  Let try and get to this this evening
> >> >>
> >> >> Thank you for your patience
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Mar 17, 2010, at 2:29 AM, Zheng Lv <lv...@gmail.com>
> >> wrote:
> >> >>
> >> >> Hello Stack,
> >> >>>  Did you receive my mail?It looks like you didnt.
> >> >>>   LvZheng
> >> >>>
> >> >>> 2010/3/16 Zheng Lv <lv...@gmail.com>
> >> >>>
> >> >>> Hello Stack,
> >> >>>>  I have uploaded some parts of the logs on master, regionserver208
> and
> >> >>>> regionserver210 to:
> >> >>>>  http://rapidshare.com/files/363988384/master_207_log.txt.html
> >> >>>>
> http://rapidshare.com/files/363988673/regionserver_208_log.txt.html
> >> >>>>
> http://rapidshare.com/files/363988819/regionserver_210_log.txt.html
> >> >>>>  I noticed that there are some LeaseExpiredException and
> "2010-03-15
> >> >>>> 16:06:32,864 ERROR
> >> >>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> >> >>>> Compaction/Split failed for region ..." before 17 oclock. Did these
> >> lead
> >> >>>> to
> >> >>>> the error? Why did these happened? How to avoid these?
> >> >>>>  Thanks.
> >> >>>>   LvZheng
> >> >>>> 2010/3/16 Stack <st...@duboce.net>
> >> >>>>
> >> >>>> Maybe just the master log would be sufficient from around this time
> to
> >> >>>>> figure the story.
> >> >>>>> St.Ack
> >> >>>>>
> >> >>>>> On Mon, Mar 15, 2010 at 10:04 PM, Stack <st...@duboce.net> wrote:
> >> >>>>>
> >> >>>>>> Hey Zheng:
> >> >>>>>>
> >> >>>>>> On Mon, Mar 15, 2010 at 8:16 PM, Zheng Lv <
> >> lvzheng19800619@gmail.com>
> >> >>>>>>
> >> >>>>> wrote:
> >> >>>>>
> >> >>>>>> Hello Stack,
> >> >>>>>>> After we got these exceptions, we restart the cluster and
> restarted
> >> >>>>>>>
> >> >>>>>> the
> >> >>>>>
> >> >>>>>> job that failed, and the job succeeded.
> >> >>>>>>> Now when we access
> >> >>>>>>>
> >> >>>>>> /hbase/summary/1491233486/metrics/5046821377427277894,
> >> >>>>>
> >> >>>>>> we got " Cannot access
> >> >>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894: No such
> file
> >> or
> >> >>>>>>> directory." .
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>> So, that would seem to indicate that the reference was in memory
> >> >>>>>> only.. that file was not in filesystem.  You could have tried
> >> closing
> >> >>>>>> that region.   It would have been interesting also to find
> history
> >> on
> >> >>>>>> that region, to try and figure how it came to hold in memory a
> >> >>>>>> reference to a file since removed.
> >> >>>>>>
> >> >>>>>> The messages about this file in namenode logs are in here:
> >> >>>>>>> http://rapidshare.com/files/363938595/log.txt.html
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>> This is interesting.  Do you have regionserver logs from 209,
> 208,
> >> and
> >> >>>>>> 210 for corresponding times?
> >> >>>>>>
> >> >>>>>> Thanks,
> >> >>>>>> St.Ack
> >> >>>>>>
> >> >>>>>> The job failed startted about at 17 o'clock.
> >> >>>>>>> By the way, the hadoop version we are using is 0.20.1, the hbase
> >> >>>>>>>
> >> >>>>>> version
> >> >>>>>
> >> >>>>>> we are using is 0.20.3.
> >> >>>>>>>
> >> >>>>>>> Regards,
> >> >>>>>>> LvZheng
> >> >>>>>>> 2010/3/16 Stack <st...@duboce.net>
> >> >>>>>>>
> >> >>>>>>> Can you get that file from hdfs?
> >> >>>>>>>>
> >> >>>>>>>> ./bin/hadoop fs -get
> >> >>>>>>>>>
> >> >>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
> >> >>>>>>>>
> >> >>>>>>>> Does it look wholesome?  Is it empty?
> >> >>>>>>>>
> >> >>>>>>>> What if you trace the life of that file in regionserver logs or
> >> >>>>>>>> probably better, over in namenode log?  If you move this file
> >> aside,
> >> >>>>>>>> the region deploys?
> >> >>>>>>>>
> >> >>>>>>>> St.Ack
> >> >>>>>>>>
> >> >>>>>>>> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <
> >> lvzheng19800619@gmail.com
> >> >>>>>>>> >
> >> >>>>>>>> wrote:
> >> >>>>>>>>
> >> >>>>>>>>> Hello Everyone,
> >> >>>>>>>>>  Recently we often got these in our client logs:
> >> >>>>>>>>>  org.apache.hadoop.hbase.client.RetriesExhaustedException:
> Trying
> >> >>>>>>>>>
> >> >>>>>>>> to
> >> >>>>>
> >> >>>>>>  contact region server 172.16.1.208:60020 for region
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
> >> >>>>>
> >> >>>>>>  row
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
> >> >>>>>
> >> >>>>>>  but failed after 10 attempts.
> >> >>>>>>>>> Exceptions:
> >> >>>>>>>>> java.io.IOException: java.io.IOException: Cannot open filename
> >> >>>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
> >> >>>>>>>>> at
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
> >> >>>>>
> >> >>>>>>  at
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
> >> >>>>>
> >> >>>>>>  at
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
> >> >>>>>
> >> >>>>>>  at
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
> >> >>>>>
> >> >>>>>>  at java.io.DataInputStream.read(DataInputStream.java:132)
> >> >>>>>>>>> at
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
> >> >>>>>
> >> >>>>>>  at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
> >> >>>>>>>>> at
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
> >> >>>>>
> >> >>>>>>  at
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >> org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
> >> >>>>>
> >> >>>>>>  at
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
> >> >>>>>
> >> >>>>>>  at
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
> >> >>>>>
> >> >>>>>>  at
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
> >> >>>>>
> >> >>>>>>  at
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
> >> >>>>>
> >> >>>>>>  at
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
> >> >>>>>
> >> >>>>>>  at
> org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
> >> >>>>>>>>> at
> >> >>>>>>>>>
> >> >>>>>>>>
> >> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
> >> >>>>>
> >> >>>>>>  at
> >> >>>>>>>>>
> >> >>>>>>>>
> >> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
> >> >>>>>
> >> >>>>>>  at
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
> >> >>>>>
> >> >>>>>>  at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> >> >>>>>>>>> at
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >>>>>
> >> >>>>>>  at java.lang.reflect.Method.invoke(Method.java:597)
> >> >>>>>>>>> at
> >> >>>>>>>>>
> >> >>>>>>>>
> >> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
> >> >>>>>
> >> >>>>>>  at
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >>
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> >> >>>>>
> >> >>>>>>   Is there any way to fix this problem? Or is there anything we
> can
> >> >>>>>>>>>
> >> >>>>>>>> do
> >> >>>>>
> >> >>>>>>  even manually to relieve it?
> >> >>>>>>>>>  Any suggestion?
> >> >>>>>>>>>  Thank you.
> >> >>>>>>>>>  LvZheng
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>>
> >> >
> >>
> >
>

Re: Cannot open filename Exceptions

Posted by Stack <st...@duboce.net>.
So, for sure ugly stuff is going on.  I filed
https://issues.apache.org/jira/browse/HBASE-2365.  It looks like we're
doubly assigning a region.

Can you confirm that 209 lags behind the master (207) by about 25
seconds?  Are you running NTP on these machines so they sync their
clocks?

With DEBUG enabled have you been able to reproduce?

That said there might be enough in these logs to go on if you can
confirm the above.

Thanks for your patience Zheng.

St.Ack



On Thu, Mar 18, 2010 at 11:43 PM, Zheng Lv <lv...@gmail.com> wrote:
> Hello Stack,
>  I must say thank you, for your patience too.
>  I'm sorry for that you had tried for many times but the logs you got were
> not that usful. Now I have turn the logging to debug level, so if we get
> these exceptions again, I will send you debug logs. Anyway, I still upload
> the logs you want to rapidshare, although they are not in debug level. The
> urls:
>
> http://rapidshare.com/files/365292889/hadoop-root-namenode-cactus207.log.2010-03-15.html
>
> http://rapidshare.com/files/365293127/hbase-root-master-cactus207.log.2010-03-15.html
>
> http://rapidshare.com/files/365293238/hbase-root-regionserver-cactus208.log.2010-03-15.html
>
> http://rapidshare.com/files/365293391/hbase-root-regionserver-cactus209.log.2010-03-15.html
>
> http://rapidshare.com/files/365293488/hbase-root-regionserver-cactus210.log.2010-03-15.html
>
>  >For sure you've upped xceivers on your hdfs cluster and you've upped
>>the file descriptors as per the 'Getting Started'? (Sorry, have to
>>ask).
>  Before I got your mail, we didn't set the properties you mentioned,
> because we didn't got the "too many open files" or something which are
> mentioned in "getting start" docs. But now I have upped these properties.
> We'll see what will happen.
>
>  If you need more infomations, just tell me.
>
>  Thanks again,
>  LvZheng.
>
>
> 2010/3/19 Stack <st...@duboce.net>
>
>> Yeah, I had to retry a couple of times ("Too busy; try back later --
>> or sign up premium service!").
>>
>> It would have been nice to have wider log snippets.  I'd like to have
>> seen if the issue was double assignment.  The master log snippet only
>> shows the split.  Regionserver 209's log is the one where the
>> interesting stuff is going on around this time, 2010-03-15
>> 16:06:51,150, but its not in the provided set.  Neither are you
>> running at DEBUG level so it'd be harder to see what is up even if you
>> provided it.
>>
>> Looking in 208, I see a few exceptions beyond the one you paste below.
>>  For sure you've upped xceivers on your hdfs cluster and you've upped
>> the file descriptors as per the 'Getting Started'? (Sorry, have to
>> ask).
>>
>> Can I have more of the logs?  Can I have all of the namenode log, all
>> of the master log and 209's log?  This rapidshare thing is fine with
>> me.  I don't mind retrying.
>>
>> Sorry it took me a while to get to this.
>> St.Ack
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Mar 17, 2010 at 8:32 PM, Zheng Lv <lv...@gmail.com>
>> wrote:
>> > Hello Stack,
>> >    >Sorry. It's taken me a while.  Let try and get to this this evening
>> >    Is it downloading the log files what take you a while? I m sorry, I
>> used
>> > to upload files to skydrive, but now we cant access the website. Is there
>> > any netdisk or something you can download fast? I can upload to it.
>> >    LvZheng
>> > 2010/3/18 Stack <sa...@gmail.com>
>> >
>> >> Sorry. It's taken me a while.  Let try and get to this this evening
>> >>
>> >> Thank you for your patience
>> >>
>> >>
>> >>
>> >>
>> >> On Mar 17, 2010, at 2:29 AM, Zheng Lv <lv...@gmail.com>
>> wrote:
>> >>
>> >> Hello Stack,
>> >>>  Did you receive my mail?It looks like you didnt.
>> >>>   LvZheng
>> >>>
>> >>> 2010/3/16 Zheng Lv <lv...@gmail.com>
>> >>>
>> >>> Hello Stack,
>> >>>>  I have uploaded some parts of the logs on master, regionserver208 and
>> >>>> regionserver210 to:
>> >>>>  http://rapidshare.com/files/363988384/master_207_log.txt.html
>> >>>>  http://rapidshare.com/files/363988673/regionserver_208_log.txt.html
>> >>>>  http://rapidshare.com/files/363988819/regionserver_210_log.txt.html
>> >>>>  I noticed that there are some LeaseExpiredException and "2010-03-15
>> >>>> 16:06:32,864 ERROR
>> >>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>> >>>> Compaction/Split failed for region ..." before 17 oclock. Did these
>> lead
>> >>>> to
>> >>>> the error? Why did these happened? How to avoid these?
>> >>>>  Thanks.
>> >>>>   LvZheng
>> >>>> 2010/3/16 Stack <st...@duboce.net>
>> >>>>
>> >>>> Maybe just the master log would be sufficient from around this time to
>> >>>>> figure the story.
>> >>>>> St.Ack
>> >>>>>
>> >>>>> On Mon, Mar 15, 2010 at 10:04 PM, Stack <st...@duboce.net> wrote:
>> >>>>>
>> >>>>>> Hey Zheng:
>> >>>>>>
>> >>>>>> On Mon, Mar 15, 2010 at 8:16 PM, Zheng Lv <
>> lvzheng19800619@gmail.com>
>> >>>>>>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Hello Stack,
>> >>>>>>> After we got these exceptions, we restart the cluster and restarted
>> >>>>>>>
>> >>>>>> the
>> >>>>>
>> >>>>>> job that failed, and the job succeeded.
>> >>>>>>> Now when we access
>> >>>>>>>
>> >>>>>> /hbase/summary/1491233486/metrics/5046821377427277894,
>> >>>>>
>> >>>>>> we got " Cannot access
>> >>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894: No such file
>> or
>> >>>>>>> directory." .
>> >>>>>>>
>> >>>>>>>
>> >>>>>> So, that would seem to indicate that the reference was in memory
>> >>>>>> only.. that file was not in filesystem.  You could have tried
>> closing
>> >>>>>> that region.   It would have been interesting also to find history
>> on
>> >>>>>> that region, to try and figure how it came to hold in memory a
>> >>>>>> reference to a file since removed.
>> >>>>>>
>> >>>>>> The messages about this file in namenode logs are in here:
>> >>>>>>> http://rapidshare.com/files/363938595/log.txt.html
>> >>>>>>>
>> >>>>>>
>> >>>>>> This is interesting.  Do you have regionserver logs from 209, 208,
>> and
>> >>>>>> 210 for corresponding times?
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> St.Ack
>> >>>>>>
>> >>>>>> The job failed startted about at 17 o'clock.
>> >>>>>>> By the way, the hadoop version we are using is 0.20.1, the hbase
>> >>>>>>>
>> >>>>>> version
>> >>>>>
>> >>>>>> we are using is 0.20.3.
>> >>>>>>>
>> >>>>>>> Regards,
>> >>>>>>> LvZheng
>> >>>>>>> 2010/3/16 Stack <st...@duboce.net>
>> >>>>>>>
>> >>>>>>> Can you get that file from hdfs?
>> >>>>>>>>
>> >>>>>>>> ./bin/hadoop fs -get
>> >>>>>>>>>
>> >>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
>> >>>>>>>>
>> >>>>>>>> Does it look wholesome?  Is it empty?
>> >>>>>>>>
>> >>>>>>>> What if you trace the life of that file in regionserver logs or
>> >>>>>>>> probably better, over in namenode log?  If you move this file
>> aside,
>> >>>>>>>> the region deploys?
>> >>>>>>>>
>> >>>>>>>> St.Ack
>> >>>>>>>>
>> >>>>>>>> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <
>> lvzheng19800619@gmail.com
>> >>>>>>>> >
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Hello Everyone,
>> >>>>>>>>>  Recently we often got these in our client logs:
>> >>>>>>>>>  org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying
>> >>>>>>>>>
>> >>>>>>>> to
>> >>>>>
>> >>>>>>  contact region server 172.16.1.208:60020 for region
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
>> >>>>>
>> >>>>>>  row
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
>> >>>>>
>> >>>>>>  but failed after 10 attempts.
>> >>>>>>>>> Exceptions:
>> >>>>>>>>> java.io.IOException: java.io.IOException: Cannot open filename
>> >>>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
>> >>>>>>>>> at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>> >>>>>
>> >>>>>>  at java.io.DataInputStream.read(DataInputStream.java:132)
>> >>>>>>>>> at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
>> >>>>>
>> >>>>>>  at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>> >>>>>>>>> at
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
>> >>>>>
>> >>>>>>  at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
>> >>>>>>>>> at
>> >>>>>>>>>
>> >>>>>>>>
>> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>
>> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
>> >>>>>
>> >>>>>>  at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>> >>>>>>>>> at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >>>>>
>> >>>>>>  at java.lang.reflect.Method.invoke(Method.java:597)
>> >>>>>>>>> at
>> >>>>>>>>>
>> >>>>>>>>
>> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> >>>>>
>> >>>>>>   Is there any way to fix this problem? Or is there anything we can
>> >>>>>>>>>
>> >>>>>>>> do
>> >>>>>
>> >>>>>>  even manually to relieve it?
>> >>>>>>>>>  Any suggestion?
>> >>>>>>>>>  Thank you.
>> >>>>>>>>>  LvZheng
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >
>>
>

Re: Cannot open filename Exceptions

Posted by Zheng Lv <lv...@gmail.com>.
Hello Stack,
  Is it taking you too long to download the logs? May I upload them to
another site that would be easier for you to download from?
    LvZheng

2010/3/19 Zheng Lv <lv...@gmail.com>

> Hello Stack,
>   I must say thank you, for your patience too.
>   I'm sorry for that you had tried for many times but the logs you got were
> not that usful. Now I have turn the logging to debug level, so if we get
> these exceptions again, I will send you debug logs. Anyway, I still upload
> the logs you want to rapidshare, although they are not in debug level. The
> urls:
>
> http://rapidshare.com/files/365292889/hadoop-root-namenode-cactus207.log.2010-03-15.html
>
> http://rapidshare.com/files/365293127/hbase-root-master-cactus207.log.2010-03-15.html
>
> http://rapidshare.com/files/365293238/hbase-root-regionserver-cactus208.log.2010-03-15.html
>
> http://rapidshare.com/files/365293391/hbase-root-regionserver-cactus209.log.2010-03-15.html
>
> http://rapidshare.com/files/365293488/hbase-root-regionserver-cactus210.log.2010-03-15.html
>
>   >For sure you've upped xceivers on your hdfs cluster and you've upped
> >the file descriptors as per the 'Getting Started'? (Sorry, have to
> >ask).
>   Before I got your mail, we didn't set the properties you mentioned,
> because we didn't got the "too many open files" or something which are
> mentioned in "getting start" docs. But now I have upped these properties.
> We'll see what will happen.
>
>   If you need more infomations, just tell me.
>
>   Thanks again,
>   LvZheng.
>
>
> 2010/3/19 Stack <st...@duboce.net>
>
> Yeah, I had to retry a couple of times ("Too busy; try back later --
>> or sign up premium service!").
>>
>> It would have been nice to have wider log snippets.  I'd like to have
>> seen if the issue was double assignment.  The master log snippet only
>> shows the split.  Regionserver 209's log is the one where the
>> interesting stuff is going on around this time, 2010-03-15
>> 16:06:51,150, but its not in the provided set.  Neither are you
>> running at DEBUG level so it'd be harder to see what is up even if you
>> provided it.
>>
>> Looking in 208, I see a few exceptions beyond the one you paste below.
>>  For sure you've upped xceivers on your hdfs cluster and you've upped
>> the file descriptors as per the 'Getting Started'? (Sorry, have to
>> ask).
>>
>> Can I have more of the logs?  Can I have all of the namenode log, all
>> of the master log and 209's log?  This rapidshare thing is fine with
>> me.  I don't mind retrying.
>>
>> Sorry it took me a while to get to this.
>> St.Ack
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Mar 17, 2010 at 8:32 PM, Zheng Lv <lv...@gmail.com>
>> wrote:
>> > Hello Stack,
>> >    >Sorry. It's taken me a while.  Let try and get to this this evening
>> >    Is it downloading the log files what take you a while? I m sorry, I
>> used
>> > to upload files to skydrive, but now we cant access the website. Is
>> there
>> > any netdisk or something you can download fast? I can upload to it.
>> >    LvZheng
>> > 2010/3/18 Stack <sa...@gmail.com>
>> >
>> >> Sorry. It's taken me a while.  Let try and get to this this evening
>> >>
>> >> Thank you for your patience
>> >>
>> >>
>> >>
>> >>
>> >> On Mar 17, 2010, at 2:29 AM, Zheng Lv <lv...@gmail.com>
>> wrote:
>> >>
>> >> Hello Stack,
>> >>>  Did you receive my mail?It looks like you didnt.
>> >>>   LvZheng
>> >>>
>> >>> 2010/3/16 Zheng Lv <lv...@gmail.com>
>> >>>
>> >>> Hello Stack,
>> >>>>  I have uploaded some parts of the logs on master, regionserver208
>> and
>> >>>> regionserver210 to:
>> >>>>  http://rapidshare.com/files/363988384/master_207_log.txt.html
>> >>>>  http://rapidshare.com/files/363988673/regionserver_208_log.txt.html
>> >>>>  http://rapidshare.com/files/363988819/regionserver_210_log.txt.html
>> >>>>  I noticed that there are some LeaseExpiredException and "2010-03-15
>> >>>> 16:06:32,864 ERROR
>> >>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>> >>>> Compaction/Split failed for region ..." before 17 oclock. Did these
>> lead
>> >>>> to
>> >>>> the error? Why did these happened? How to avoid these?
>> >>>>  Thanks.
>> >>>>   LvZheng
>> >>>> 2010/3/16 Stack <st...@duboce.net>
>> >>>>
>> >>>> Maybe just the master log would be sufficient from around this time
>> to
>> >>>>> figure the story.
>> >>>>> St.Ack
>> >>>>>
>> >>>>> On Mon, Mar 15, 2010 at 10:04 PM, Stack <st...@duboce.net> wrote:
>> >>>>>
>> >>>>>> Hey Zheng:
>> >>>>>>
>> >>>>>> On Mon, Mar 15, 2010 at 8:16 PM, Zheng Lv <
>> lvzheng19800619@gmail.com>
>> >>>>>>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Hello Stack,
>> >>>>>>> After we got these exceptions, we restart the cluster and
>> restarted
>> >>>>>>>
>> >>>>>> the
>> >>>>>
>> >>>>>> job that failed, and the job succeeded.
>> >>>>>>> Now when we access
>> >>>>>>>
>> >>>>>> /hbase/summary/1491233486/metrics/5046821377427277894,
>> >>>>>
>> >>>>>> we got " Cannot access
>> >>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894: No such
>> file or
>> >>>>>>> directory." .
>> >>>>>>>
>> >>>>>>>
>> >>>>>> So, that would seem to indicate that the reference was in memory
>> >>>>>> only.. that file was not in filesystem.  You could have tried
>> closing
>> >>>>>> that region.   It would have been interesting also to find history
>> on
>> >>>>>> that region, to try and figure how it came to hold in memory a
>> >>>>>> reference to a file since removed.
>> >>>>>>
>> >>>>>> The messages about this file in namenode logs are in here:
>> >>>>>>> http://rapidshare.com/files/363938595/log.txt.html
>> >>>>>>>
>> >>>>>>
>> >>>>>> This is interesting.  Do you have regionserver logs from 209, 208,
>> and
>> >>>>>> 210 for corresponding times?
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> St.Ack
>> >>>>>>
>> >>>>>> The job failed startted about at 17 o'clock.
>> >>>>>>> By the way, the hadoop version we are using is 0.20.1, the hbase
>> >>>>>>>
>> >>>>>> version
>> >>>>>
>> >>>>>> we are using is 0.20.3.
>> >>>>>>>
>> >>>>>>> Regards,
>> >>>>>>> LvZheng
>> >>>>>>> 2010/3/16 Stack <st...@duboce.net>
>> >>>>>>>
>> >>>>>>> Can you get that file from hdfs?
>> >>>>>>>>
>> >>>>>>>> ./bin/hadoop fs -get
>> >>>>>>>>>
>> >>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
>> >>>>>>>>
>> >>>>>>>> Does it look wholesome?  Is it empty?
>> >>>>>>>>
>> >>>>>>>> What if you trace the life of that file in regionserver logs or
>> >>>>>>>> probably better, over in namenode log?  If you move this file
>> aside,
>> >>>>>>>> the region deploys?
>> >>>>>>>>
>> >>>>>>>> St.Ack
>> >>>>>>>>
>> >>>>>>>> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <
>> lvzheng19800619@gmail.com
>> >>>>>>>> >
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Hello Everyone,
>> >>>>>>>>>  Recently we often got these in our client logs:
>> >>>>>>>>>  org.apache.hadoop.hbase.client.RetriesExhaustedException:
>> Trying
>> >>>>>>>>>
>> >>>>>>>> to
>> >>>>>
>> >>>>>>  contact region server 172.16.1.208:60020 for region
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
>> >>>>>
>> >>>>>>  row
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
>> >>>>>
>> >>>>>>  but failed after 10 attempts.
>> >>>>>>>>> Exceptions:
>> >>>>>>>>> java.io.IOException: java.io.IOException: Cannot open filename
>> >>>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
>> >>>>>>>>> at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>> >>>>>
>> >>>>>>  at java.io.DataInputStream.read(DataInputStream.java:132)
>> >>>>>>>>> at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
>> >>>>>
>> >>>>>>  at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>> >>>>>>>>> at
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
>> >>>>>
>> >>>>>>  at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
>> >>>>>>>>> at
>> >>>>>>>>>
>> >>>>>>>>
>> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>
>> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
>> >>>>>
>> >>>>>>  at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>> >>>>>>>>> at
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >>>>>
>> >>>>>>  at java.lang.reflect.Method.invoke(Method.java:597)
>> >>>>>>>>> at
>> >>>>>>>>>
>> >>>>>>>>
>> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>> >>>>>
>> >>>>>>  at
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> >>>>>
>> >>>>>>   Is there any way to fix this problem? Or is there anything we can
>> >>>>>>>>>
>> >>>>>>>> do
>> >>>>>
>> >>>>>>  even manually to relieve it?
>> >>>>>>>>>  Any suggestion?
>> >>>>>>>>>  Thank you.
>> >>>>>>>>>  LvZheng
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >
>>
>
>

Re: Cannot open filename Exceptions

Posted by Zheng Lv <lv...@gmail.com>.
Hello Stack,
  I must say thank you for your patience too.
  I'm sorry that you had to retry so many times and that the logs you got
were not that useful. I have now turned the logging up to DEBUG level, so if
we get these exceptions again I will send you DEBUG logs. In the meantime, I
have still uploaded the logs you wanted to rapidshare, although they are not
at DEBUG level. The urls:

http://rapidshare.com/files/365292889/hadoop-root-namenode-cactus207.log.2010-03-15.html

http://rapidshare.com/files/365293127/hbase-root-master-cactus207.log.2010-03-15.html

http://rapidshare.com/files/365293238/hbase-root-regionserver-cactus208.log.2010-03-15.html

http://rapidshare.com/files/365293391/hbase-root-regionserver-cactus209.log.2010-03-15.html

http://rapidshare.com/files/365293488/hbase-root-regionserver-cactus210.log.2010-03-15.html

  >For sure you've upped xceivers on your hdfs cluster and you've upped
>the file descriptors as per the 'Getting Started'? (Sorry, have to
>ask).
  Before I got your mail we hadn't set the properties you mentioned,
because we never saw "too many open files" or the other symptoms that are
mentioned in the 'Getting Started' docs. But now I have upped these
properties. We'll see what happens.
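
For anyone hitting the same thing, the settings in question are roughly
the ones below; this is only a sketch, the values are just the commonly
suggested ones from the docs, and the 'hadoop' user name is a placeholder
for whatever account runs the daemons:

  <!-- hdfs-site.xml on every datanode -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>

  # higher open-file limit for the daemon user, e.g. in /etc/security/limits.conf
  hadoop  -  nofile  32768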

  If you need more information, just tell me.

  Thanks again,
  LvZheng.


2010/3/19 Stack <st...@duboce.net>

> Yeah, I had to retry a couple of times ("Too busy; try back later --
> or sign up premium service!").
>
> It would have been nice to have wider log snippets.  I'd like to have
> seen if the issue was double assignment.  The master log snippet only
> shows the split.  Regionserver 209's log is the one where the
> interesting stuff is going on around this time, 2010-03-15
> 16:06:51,150, but its not in the provided set.  Neither are you
> running at DEBUG level so it'd be harder to see what is up even if you
> provided it.
>
> Looking in 208, I see a few exceptions beyond the one you paste below.
>  For sure you've upped xceivers on your hdfs cluster and you've upped
> the file descriptors as per the 'Getting Started'? (Sorry, have to
> ask).
>
> Can I have more of the logs?  Can I have all of the namenode log, all
> of the master log and 209's log?  This rapidshare thing is fine with
> me.  I don't mind retrying.
>
> Sorry it took me a while to get to this.
> St.Ack
>
>
>
>
>
>
>
>
>
>
>
> On Wed, Mar 17, 2010 at 8:32 PM, Zheng Lv <lv...@gmail.com>
> wrote:
> > Hello Stack,
> >    >Sorry. It's taken me a while.  Let try and get to this this evening
> >    Is it downloading the log files what take you a while? I m sorry, I
> used
> > to upload files to skydrive, but now we cant access the website. Is there
> > any netdisk or something you can download fast? I can upload to it.
> >    LvZheng
> > 2010/3/18 Stack <sa...@gmail.com>
> >
> >> Sorry. It's taken me a while.  Let try and get to this this evening
> >>
> >> Thank you for your patience
> >>
> >>
> >>
> >>
> >> On Mar 17, 2010, at 2:29 AM, Zheng Lv <lv...@gmail.com>
> wrote:
> >>
> >> Hello Stack,
> >>>  Did you receive my mail?It looks like you didnt.
> >>>   LvZheng
> >>>
> >>> 2010/3/16 Zheng Lv <lv...@gmail.com>
> >>>
> >>> Hello Stack,
> >>>>  I have uploaded some parts of the logs on master, regionserver208 and
> >>>> regionserver210 to:
> >>>>  http://rapidshare.com/files/363988384/master_207_log.txt.html
> >>>>  http://rapidshare.com/files/363988673/regionserver_208_log.txt.html
> >>>>  http://rapidshare.com/files/363988819/regionserver_210_log.txt.html
> >>>>  I noticed that there are some LeaseExpiredException and "2010-03-15
> >>>> 16:06:32,864 ERROR
> >>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> >>>> Compaction/Split failed for region ..." before 17 oclock. Did these
> lead
> >>>> to
> >>>> the error? Why did these happened? How to avoid these?
> >>>>  Thanks.
> >>>>   LvZheng
> >>>> 2010/3/16 Stack <st...@duboce.net>
> >>>>
> >>>> Maybe just the master log would be sufficient from around this time to
> >>>>> figure the story.
> >>>>> St.Ack
> >>>>>
> >>>>> On Mon, Mar 15, 2010 at 10:04 PM, Stack <st...@duboce.net> wrote:
> >>>>>
> >>>>>> Hey Zheng:
> >>>>>>
> >>>>>> On Mon, Mar 15, 2010 at 8:16 PM, Zheng Lv <
> lvzheng19800619@gmail.com>
> >>>>>>
> >>>>> wrote:
> >>>>>
> >>>>>> Hello Stack,
> >>>>>>> After we got these exceptions, we restart the cluster and restarted
> >>>>>>>
> >>>>>> the
> >>>>>
> >>>>>> job that failed, and the job succeeded.
> >>>>>>> Now when we access
> >>>>>>>
> >>>>>> /hbase/summary/1491233486/metrics/5046821377427277894,
> >>>>>
> >>>>>> we got " Cannot access
> >>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894: No such file
> or
> >>>>>>> directory." .
> >>>>>>>
> >>>>>>>
> >>>>>> So, that would seem to indicate that the reference was in memory
> >>>>>> only.. that file was not in filesystem.  You could have tried
> closing
> >>>>>> that region.   It would have been interesting also to find history
> on
> >>>>>> that region, to try and figure how it came to hold in memory a
> >>>>>> reference to a file since removed.
> >>>>>>
> >>>>>> The messages about this file in namenode logs are in here:
> >>>>>>> http://rapidshare.com/files/363938595/log.txt.html
> >>>>>>>
> >>>>>>
> >>>>>> This is interesting.  Do you have regionserver logs from 209, 208,
> and
> >>>>>> 210 for corresponding times?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> St.Ack
> >>>>>>
> >>>>>> The job failed startted about at 17 o'clock.
> >>>>>>> By the way, the hadoop version we are using is 0.20.1, the hbase
> >>>>>>>
> >>>>>> version
> >>>>>
> >>>>>> we are using is 0.20.3.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> LvZheng
> >>>>>>> 2010/3/16 Stack <st...@duboce.net>
> >>>>>>>
> >>>>>>> Can you get that file from hdfs?
> >>>>>>>>
> >>>>>>>> ./bin/hadoop fs -get
> >>>>>>>>>
> >>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
> >>>>>>>>
> >>>>>>>> Does it look wholesome?  Is it empty?
> >>>>>>>>
> >>>>>>>> What if you trace the life of that file in regionserver logs or
> >>>>>>>> probably better, over in namenode log?  If you move this file
> aside,
> >>>>>>>> the region deploys?
> >>>>>>>>
> >>>>>>>> St.Ack
> >>>>>>>>
> >>>>>>>> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <
> lvzheng19800619@gmail.com
> >>>>>>>> >
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hello Everyone,
> >>>>>>>>>  Recently we often got these in our client logs:
> >>>>>>>>>  org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying
> >>>>>>>>>
> >>>>>>>> to
> >>>>>
> >>>>>>  contact region server 172.16.1.208:60020 for region
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
> >>>>>
> >>>>>>  row
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
> >>>>>
> >>>>>>  but failed after 10 attempts.
> >>>>>>>>> Exceptions:
> >>>>>>>>> java.io.IOException: java.io.IOException: Cannot open filename
> >>>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
> >>>>>>>>> at
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
> >>>>>
> >>>>>>  at
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
> >>>>>
> >>>>>>  at
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
> >>>>>
> >>>>>>  at
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
> >>>>>
> >>>>>>  at java.io.DataInputStream.read(DataInputStream.java:132)
> >>>>>>>>> at
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
> >>>>>
> >>>>>>  at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
> >>>>>>>>> at
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
> >>>>>
> >>>>>>  at
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
> >>>>>
> >>>>>>  at
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
> >>>>>
> >>>>>>  at
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
> >>>>>
> >>>>>>  at
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
> >>>>>
> >>>>>>  at
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
> >>>>>
> >>>>>>  at
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
> >>>>>
> >>>>>>  at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
> >>>>>>>>> at
> >>>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
> >>>>>
> >>>>>>  at
> >>>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
> >>>>>
> >>>>>>  at
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
> >>>>>
> >>>>>>  at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> >>>>>>>>> at
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>>
> >>>>>>  at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>>>>>> at
> >>>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
> >>>>>
> >>>>>>  at
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> >>>>>
> >>>>>>   Is there any way to fix this problem? Or is there anything we can
> >>>>>>>>>
> >>>>>>>> do
> >>>>>
> >>>>>>  even manually to relieve it?
> >>>>>>>>>  Any suggestion?
> >>>>>>>>>  Thank you.
> >>>>>>>>>  LvZheng
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >
>

Re: Cannot open filename Exceptions

Posted by Stack <st...@duboce.net>.
Yeah, I had to retry a couple of times ("Too busy; try back later --
or sign up premium service!").

It would have been nice to have wider log snippets.  I'd like to have
seen if the issue was double assignment.  The master log snippet only
shows the split.  Regionserver 209's log is the one where the
interesting stuff is going on around this time, 2010-03-15
16:06:51,150, but it's not in the provided set.  Nor are you running at
DEBUG level, so it would be hard to see what is up even if you had
provided it.

Looking in 208, I see a few exceptions beyond the one you pasted below.
 For sure you've upped xceivers on your hdfs cluster and you've upped
the file descriptors as per the 'Getting Started'? (Sorry, have to
ask).

Can I have more of the logs?  Can I have all of the namenode log, all
of the master log and 209's log?  This rapidshare thing is fine with
me.  I don't mind retrying.

Sorry it took me a while to get to this.
St.Ack











On Wed, Mar 17, 2010 at 8:32 PM, Zheng Lv <lv...@gmail.com> wrote:
> Hello Stack,
>    >Sorry. It's taken me a while.  Let try and get to this this evening
>    Is it downloading the log files what take you a while? I m sorry, I used
> to upload files to skydrive, but now we cant access the website. Is there
> any netdisk or something you can download fast? I can upload to it.
>    LvZheng
> 2010/3/18 Stack <sa...@gmail.com>
>
>> Sorry. It's taken me a while.  Let try and get to this this evening
>>
>> Thank you for your patience
>>
>>
>>
>>
>> On Mar 17, 2010, at 2:29 AM, Zheng Lv <lv...@gmail.com> wrote:
>>
>> Hello Stack,
>>>  Did you receive my mail?It looks like you didnt.
>>>   LvZheng
>>>
>>> 2010/3/16 Zheng Lv <lv...@gmail.com>
>>>
>>> Hello Stack,
>>>>  I have uploaded some parts of the logs on master, regionserver208 and
>>>> regionserver210 to:
>>>>  http://rapidshare.com/files/363988384/master_207_log.txt.html
>>>>  http://rapidshare.com/files/363988673/regionserver_208_log.txt.html
>>>>  http://rapidshare.com/files/363988819/regionserver_210_log.txt.html
>>>>  I noticed that there are some LeaseExpiredException and "2010-03-15
>>>> 16:06:32,864 ERROR
>>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>>> Compaction/Split failed for region ..." before 17 oclock. Did these lead
>>>> to
>>>> the error? Why did these happened? How to avoid these?
>>>>  Thanks.
>>>>   LvZheng
>>>> 2010/3/16 Stack <st...@duboce.net>
>>>>
>>>> Maybe just the master log would be sufficient from around this time to
>>>>> figure the story.
>>>>> St.Ack
>>>>>
>>>>> On Mon, Mar 15, 2010 at 10:04 PM, Stack <st...@duboce.net> wrote:
>>>>>
>>>>>> Hey Zheng:
>>>>>>
>>>>>> On Mon, Mar 15, 2010 at 8:16 PM, Zheng Lv <lv...@gmail.com>
>>>>>>
>>>>> wrote:
>>>>>
>>>>>> Hello Stack,
>>>>>>> After we got these exceptions, we restart the cluster and restarted
>>>>>>>
>>>>>> the
>>>>>
>>>>>> job that failed, and the job succeeded.
>>>>>>> Now when we access
>>>>>>>
>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894,
>>>>>
>>>>>> we got " Cannot access
>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894: No such file or
>>>>>>> directory." .
>>>>>>>
>>>>>>>
>>>>>> So, that would seem to indicate that the reference was in memory
>>>>>> only.. that file was not in filesystem.  You could have tried closing
>>>>>> that region.   It would have been interesting also to find history on
>>>>>> that region, to try and figure how it came to hold in memory a
>>>>>> reference to a file since removed.
>>>>>>
>>>>>> The messages about this file in namenode logs are in here:
>>>>>>> http://rapidshare.com/files/363938595/log.txt.html
>>>>>>>
>>>>>>
>>>>>> This is interesting.  Do you have regionserver logs from 209, 208, and
>>>>>> 210 for corresponding times?
>>>>>>
>>>>>> Thanks,
>>>>>> St.Ack
>>>>>>
>>>>>> The job failed startted about at 17 o'clock.
>>>>>>> By the way, the hadoop version we are using is 0.20.1, the hbase
>>>>>>>
>>>>>> version
>>>>>
>>>>>> we are using is 0.20.3.
>>>>>>>
>>>>>>> Regards,
>>>>>>> LvZheng
>>>>>>> 2010/3/16 Stack <st...@duboce.net>
>>>>>>>
>>>>>>> Can you get that file from hdfs?
>>>>>>>>
>>>>>>>> ./bin/hadoop fs -get
>>>>>>>>>
>>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
>>>>>>>>
>>>>>>>> Does it look wholesome?  Is it empty?
>>>>>>>>
>>>>>>>> What if you trace the life of that file in regionserver logs or
>>>>>>>> probably better, over in namenode log?  If you move this file aside,
>>>>>>>> the region deploys?
>>>>>>>>
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <lvzheng19800619@gmail.com
>>>>>>>> >
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello Everyone,
>>>>>>>>>  Recently we often got these in our client logs:
>>>>>>>>>  org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying
>>>>>>>>>
>>>>>>>> to
>>>>>
>>>>>>  contact region server 172.16.1.208:60020 for region
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
>>>>>
>>>>>>  row
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
>>>>>
>>>>>>  but failed after 10 attempts.
>>>>>>>>> Exceptions:
>>>>>>>>> java.io.IOException: java.io.IOException: Cannot open filename
>>>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
>>>>>>>>> at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
>>>>>
>>>>>>  at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
>>>>>
>>>>>>  at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
>>>>>
>>>>>>  at
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>>>>>
>>>>>>  at java.io.DataInputStream.read(DataInputStream.java:132)
>>>>>>>>> at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>> org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
>>>>>
>>>>>>  at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>>>>>>>>> at
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
>>>>>
>>>>>>  at
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
>>>>>
>>>>>>  at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
>>>>>
>>>>>>  at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
>>>>>
>>>>>>  at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>> org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
>>>>>
>>>>>>  at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
>>>>>
>>>>>>  at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
>>>>>
>>>>>>  at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
>>>>>>>>> at
>>>>>>>>>
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
>>>>>
>>>>>>  at
>>>>>>>>>
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
>>>>>
>>>>>>  at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
>>>>>
>>>>>>  at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>>>>>>>> at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>
>>>>>>  at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>> at
>>>>>>>>>
>>>>>>>> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>
>>>>>>  at
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>
>>>>>>   Is there any way to fix this problem? Or is there anything we can
>>>>>>>>>
>>>>>>>> do
>>>>>
>>>>>>  even manually to relieve it?
>>>>>>>>>  Any suggestion?
>>>>>>>>>  Thank you.
>>>>>>>>>  LvZheng
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>

Re: Cannot open filename Exceptions

Posted by Zheng Lv <lv...@gmail.com>.
Hello Stack,
    >Sorry. It's taken me a while.  Let try and get to this this evening
    Is it the downloading of the log files that takes you a while? I'm sorry, I
used to upload files to SkyDrive, but now we can't access that website. Is
there a netdisk or similar service you can download from quickly? I can
upload to it.
    LvZheng
2010/3/18 Stack <sa...@gmail.com>

> Sorry. It's taken me a while.  Let try and get to this this evening
>
> Thank you for your patience
>
>
>
>
> On Mar 17, 2010, at 2:29 AM, Zheng Lv <lv...@gmail.com> wrote:
>
> Hello Stack,
>>  Did you receive my mail?It looks like you didnt.
>>   LvZheng
>>
>> 2010/3/16 Zheng Lv <lv...@gmail.com>
>>
>> Hello Stack,
>>>  I have uploaded some parts of the logs on master, regionserver208 and
>>> regionserver210 to:
>>>  http://rapidshare.com/files/363988384/master_207_log.txt.html
>>>  http://rapidshare.com/files/363988673/regionserver_208_log.txt.html
>>>  http://rapidshare.com/files/363988819/regionserver_210_log.txt.html
>>>  I noticed that there are some LeaseExpiredException and "2010-03-15
>>> 16:06:32,864 ERROR
>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>> Compaction/Split failed for region ..." before 17 oclock. Did these lead
>>> to
>>> the error? Why did these happened? How to avoid these?
>>>  Thanks.
>>>   LvZheng
>>> 2010/3/16 Stack <st...@duboce.net>
>>>
>>> Maybe just the master log would be sufficient from around this time to
>>>> figure the story.
>>>> St.Ack
>>>>
>>>> On Mon, Mar 15, 2010 at 10:04 PM, Stack <st...@duboce.net> wrote:
>>>>
>>>>> Hey Zheng:
>>>>>
>>>>> On Mon, Mar 15, 2010 at 8:16 PM, Zheng Lv <lv...@gmail.com>
>>>>>
>>>> wrote:
>>>>
>>>>> Hello Stack,
>>>>>> After we got these exceptions, we restart the cluster and restarted
>>>>>>
>>>>> the
>>>>
>>>>> job that failed, and the job succeeded.
>>>>>> Now when we access
>>>>>>
>>>>> /hbase/summary/1491233486/metrics/5046821377427277894,
>>>>
>>>>> we got " Cannot access
>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894: No such file or
>>>>>> directory." .
>>>>>>
>>>>>>
>>>>> So, that would seem to indicate that the reference was in memory
>>>>> only.. that file was not in filesystem.  You could have tried closing
>>>>> that region.   It would have been interesting also to find history on
>>>>> that region, to try and figure how it came to hold in memory a
>>>>> reference to a file since removed.
>>>>>
>>>>> The messages about this file in namenode logs are in here:
>>>>>> http://rapidshare.com/files/363938595/log.txt.html
>>>>>>
>>>>>
>>>>> This is interesting.  Do you have regionserver logs from 209, 208, and
>>>>> 210 for corresponding times?
>>>>>
>>>>> Thanks,
>>>>> St.Ack
>>>>>
>>>>> The job failed startted about at 17 o'clock.
>>>>>> By the way, the hadoop version we are using is 0.20.1, the hbase
>>>>>>
>>>>> version
>>>>
>>>>> we are using is 0.20.3.
>>>>>>
>>>>>> Regards,
>>>>>> LvZheng
>>>>>> 2010/3/16 Stack <st...@duboce.net>
>>>>>>
>>>>>> Can you get that file from hdfs?
>>>>>>>
>>>>>>> ./bin/hadoop fs -get
>>>>>>>>
>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
>>>>>>>
>>>>>>> Does it look wholesome?  Is it empty?
>>>>>>>
>>>>>>> What if you trace the life of that file in regionserver logs or
>>>>>>> probably better, over in namenode log?  If you move this file aside,
>>>>>>> the region deploys?
>>>>>>>
>>>>>>> St.Ack
>>>>>>>
>>>>>>> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <lvzheng19800619@gmail.com
>>>>>>> >
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello Everyone,
>>>>>>>>  Recently we often got these in our client logs:
>>>>>>>>  org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying
>>>>>>>>
>>>>>>> to
>>>>
>>>>>  contact region server 172.16.1.208:60020 for region
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
>>>>
>>>>>  row
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
>>>>
>>>>>  but failed after 10 attempts.
>>>>>>>> Exceptions:
>>>>>>>> java.io.IOException: java.io.IOException: Cannot open filename
>>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
>>>>>>>> at
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
>>>>
>>>>>  at
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
>>>>
>>>>>  at
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
>>>>
>>>>>  at
>>>>>>>>
>>>>>>>
>>>>>>>
>>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>>>>
>>>>>  at java.io.DataInputStream.read(DataInputStream.java:132)
>>>>>>>> at
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>> org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
>>>>
>>>>>  at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>>>>>>>> at
>>>>>>>>
>>>>>>>
>>>>>>>
>>>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
>>>>
>>>>>  at
>>>>>>>>
>>>>>>>
>>>>>>>
>>>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
>>>>
>>>>>  at
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
>>>>
>>>>>  at
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
>>>>
>>>>>  at
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>> org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
>>>>
>>>>>  at
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
>>>>
>>>>>  at
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
>>>>
>>>>>  at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
>>>>>>>> at
>>>>>>>>
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
>>>>
>>>>>  at
>>>>>>>>
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
>>>>
>>>>>  at
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
>>>>
>>>>>  at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>>>>>>> at
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>
>>>>>  at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>> at
>>>>>>>>
>>>>>>> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>
>>>>>  at
>>>>>>>>
>>>>>>>
>>>>>>>
>>>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>
>>>>>   Is there any way to fix this problem? Or is there anything we can
>>>>>>>>
>>>>>>> do
>>>>
>>>>>  even manually to relieve it?
>>>>>>>>  Any suggestion?
>>>>>>>>  Thank you.
>>>>>>>>  LvZheng
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>

Re: Cannot open filename Exceptions

Posted by Stack <sa...@gmail.com>.
Sorry. It's taken me a while.  Let me try and get to this this evening.

Thank you for your patience



On Mar 17, 2010, at 2:29 AM, Zheng Lv <lv...@gmail.com> wrote:

> Hello Stack,
>  Did you receive my mail?It looks like you didnt.
>    LvZheng
>
> 2010/3/16 Zheng Lv <lv...@gmail.com>
>
>> Hello Stack,
>>  I have uploaded some parts of the logs on master, regionserver208  
>> and
>> regionserver210 to:
>>  http://rapidshare.com/files/363988384/master_207_log.txt.html
>>  http://rapidshare.com/files/363988673/regionserver_208_log.txt.html
>>  http://rapidshare.com/files/363988819/regionserver_210_log.txt.html
>>  I noticed that there are some LeaseExpiredException and "2010-03-15
>> 16:06:32,864 ERROR  
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>> Compaction/Split failed for region ..." before 17 oclock. Did these  
>> lead to
>> the error? Why did these happened? How to avoid these?
>>  Thanks.
>>    LvZheng
>> 2010/3/16 Stack <st...@duboce.net>
>>
>>> Maybe just the master log would be sufficient from around this  
>>> time to
>>> figure the story.
>>> St.Ack
>>>
>>> On Mon, Mar 15, 2010 at 10:04 PM, Stack <st...@duboce.net> wrote:
>>>> Hey Zheng:
>>>>
>>>> On Mon, Mar 15, 2010 at 8:16 PM, Zheng Lv <lvzheng19800619@gmail.com 
>>>> >
>>> wrote:
>>>>> Hello Stack,
>>>>> After we got these exceptions, we restart the cluster and  
>>>>> restarted
>>> the
>>>>> job that failed, and the job succeeded.
>>>>> Now when we access
>>> /hbase/summary/1491233486/metrics/5046821377427277894,
>>>>> we got " Cannot access
>>>>> /hbase/summary/1491233486/metrics/5046821377427277894: No such  
>>>>> file or
>>>>> directory." .
>>>>>
>>>>
>>>> So, that would seem to indicate that the reference was in memory
>>>> only.. that file was not in filesystem.  You could have tried  
>>>> closing
>>>> that region.   It would have been interesting also to find  
>>>> history on
>>>> that region, to try and figure how it came to hold in memory a
>>>> reference to a file since removed.
>>>>
>>>>> The messages about this file in namenode logs are in here:
>>>>> http://rapidshare.com/files/363938595/log.txt.html
>>>>
>>>> This is interesting.  Do you have regionserver logs from 209,  
>>>> 208, and
>>>> 210 for corresponding times?
>>>>
>>>> Thanks,
>>>> St.Ack
>>>>
>>>>> The job failed startted about at 17 o'clock.
>>>>> By the way, the hadoop version we are using is 0.20.1, the hbase
>>> version
>>>>> we are using is 0.20.3.
>>>>>
>>>>> Regards,
>>>>> LvZheng
>>>>> 2010/3/16 Stack <st...@duboce.net>
>>>>>
>>>>>> Can you get that file from hdfs?
>>>>>>
>>>>>>> ./bin/hadoop fs -get
>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
>>>>>>
>>>>>> Does it look wholesome?  Is it empty?
>>>>>>
>>>>>> What if you trace the life of that file in regionserver logs or
>>>>>> probably better, over in namenode log?  If you move this file  
>>>>>> aside,
>>>>>> the region deploys?
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <lvzheng19800619@gmail.com 
>>>>>> >
>>>>>> wrote:
>>>>>>> Hello Everyone,
>>>>>>>   Recently we often got these in our client logs:
>>>>>>>   org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>>>>>>> contact region server 172.16.1.208:60020 for region
>>>>>>> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
>>>>>>> row
>>>>>>> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
>>>>>>> but failed after 10 attempts.
>>>>>>> Exceptions:
>>>>>>> java.io.IOException: java.io.IOException: Cannot open filename
>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
>>>>>>> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
>>>>>>> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
>>>>>>> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
>>>>>>> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>>>>>>> at java.io.DataInputStream.read(DataInputStream.java:132)
>>>>>>> at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
>>>>>>> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>>>>>>> at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
>>>>>>> at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
>>>>>>> at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
>>>>>>> at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
>>>>>>> at org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
>>>>>>> at org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
>>>>>>> at org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
>>>>>>> at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
>>>>>>> at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>> at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>>> at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>>   Is there any way to fix this problem? Or is there anything we can do
>>>>>>> even manually to relieve it?
>>>>>>>   Any suggestion?
>>>>>>>   Thank you.
>>>>>>>   LvZheng
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>

Re: Cannot open filename Exceptions

Posted by Zheng Lv <lv...@gmail.com>.
Hello Stack,
  Did you receive my mail? It looks like you didn't.
    LvZheng

2010/3/16 Zheng Lv <lv...@gmail.com>

> Hello Stack,
>   I have uploaded some parts of the logs on master, regionserver208 and
> regionserver210 to:
>   http://rapidshare.com/files/363988384/master_207_log.txt.html
>   http://rapidshare.com/files/363988673/regionserver_208_log.txt.html
>   http://rapidshare.com/files/363988819/regionserver_210_log.txt.html
>   I noticed that there are some LeaseExpiredException and "2010-03-15
> 16:06:32,864 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> Compaction/Split failed for region ..." before 17 o'clock. Did these lead to
> the error? Why did these happen? How can we avoid them?
>   Thanks.
>     LvZheng
> 2010/3/16 Stack <st...@duboce.net>
>
>> Maybe just the master log would be sufficient from around this time to
>> figure the story.
>> St.Ack
>>
>> On Mon, Mar 15, 2010 at 10:04 PM, Stack <st...@duboce.net> wrote:
>> > Hey Zheng:
>> >
>> > On Mon, Mar 15, 2010 at 8:16 PM, Zheng Lv <lv...@gmail.com>
>> wrote:
>> >> Hello Stack,
>> >>  After we got these exceptions, we restarted the cluster and restarted the
>> >> job that failed, and the job succeeded.
>> >>  Now when we access
>> /hbase/summary/1491233486/metrics/5046821377427277894,
>> >> we got " Cannot access
>> >> /hbase/summary/1491233486/metrics/5046821377427277894: No such file or
>> >> directory." .
>> >>
>> >
>> > So, that would seem to indicate that the reference was in memory
>> > only... that file was not in the filesystem.  You could have tried closing
>> > that region.  It would also have been interesting to find the history on
>> > that region, to try and figure out how it came to hold in memory a
>> > reference to a file that had since been removed.
>> >
>> >>  The messages about this file in namenode logs are in here:
>> >> http://rapidshare.com/files/363938595/log.txt.html
>> >
>> > This is interesting.  Do you have regionserver logs from 209, 208, and
>> > 210 for corresponding times?
>> >
>> > Thanks,
>> > St.Ack
>> >
>> >>  The job that failed started at about 17 o'clock.
>> >>  By the way, the hadoop version we are using is 0.20.1, the hbase
>> version
>> >> we are using is 0.20.3.
>> >>
>> >>  Regards,
>> >>  LvZheng
>> >> 2010/3/16 Stack <st...@duboce.net>
>> >>
>> >>> Can you get that file from hdfs?
>> >>>
>> >>> > ./bin/hadoop fs -get
>> >>>  /hbase/summary/1491233486/metrics/5046821377427277894
>> >>>
>> >>> Does it look wholesome?  Is it empty?
>> >>>
>> >>> What if you trace the life of that file in regionserver logs or
>> >>> probably better, over in namenode log?  If you move this file aside,
>> >>> the region deploys?
>> >>>
>> >>> St.Ack
>> >>>
>> >>> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <lv...@gmail.com>
>> >>> wrote:
>> >>> > Hello Everyone,
>> >>> >    Recently we often got these in our client logs:
>> >>> >    org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying
>> to
>> >>> > contact region server 172.16.1.208:60020 for region
>> >>> >
>> >>>
>> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
>> >>> > row
>> >>> >
>> >>>
>> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
>> >>> > but failed after 10 attempts.
>> >>> > Exceptions:
>> >>> > java.io.IOException: java.io.IOException: Cannot open filename
>> >>> > /hbase/summary/1491233486/metrics/5046821377427277894
>> >>> > at
>> >>> >
>> >>>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
>> >>> > at
>> >>> >
>> >>>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
>> >>> > at
>> >>> >
>> >>>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
>> >>> > at
>> >>>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>> >>> > at java.io.DataInputStream.read(DataInputStream.java:132)
>> >>> > at
>> >>> >
>> >>>
>> org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
>> >>> > at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>> >>> > at
>> >>>
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
>> >>> > at
>> >>>
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
>> >>> > at
>> >>> >
>> >>>
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
>> >>> > at
>> >>> >
>> >>>
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
>> >>> > at
>> >>> >
>> >>>
>> org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
>> >>> > at
>> >>> >
>> >>>
>> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
>> >>> > at
>> >>> >
>> >>>
>> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
>> >>> > at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
>> >>> > at
>> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
>> >>> > at
>> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
>> >>> > at
>> >>> >
>> >>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
>> >>> > at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>> >>> > at
>> >>> >
>> >>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >>> > at java.lang.reflect.Method.invoke(Method.java:597)
>> >>> > at
>> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>> >>> > at
>> >>>
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> >>> >    Is there any way to fix this problem? Or is there anything we can
>> do
>> >>> > even manually to relieve it?
>> >>> >    Any suggestion?
>> >>> >    Thank you.
>> >>> >    LvZheng
>> >>> >
>> >>>
>> >>
>> >
>>
>
>

Re: Cannot open filename Exceptions

Posted by Zheng Lv <lv...@gmail.com>.
Hello Stack,
  I have uploaded some parts of the logs on master, regionserver208 and
regionserver210 to:
  http://rapidshare.com/files/363988384/master_207_log.txt.html
  http://rapidshare.com/files/363988673/regionserver_208_log.txt.html
  http://rapidshare.com/files/363988819/regionserver_210_log.txt.html
  I noticed that there are some LeaseExpiredException and "2010-03-15
16:06:32,864 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread:
Compaction/Split failed for region ..." before 17 o'clock. Did these lead to
the error? Why did these happen? How can we avoid them?
  Thanks.
  LvZheng
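
One way to line those errors up with the missing store file is to grep the logs for the
window before the failure. A rough sketch, assuming the usual hbase-*-regionserver-*.log
and hadoop-*-namenode-*.log naming; the paths and the 15:00-16:59 window are only examples:

  # regionserver side: failed compactions/splits and any mention of the store file
  grep -h '2010-03-15 1[56]:' /path/to/hbase/logs/hbase-*-regionserver-*.log \
    | grep -E 'LeaseExpiredException|CompactSplitThread|5046821377427277894'

  # namenode side: when the file was created, opened and deleted
  grep '5046821377427277894' /path/to/hadoop/logs/hadoop-*-namenode-*.log
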
2010/3/16 Stack <st...@duboce.net>

> Maybe just the master log would be sufficient from around this time to
> figure the story.
> St.Ack
>
> On Mon, Mar 15, 2010 at 10:04 PM, Stack <st...@duboce.net> wrote:
> > Hey Zheng:
> >
> > On Mon, Mar 15, 2010 at 8:16 PM, Zheng Lv <lv...@gmail.com>
> wrote:
> >> Hello Stack,
> >>  After we got these exceptions, we restarted the cluster and restarted the
> >> job that failed, and the job succeeded.
> >>  Now when we access
> /hbase/summary/1491233486/metrics/5046821377427277894,
> >> we got " Cannot access
> >> /hbase/summary/1491233486/metrics/5046821377427277894: No such file or
> >> directory." .
> >>
> >
> > So, that would seem to indicate that the reference was in memory
> > only... that file was not in the filesystem.  You could have tried closing
> > that region.  It would also have been interesting to find the history on
> > that region, to try and figure out how it came to hold in memory a
> > reference to a file that had since been removed.
> >
> >>  The messages about this file in namenode logs are in here:
> >> http://rapidshare.com/files/363938595/log.txt.html
> >
> > This is interesting.  Do you have regionserver logs from 209, 208, and
> > 210 for corresponding times?
> >
> > Thanks,
> > St.Ack
> >
> >>  The job that failed started at about 17 o'clock.
> >>  By the way, the hadoop version we are using is 0.20.1, the hbase
> version
> >> we are using is 0.20.3.
> >>
> >>  Regards,
> >>  LvZheng
> >> 2010/3/16 Stack <st...@duboce.net>
> >>
> >>> Can you get that file from hdfs?
> >>>
> >>> > ./bin/hadoop fs -get
> >>>  /hbase/summary/1491233486/metrics/5046821377427277894
> >>>
> >>> Does it look wholesome?  Is it empty?
> >>>
> >>> What if you trace the life of that file in regionserver logs or
> >>> probably better, over in namenode log?  If you move this file aside,
> >>> the region deploys?
> >>>
> >>> St.Ack
> >>>
> >>> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <lv...@gmail.com>
> >>> wrote:
> >>> > Hello Everyone,
> >>> >    Recently we often got these in our client logs:
> >>> >    org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying
> to
> >>> > contact region server 172.16.1.208:60020 for region
> >>> >
> >>>
> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
> >>> > row
> >>> >
> >>>
> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
> >>> > but failed after 10 attempts.
> >>> > Exceptions:
> >>> > java.io.IOException: java.io.IOException: Cannot open filename
> >>> > /hbase/summary/1491233486/metrics/5046821377427277894
> >>> > at
> >>> >
> >>>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
> >>> > at
> >>> >
> >>>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
> >>> > at
> >>> >
> >>>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
> >>> > at
> >>>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
> >>> > at java.io.DataInputStream.read(DataInputStream.java:132)
> >>> > at
> >>> >
> >>>
> org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
> >>> > at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
> >>> > at
> >>>
> org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
> >>> > at
> >>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
> >>> > at
> >>> >
> >>>
> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
> >>> > at
> >>> >
> >>>
> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
> >>> > at
> >>> >
> >>>
> org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
> >>> > at
> >>> >
> >>>
> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
> >>> > at
> >>> >
> >>>
> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
> >>> > at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
> >>> > at
> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
> >>> > at
> org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
> >>> > at
> >>> >
> >>>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
> >>> > at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> >>> > at
> >>> >
> >>>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>> > at java.lang.reflect.Method.invoke(Method.java:597)
> >>> > at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
> >>> > at
> >>>
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> >>> >    Is there any way to fix this problem? Or is there anything we can
> do
> >>> > even manually to relieve it?
> >>> >    Any suggestion?
> >>> >    Thank you.
> >>> >    LvZheng
> >>> >
> >>>
> >>
> >
>

Re: Cannot open filename Exceptions

Posted by Stack <st...@duboce.net>.
Maybe just the master log would be sufficient from around this time to
figure the story.
St.Ack
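
If it is easier to share only the relevant slice, the master log can be trimmed to the
hours around the failure with a plain grep; a sketch, assuming the default
hbase-<user>-master-<host>.log naming (adjust the path and hour range to your setup):

  grep '2010-03-15 1[4-7]:' /path/to/hbase/logs/hbase-*-master-*.log > master_20100315_14-17.txt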

On Mon, Mar 15, 2010 at 10:04 PM, Stack <st...@duboce.net> wrote:
> Hey Zheng:
>
> On Mon, Mar 15, 2010 at 8:16 PM, Zheng Lv <lv...@gmail.com> wrote:
>> Hello Stack,
>>  After we got these exceptions, we restarted the cluster and restarted the
>> job that failed, and the job succeeded.
>>  Now when we access /hbase/summary/1491233486/metrics/5046821377427277894,
>> we got " Cannot access
>> /hbase/summary/1491233486/metrics/5046821377427277894: No such file or
>> directory." .
>>
>
> So, that would seem to indicate that the reference was in memory
> only... that file was not in the filesystem.  You could have tried closing
> that region.  It would also have been interesting to find the history on
> that region, to try and figure out how it came to hold in memory a
> reference to a file that had since been removed.
>
>>  The messages about this file in namenode logs are in here:
>> http://rapidshare.com/files/363938595/log.txt.html
>
> This is interesting.  Do you have regionserver logs from 209, 208, and
> 210 for corresponding times?
>
> Thanks,
> St.Ack
>
>>  The job that failed started at about 17 o'clock.
>>  By the way, the hadoop version we are using is 0.20.1, the hbase version
>> we are using is 0.20.3.
>>
>>  Regards,
>>  LvZheng
>> 2010/3/16 Stack <st...@duboce.net>
>>
>>> Can you get that file from hdfs?
>>>
>>> > ./bin/hadoop fs -get
>>>  /hbase/summary/1491233486/metrics/5046821377427277894
>>>
>>> Does it look wholesome?  Is it empty?
>>>
>>> What if you trace the life of that file in regionserver logs or
>>> probably better, over in namenode log?  If you move this file aside,
>>> the region deploys?
>>>
>>> St.Ack
>>>
>>> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <lv...@gmail.com>
>>> wrote:
>>> > Hello Everyone,
>>> >    Recently we often got these in our client logs:
>>> >    org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>>> > contact region server 172.16.1.208:60020 for region
>>> >
>>> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
>>> > row
>>> >
>>> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
>>> > but failed after 10 attempts.
>>> > Exceptions:
>>> > java.io.IOException: java.io.IOException: Cannot open filename
>>> > /hbase/summary/1491233486/metrics/5046821377427277894
>>> > at
>>> >
>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
>>> > at
>>> >
>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
>>> > at
>>> >
>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
>>> > at
>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>>> > at java.io.DataInputStream.read(DataInputStream.java:132)
>>> > at
>>> >
>>> org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
>>> > at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>>> > at
>>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
>>> > at
>>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
>>> > at
>>> >
>>> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
>>> > at
>>> >
>>> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
>>> > at
>>> >
>>> org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
>>> > at
>>> >
>>> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
>>> > at
>>> >
>>> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
>>> > at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
>>> > at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
>>> > at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
>>> > at
>>> >
>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
>>> > at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>> > at
>>> >
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> > at java.lang.reflect.Method.invoke(Method.java:597)
>>> > at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>> > at
>>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>> >    Is there any way to fix this problem? Or is there anything we can do
>>> > even manually to relieve it?
>>> >    Any suggestion?
>>> >    Thank you.
>>> >    LvZheng
>>> >
>>>
>>
>

Re: Cannot open filename Exceptions

Posted by Stack <st...@duboce.net>.
Hey Zheng:

On Mon, Mar 15, 2010 at 8:16 PM, Zheng Lv <lv...@gmail.com> wrote:
> Hello Stack,
>  After we got these exceptions, we restarted the cluster and restarted the
> job that failed, and the job succeeded.
>  Now when we access /hbase/summary/1491233486/metrics/5046821377427277894,
> we got " Cannot access
> /hbase/summary/1491233486/metrics/5046821377427277894: No such file or
> directory." .
>

So, that would seem to indicate that the reference was in memory
only... that file was not in the filesystem.  You could have tried closing
that region.  It would also have been interesting to find the history on
that region, to try and figure out how it came to hold in memory a
reference to a file that had since been removed.
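
Closing the region can be done from the hbase shell, and .META. shows what the cluster
currently records for it. A rough sketch, not checked against 0.20.3 exactly, where
REGIONNAME stands for the full summary,...,1268640385017 region name from the exception:

  hbase> scan '.META.', {STARTROW => 'summary,SITE_0000000032', LIMIT => 5}
  hbase> close_region 'REGIONNAME'

The scan shows which server .META. thinks is carrying the region; close_region asks for
the region to be closed and redeployed without restarting the whole cluster. The region's
longer history (opens, splits, compactions) would still have to come out of the master
and regionserver logs.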

>  The messages about this file in namenode logs are in here:
> http://rapidshare.com/files/363938595/log.txt.html

This is interesting.  Do you have regionserver logs from 209, 208, and
210 for corresponding times?

Thanks,
St.Ack

>  The job that failed started at about 17 o'clock.
>  By the way, the hadoop version we are using is 0.20.1, the hbase version
> we are using is 0.20.3.
>
>  Regards,
>  LvZheng
> 2010/3/16 Stack <st...@duboce.net>
>
>> Can you get that file from hdfs?
>>
>> > ./bin/hadoop fs -get
>>  /hbase/summary/1491233486/metrics/5046821377427277894
>>
>> Does it look wholesome?  Is it empty?
>>
>> What if you trace the life of that file in regionserver logs or
>> probably better, over in namenode log?  If you move this file aside,
>> the region deploys?
>>
>> St.Ack
>>
>> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <lv...@gmail.com>
>> wrote:
>> > Hello Everyone,
>> >    Recently we often got these in our client logs:
>> >    org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>> > contact region server 172.16.1.208:60020 for region
>> >
>> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
>> > row
>> >
>> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
>> > but failed after 10 attempts.
>> > Exceptions:
>> > java.io.IOException: java.io.IOException: Cannot open filename
>> > /hbase/summary/1491233486/metrics/5046821377427277894
>> > at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
>> > at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
>> > at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
>> > at
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>> > at java.io.DataInputStream.read(DataInputStream.java:132)
>> > at
>> >
>> org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
>> > at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>> > at
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
>> > at
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
>> > at
>> >
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
>> > at
>> >
>> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
>> > at
>> >
>> org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
>> > at
>> >
>> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
>> > at
>> >
>> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
>> > at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
>> > at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
>> > at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
>> > at
>> >
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
>> > at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>> > at
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> > at java.lang.reflect.Method.invoke(Method.java:597)
>> > at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>> > at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> >    Is there any way to fix this problem? Or is there anything we can do
>> > even manually to relieve it?
>> >    Any suggestion?
>> >    Thank you.
>> >    LvZheng
>> >
>>
>

Re: Cannot open filename Exceptions

Posted by Zheng Lv <lv...@gmail.com>.
Hello Stack,
  After we got these exceptions, we restarted the cluster and restarted the
job that failed, and the job succeeded.
  Now when we access /hbase/summary/1491233486/metrics/5046821377427277894,
we got " Cannot access
/hbase/summary/1491233486/metrics/5046821377427277894: No such file or
directory." .

  The messages about this file in namenode logs are in here:
http://rapidshare.com/files/363938595/log.txt.html
  The job that failed started at about 17 o'clock.
  By the way, the hadoop version we are using is 0.20.1, the hbase version
we are using is 0.20.3.

  Regards,
  LvZheng
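
A quick check at this point is to list what that region still has on disk, to compare
against the store files the regionserver thought it held; a sketch using the stock fs
shell:

  ./bin/hadoop fs -lsr /hbase/summary/1491233486
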
2010/3/16 Stack <st...@duboce.net>

> Can you get that file from hdfs?
>
> > ./bin/hadoop fs -get
>  /hbase/summary/1491233486/metrics/5046821377427277894
>
> Does it look wholesome?  Is it empty?
>
> What if you trace the life of that file in regionserver logs or
> probably better, over in namenode log?  If you move this file aside,
> the region deploys?
>
> St.Ack
>
> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <lv...@gmail.com>
> wrote:
> > Hello Everyone,
> >    Recently we often got these in our client logs:
> >    org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> > contact region server 172.16.1.208:60020 for region
> >
> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
> > row
> >
> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
> > but failed after 10 attempts.
> > Exceptions:
> > java.io.IOException: java.io.IOException: Cannot open filename
> > /hbase/summary/1491233486/metrics/5046821377427277894
> > at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
> > at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
> > at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
> > at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
> > at java.io.DataInputStream.read(DataInputStream.java:132)
> > at
> >
> org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
> > at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
> > at
> org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
> > at
> org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
> > at
> >
> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
> > at
> >
> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
> > at
> >
> org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
> > at
> >
> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
> > at
> >
> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
> > at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
> > at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
> > at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
> > at
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
> > at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > at java.lang.reflect.Method.invoke(Method.java:597)
> > at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
> > at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> >    Is there any way to fix this problem? Or is there anything we can do
> > even manually to relieve it?
> >    Any suggestion?
> >    Thank you.
> >    LvZheng
> >
>

Re: Cannot open filename Exceptions

Posted by Stack <st...@duboce.net>.
Can you get that file from hdfs?

> ./bin/hadoop fs -get  /hbase/summary/1491233486/metrics/5046821377427277894

Does it look wholesome?  Is it empty?

What if you trace the life of that file in regionserver logs or
probably better, over in namenode log?  If you move this file aside,
the region deploys?

St.Ack
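
One concrete way to run those checks is with the stock fs shell; in the sketch below the
local copy path and the /moved_aside directory are only examples:

  # does the namenode still list the file, and what size does it report?
  ./bin/hadoop fs -ls /hbase/summary/1491233486/metrics/5046821377427277894

  # pull it out and inspect the local copy
  ./bin/hadoop fs -get /hbase/summary/1491233486/metrics/5046821377427277894 /tmp/5046821377427277894
  ls -l /tmp/5046821377427277894

  # trace its life in the namenode log
  grep 5046821377427277894 /path/to/hadoop/logs/hadoop-*-namenode-*.log

  # move it aside, out of the region's family directory, and see if the region deploys
  ./bin/hadoop fs -mkdir /moved_aside
  ./bin/hadoop fs -mv /hbase/summary/1491233486/metrics/5046821377427277894 /moved_aside/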

On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <lv...@gmail.com> wrote:
> Hello Everyone,
>    Recently we often got these in our client logs:
>    org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> contact region server 172.16.1.208:60020 for region
> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
> row
> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
> but failed after 10 attempts.
> Exceptions:
> java.io.IOException: java.io.IOException: Cannot open filename
> /hbase/summary/1491233486/metrics/5046821377427277894
> at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
> at java.io.DataInputStream.read(DataInputStream.java:132)
> at
> org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
> at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
> at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
> at
> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
> at
> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
> at
> org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
> at
> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
> at
> org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
> at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
> at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
> at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
> at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
> at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>    Is there any way to fix this problem? Or is there anything we can do
> even manually to relieve it?
>    Any suggestion?
>    Thank you.
>    LvZheng
>