Posted to user@hbase.apache.org by Cyril Scetbon <cy...@free.fr> on 2012/07/05 09:45:20 UTC

distributed log splitting aborted

Hi,

I can no longer start my cluster correctly and I get messages like http://pastebin.com/T56wrJxE (taken on one region server).

I suppose HBase is not designed to be stopped entirely, only to tolerate some nodes going down? HDFS is not complaining; it's only HBase that can't start correctly :(

I suppose some data has not been flushed, which is not really important to me. Is there a way to fix these errors even if I lose data?
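
A minimal sketch of how to look at the write-ahead logs those errors refer to (the /hbase/.logs layout is the one visible in the pastebin; the move is only an option if losing the unflushed edits is acceptable, and the destination path is hypothetical):

  # list the -splitting WAL directories left behind by the dead region servers
  hdfs dfs -ls /hbase/.logs

  # optional, at the cost of losing those edits: sideline a stuck WAL directory
  hdfs dfs -mkdir /hbase-sidelined-wals
  hdfs dfs -mv '/hbase/.logs/hb-d12,60020,1341429679981-splitting' /hbase-sidelined-wals/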

thanks

Cyril SCETBON


Re: distributed log splitting aborted

Posted by Cyril Scetbon <cy...@free.fr>.
A network issue? That's weird, because reads/writes are working well and not raising errors (I'll double-check it).
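
A minimal sketch of checks that could confirm or rule out the network hypothesis, assuming the default DataNode data-transfer port (50010) and the hostnames/ports that appear in the logs above:

  # from a region server, check name resolution and reachability of a datanode
  getent hosts hb-d12
  nc -vz hb-d12 50010    # DataNode data-transfer port (default)
  nc -vz hb-zk1 54310    # NameNode RPC port, as seen in the hdfs:// URL above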

Regards
Cyril SCETBON

On Jul 9, 2012, at 10:55 PM, Jean-Daniel Cryans wrote:

> We've been running with distributed splitting here for >6 months and
> never had this issue. Also the exceptions you are seeing come from
> HDFS and not HBase, the fact that it worked from the master and not
> the region servers seem to point to a network configuration issue
> because the actual splitting code is really the same.
> 
> J-D
> 
> On Sun, Jul 8, 2012 at 2:25 PM, Cyril Scetbon <cy...@free.fr> wrote:
>> I've finally succeeded in starting my cluster by disabling hbase.master.distributed.log.splitting
>> 
>> it took less than 10 minutes to start it compared to the whole night without any success with distributed log splitting enabled. Don't you think like me that it's just buggy ??
>> 
>> thanks
>> 
>> Cyril SCETBON
>> 
>> On Jul 6, 2012, at 8:40 PM, Cyril Scetbon wrote:
>> 
>>> As you can see in the master log, region servers are in charge of splitting log files (not found I suppose) and it's retried several times (I didn't check if it's always redone)  on different region servers. You can for example follow a failing split concerning a file not found in the hadoop filesystem :
>>> 
>>> http://pastebin.com/RbcLdbcs
>>> 
>>> Regards
>>> 
>>> Cyril SCETBON
>>> 
>>> On Jul 6, 2012, at 8:17 PM, Cyril Scetbon wrote:
>>> 
>>>> Here are the log files you asked for :
>>>> 
>>>> http://pastebin.com/xRBuQdNS  <---- hbase-master.log
>>>> 
>>>> http://pastebin.com/u6WYQT6R <---- hdfs-namenode.log
>>>> 
>>>> If you find the fix to this damn issue I'll enjoy !
>>>> 
>>>> Thanks
>>>> 
>>>> Cyril SCETBON
>>>> 
>>>> On Jul 5, 2012, at 11:44 PM, Jean-Daniel Cryans wrote:
>>>> 
>>>>> Interesting... Can you read the file? Try a "hadoop dfs -cat" on it
>>>>> and see if it goes to the end of it.
>>>>> 
>>>>> It could also be useful to see a bigger portion of the master log, for
>>>>> all I know maybe it handles it somehow and there's a problem
>>>>> elsewhere.
>>>>> 
>>>>> Finally, which Hadoop version are you using?
>>>>> 
>>>>> Thx,
>>>>> 
>>>>> J-D
>>>>> 
>>>>> On Thu, Jul 5, 2012 at 1:58 PM, Cyril Scetbon <cy...@free.fr> wrote:
>>>>>> yes :
>>>>>> 
>>>>>> /hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.134143064971
>>>>>> 
>>>>>> I did a fsck and here is the report :
>>>>>> 
>>>>>> Status: HEALTHY
>>>>>> Total size:    618827621255 B (Total open files size: 868 B)
>>>>>> Total dirs:    4801
>>>>>> Total files:   2825 (Files currently being written: 42)
>>>>>> Total blocks (validated):      11479 (avg. block size 53909541 B) (Total open file blocks (not validated): 41)
>>>>>> Minimally replicated blocks:   11479 (100.0 %)
>>>>>> Over-replicated blocks:        1 (0.008711561 %)
>>>>>> Under-replicated blocks:       0 (0.0 %)
>>>>>> Mis-replicated blocks:         0 (0.0 %)
>>>>>> Default replication factor:    4
>>>>>> Average block replication:     4.0000873
>>>>>> Corrupt blocks:                0
>>>>>> Missing replicas:              0 (0.0 %)
>>>>>> Number of data-nodes:          12
>>>>>> Number of racks:               1
>>>>>> FSCK ended at Thu Jul 05 20:56:35 UTC 2012 in 795 milliseconds
>>>>>> 
>>>>>> 
>>>>>> The filesystem under path '/hbase' is HEALTHY
>>>>>> 
>>>>>> Cyril SCETBON
>>>>>> 
>>>>>> Cyril SCETBON
>>>>>> 
>>>>>> On Jul 5, 2012, at 7:59 PM, Jean-Daniel Cryans wrote:
>>>>>> 
>>>>>>> Does this file really exist in HDFS?
>>>>>>> 
>>>>>>> hdfs://hb-zk1:54310/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.1341430649711
>>>>>>> 
>>>>>>> If so, did you run fsck in HDFS?
>>>>>>> 
>>>>>>> It would be weird if HDFS doesn't report anything bad but somehow the
>>>>>>> clients (like HBase) can't read it.
>>>>>>> 
>>>>>>> J-D
>>>>>>> 
>>>>>>> On Thu, Jul 5, 2012 at 12:45 AM, Cyril Scetbon <cy...@free.fr> wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I can nolonger start my cluster correctly and get messages like http://pastebin.com/T56wrJxE (taken on one region server)
>>>>>>>> 
>>>>>>>> I suppose Hbase is not done for being stopped but only for having some nodes going down ??? HDFS is not complaining, it's only HBase that can't start correctly :(
>>>>>>>> 
>>>>>>>> I suppose some data has not been flushed and it's not really important for me. Is there a way to fix theses errors even if I will lose data ?
>>>>>>>> 
>>>>>>>> thanks
>>>>>>>> 
>>>>>>>> Cyril SCETBON
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 


Re: distributed log splitting aborted

Posted by Jean-Daniel Cryans <jd...@apache.org>.
We've been running with distributed splitting here for >6 months and
never had this issue. Also, the exceptions you are seeing come from
HDFS, not HBase, and the fact that it worked from the master but not
from the region servers seems to point to a network configuration
issue, because the actual splitting code is really the same.

J-D

On Sun, Jul 8, 2012 at 2:25 PM, Cyril Scetbon <cy...@free.fr> wrote:
> I've finally succeeded in starting my cluster by disabling hbase.master.distributed.log.splitting
>
> it took less than 10 minutes to start it compared to the whole night without any success with distributed log splitting enabled. Don't you think like me that it's just buggy ??
>
> thanks
>
> Cyril SCETBON
>
> On Jul 6, 2012, at 8:40 PM, Cyril Scetbon wrote:
>
>> As you can see in the master log, region servers are in charge of splitting log files (not found I suppose) and it's retried several times (I didn't check if it's always redone)  on different region servers. You can for example follow a failing split concerning a file not found in the hadoop filesystem :
>>
>> http://pastebin.com/RbcLdbcs
>>
>> Regards
>>
>> Cyril SCETBON
>>
>> On Jul 6, 2012, at 8:17 PM, Cyril Scetbon wrote:
>>
>>> Here are the log files you asked for :
>>>
>>> http://pastebin.com/xRBuQdNS  <---- hbase-master.log
>>>
>>> http://pastebin.com/u6WYQT6R <---- hdfs-namenode.log
>>>
>>> If you find the fix to this damn issue I'll enjoy !
>>>
>>> Thanks
>>>
>>> Cyril SCETBON
>>>
>>> On Jul 5, 2012, at 11:44 PM, Jean-Daniel Cryans wrote:
>>>
>>>> Interesting... Can you read the file? Try a "hadoop dfs -cat" on it
>>>> and see if it goes to the end of it.
>>>>
>>>> It could also be useful to see a bigger portion of the master log, for
>>>> all I know maybe it handles it somehow and there's a problem
>>>> elsewhere.
>>>>
>>>> Finally, which Hadoop version are you using?
>>>>
>>>> Thx,
>>>>
>>>> J-D
>>>>
>>>> On Thu, Jul 5, 2012 at 1:58 PM, Cyril Scetbon <cy...@free.fr> wrote:
>>>>> yes :
>>>>>
>>>>> /hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.134143064971
>>>>>
>>>>> I did a fsck and here is the report :
>>>>>
>>>>> Status: HEALTHY
>>>>> Total size:    618827621255 B (Total open files size: 868 B)
>>>>> Total dirs:    4801
>>>>> Total files:   2825 (Files currently being written: 42)
>>>>> Total blocks (validated):      11479 (avg. block size 53909541 B) (Total open file blocks (not validated): 41)
>>>>> Minimally replicated blocks:   11479 (100.0 %)
>>>>> Over-replicated blocks:        1 (0.008711561 %)
>>>>> Under-replicated blocks:       0 (0.0 %)
>>>>> Mis-replicated blocks:         0 (0.0 %)
>>>>> Default replication factor:    4
>>>>> Average block replication:     4.0000873
>>>>> Corrupt blocks:                0
>>>>> Missing replicas:              0 (0.0 %)
>>>>> Number of data-nodes:          12
>>>>> Number of racks:               1
>>>>> FSCK ended at Thu Jul 05 20:56:35 UTC 2012 in 795 milliseconds
>>>>>
>>>>>
>>>>> The filesystem under path '/hbase' is HEALTHY
>>>>>
>>>>> Cyril SCETBON
>>>>>
>>>>> Cyril SCETBON
>>>>>
>>>>> On Jul 5, 2012, at 7:59 PM, Jean-Daniel Cryans wrote:
>>>>>
>>>>>> Does this file really exist in HDFS?
>>>>>>
>>>>>> hdfs://hb-zk1:54310/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.1341430649711
>>>>>>
>>>>>> If so, did you run fsck in HDFS?
>>>>>>
>>>>>> It would be weird if HDFS doesn't report anything bad but somehow the
>>>>>> clients (like HBase) can't read it.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Thu, Jul 5, 2012 at 12:45 AM, Cyril Scetbon <cy...@free.fr> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I can nolonger start my cluster correctly and get messages like http://pastebin.com/T56wrJxE (taken on one region server)
>>>>>>>
>>>>>>> I suppose Hbase is not done for being stopped but only for having some nodes going down ??? HDFS is not complaining, it's only HBase that can't start correctly :(
>>>>>>>
>>>>>>> I suppose some data has not been flushed and it's not really important for me. Is there a way to fix theses errors even if I will lose data ?
>>>>>>>
>>>>>>> thanks
>>>>>>>
>>>>>>> Cyril SCETBON
>>>>>>>
>>>>>
>>>
>>
>

Re: distributed log splitting aborted

Posted by Cyril Scetbon <cy...@free.fr>.
I've finally succeeded in starting my cluster by disabling hbase.master.distributed.log.splitting.

It took less than 10 minutes to start, compared to a whole night without any success with distributed log splitting enabled. Don't you agree that it's just buggy?
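
For reference, a minimal sketch of that setting as it would appear in hbase-site.xml on the master (it defaults to true in 0.92, and the master has to be restarted for it to take effect):

  <property>
    <name>hbase.master.distributed.log.splitting</name>
    <value>false</value>
  </property>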

thanks

Cyril SCETBON

On Jul 6, 2012, at 8:40 PM, Cyril Scetbon wrote:

> As you can see in the master log, region servers are in charge of splitting log files (not found I suppose) and it's retried several times (I didn't check if it's always redone)  on different region servers. You can for example follow a failing split concerning a file not found in the hadoop filesystem :
> 
> http://pastebin.com/RbcLdbcs
> 
> Regards
> 
> Cyril SCETBON
> 
> On Jul 6, 2012, at 8:17 PM, Cyril Scetbon wrote:
> 
>> Here are the log files you asked for :
>> 
>> http://pastebin.com/xRBuQdNS  <---- hbase-master.log
>> 
>> http://pastebin.com/u6WYQT6R <---- hdfs-namenode.log
>> 
>> If you find the fix to this damn issue I'll enjoy !
>> 
>> Thanks
>> 
>> Cyril SCETBON
>> 
>> On Jul 5, 2012, at 11:44 PM, Jean-Daniel Cryans wrote:
>> 
>>> Interesting... Can you read the file? Try a "hadoop dfs -cat" on it
>>> and see if it goes to the end of it.
>>> 
>>> It could also be useful to see a bigger portion of the master log, for
>>> all I know maybe it handles it somehow and there's a problem
>>> elsewhere.
>>> 
>>> Finally, which Hadoop version are you using?
>>> 
>>> Thx,
>>> 
>>> J-D
>>> 
>>> On Thu, Jul 5, 2012 at 1:58 PM, Cyril Scetbon <cy...@free.fr> wrote:
>>>> yes :
>>>> 
>>>> /hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.134143064971
>>>> 
>>>> I did a fsck and here is the report :
>>>> 
>>>> Status: HEALTHY
>>>> Total size:    618827621255 B (Total open files size: 868 B)
>>>> Total dirs:    4801
>>>> Total files:   2825 (Files currently being written: 42)
>>>> Total blocks (validated):      11479 (avg. block size 53909541 B) (Total open file blocks (not validated): 41)
>>>> Minimally replicated blocks:   11479 (100.0 %)
>>>> Over-replicated blocks:        1 (0.008711561 %)
>>>> Under-replicated blocks:       0 (0.0 %)
>>>> Mis-replicated blocks:         0 (0.0 %)
>>>> Default replication factor:    4
>>>> Average block replication:     4.0000873
>>>> Corrupt blocks:                0
>>>> Missing replicas:              0 (0.0 %)
>>>> Number of data-nodes:          12
>>>> Number of racks:               1
>>>> FSCK ended at Thu Jul 05 20:56:35 UTC 2012 in 795 milliseconds
>>>> 
>>>> 
>>>> The filesystem under path '/hbase' is HEALTHY
>>>> 
>>>> Cyril SCETBON
>>>> 
>>>> Cyril SCETBON
>>>> 
>>>> On Jul 5, 2012, at 7:59 PM, Jean-Daniel Cryans wrote:
>>>> 
>>>>> Does this file really exist in HDFS?
>>>>> 
>>>>> hdfs://hb-zk1:54310/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.1341430649711
>>>>> 
>>>>> If so, did you run fsck in HDFS?
>>>>> 
>>>>> It would be weird if HDFS doesn't report anything bad but somehow the
>>>>> clients (like HBase) can't read it.
>>>>> 
>>>>> J-D
>>>>> 
>>>>> On Thu, Jul 5, 2012 at 12:45 AM, Cyril Scetbon <cy...@free.fr> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I can nolonger start my cluster correctly and get messages like http://pastebin.com/T56wrJxE (taken on one region server)
>>>>>> 
>>>>>> I suppose Hbase is not done for being stopped but only for having some nodes going down ??? HDFS is not complaining, it's only HBase that can't start correctly :(
>>>>>> 
>>>>>> I suppose some data has not been flushed and it's not really important for me. Is there a way to fix theses errors even if I will lose data ?
>>>>>> 
>>>>>> thanks
>>>>>> 
>>>>>> Cyril SCETBON
>>>>>> 
>>>> 
>> 
> 


Re: distributed log splitting aborted

Posted by Cyril Scetbon <cy...@free.fr>.
As you can see in the master log, the region servers are in charge of splitting the log files (which I suppose are not found), and the split is retried several times (I didn't check whether it's always redone) on different region servers. For example, you can follow a failing split for a file not found in the Hadoop filesystem:

http://pastebin.com/RbcLdbcs

Regards

Cyril SCETBON

On Jul 6, 2012, at 8:17 PM, Cyril Scetbon wrote:

> Here are the log files you asked for :
> 
> http://pastebin.com/xRBuQdNS  <---- hbase-master.log
> 
> http://pastebin.com/u6WYQT6R <---- hdfs-namenode.log
> 
> If you find the fix to this damn issue I'll enjoy !
> 
> Thanks
> 
> Cyril SCETBON
> 
> On Jul 5, 2012, at 11:44 PM, Jean-Daniel Cryans wrote:
> 
>> Interesting... Can you read the file? Try a "hadoop dfs -cat" on it
>> and see if it goes to the end of it.
>> 
>> It could also be useful to see a bigger portion of the master log, for
>> all I know maybe it handles it somehow and there's a problem
>> elsewhere.
>> 
>> Finally, which Hadoop version are you using?
>> 
>> Thx,
>> 
>> J-D
>> 
>> On Thu, Jul 5, 2012 at 1:58 PM, Cyril Scetbon <cy...@free.fr> wrote:
>>> yes :
>>> 
>>> /hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.134143064971
>>> 
>>> I did a fsck and here is the report :
>>> 
>>> Status: HEALTHY
>>> Total size:    618827621255 B (Total open files size: 868 B)
>>> Total dirs:    4801
>>> Total files:   2825 (Files currently being written: 42)
>>> Total blocks (validated):      11479 (avg. block size 53909541 B) (Total open file blocks (not validated): 41)
>>> Minimally replicated blocks:   11479 (100.0 %)
>>> Over-replicated blocks:        1 (0.008711561 %)
>>> Under-replicated blocks:       0 (0.0 %)
>>> Mis-replicated blocks:         0 (0.0 %)
>>> Default replication factor:    4
>>> Average block replication:     4.0000873
>>> Corrupt blocks:                0
>>> Missing replicas:              0 (0.0 %)
>>> Number of data-nodes:          12
>>> Number of racks:               1
>>> FSCK ended at Thu Jul 05 20:56:35 UTC 2012 in 795 milliseconds
>>> 
>>> 
>>> The filesystem under path '/hbase' is HEALTHY
>>> 
>>> Cyril SCETBON
>>> 
>>> Cyril SCETBON
>>> 
>>> On Jul 5, 2012, at 7:59 PM, Jean-Daniel Cryans wrote:
>>> 
>>>> Does this file really exist in HDFS?
>>>> 
>>>> hdfs://hb-zk1:54310/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.1341430649711
>>>> 
>>>> If so, did you run fsck in HDFS?
>>>> 
>>>> It would be weird if HDFS doesn't report anything bad but somehow the
>>>> clients (like HBase) can't read it.
>>>> 
>>>> J-D
>>>> 
>>>> On Thu, Jul 5, 2012 at 12:45 AM, Cyril Scetbon <cy...@free.fr> wrote:
>>>>> Hi,
>>>>> 
>>>>> I can nolonger start my cluster correctly and get messages like http://pastebin.com/T56wrJxE (taken on one region server)
>>>>> 
>>>>> I suppose Hbase is not done for being stopped but only for having some nodes going down ??? HDFS is not complaining, it's only HBase that can't start correctly :(
>>>>> 
>>>>> I suppose some data has not been flushed and it's not really important for me. Is there a way to fix theses errors even if I will lose data ?
>>>>> 
>>>>> thanks
>>>>> 
>>>>> Cyril SCETBON
>>>>> 
>>> 
> 


Re: distributed log splitting aborted

Posted by Cyril Scetbon <cy...@free.fr>.
Here are the log files you asked for:

http://pastebin.com/xRBuQdNS  <---- hbase-master.log

http://pastebin.com/u6WYQT6R <---- hdfs-namenode.log

If you find the fix for this damn issue, I'll be delighted!

Thanks

Cyril SCETBON

On Jul 5, 2012, at 11:44 PM, Jean-Daniel Cryans wrote:

> Interesting... Can you read the file? Try a "hadoop dfs -cat" on it
> and see if it goes to the end of it.
> 
> It could also be useful to see a bigger portion of the master log, for
> all I know maybe it handles it somehow and there's a problem
> elsewhere.
> 
> Finally, which Hadoop version are you using?
> 
> Thx,
> 
> J-D
> 
> On Thu, Jul 5, 2012 at 1:58 PM, Cyril Scetbon <cy...@free.fr> wrote:
>> yes :
>> 
>> /hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.134143064971
>> 
>> I did a fsck and here is the report :
>> 
>> Status: HEALTHY
>> Total size:    618827621255 B (Total open files size: 868 B)
>> Total dirs:    4801
>> Total files:   2825 (Files currently being written: 42)
>> Total blocks (validated):      11479 (avg. block size 53909541 B) (Total open file blocks (not validated): 41)
>> Minimally replicated blocks:   11479 (100.0 %)
>> Over-replicated blocks:        1 (0.008711561 %)
>> Under-replicated blocks:       0 (0.0 %)
>> Mis-replicated blocks:         0 (0.0 %)
>> Default replication factor:    4
>> Average block replication:     4.0000873
>> Corrupt blocks:                0
>> Missing replicas:              0 (0.0 %)
>> Number of data-nodes:          12
>> Number of racks:               1
>> FSCK ended at Thu Jul 05 20:56:35 UTC 2012 in 795 milliseconds
>> 
>> 
>> The filesystem under path '/hbase' is HEALTHY
>> 
>> Cyril SCETBON
>> 
>> Cyril SCETBON
>> 
>> On Jul 5, 2012, at 7:59 PM, Jean-Daniel Cryans wrote:
>> 
>>> Does this file really exist in HDFS?
>>> 
>>> hdfs://hb-zk1:54310/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.1341430649711
>>> 
>>> If so, did you run fsck in HDFS?
>>> 
>>> It would be weird if HDFS doesn't report anything bad but somehow the
>>> clients (like HBase) can't read it.
>>> 
>>> J-D
>>> 
>>> On Thu, Jul 5, 2012 at 12:45 AM, Cyril Scetbon <cy...@free.fr> wrote:
>>>> Hi,
>>>> 
>>>> I can nolonger start my cluster correctly and get messages like http://pastebin.com/T56wrJxE (taken on one region server)
>>>> 
>>>> I suppose Hbase is not done for being stopped but only for having some nodes going down ??? HDFS is not complaining, it's only HBase that can't start correctly :(
>>>> 
>>>> I suppose some data has not been flushed and it's not really important for me. Is there a way to fix theses errors even if I will lose data ?
>>>> 
>>>> thanks
>>>> 
>>>> Cyril SCETBON
>>>> 
>> 


Re: distributed log splitting aborted

Posted by Cyril Scetbon <cy...@free.fr>.
dfs.datanode.max.xcievers is set to 4096 and the soft limit of nofile is set to 32768 (that's the default in the package).

However, when I log in as hdfs the limit is 1024, and I can't find where it might be set to a higher value...
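
A minimal sketch of how to check the limit that actually applies to the running DataNode process rather than to an interactive login shell (the pgrep pattern and config locations are assumptions; adjust them to your installation):

  # find the DataNode PID and read its effective open-files limit
  DN_PID=$(pgrep -f org.apache.hadoop.hdfs.server.datanode.DataNode)
  grep 'open files' /proc/$DN_PID/limits

  # the usual places where nofile is raised for the hdfs user
  grep -r nofile /etc/security/limits.conf /etc/security/limits.d/ 2>/dev/null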

Cyril SCETBON

On Jul 6, 2012, at 12:19 PM, N Keywal wrote:

> Hi Cyril,
> 
> BTW, have you checked dfs.datanode.max.xcievers and ulimit -n? When
> underconfigured they can cause this type of errors, even if it seems
> it's not the case here...
> 
> Cheers,
> 
> N.
> 
> On Fri, Jul 6, 2012 at 11:31 AM, Cyril Scetbon <cy...@free.fr> wrote:
>> The file is now missing but I have tried with another one and you can see the error :
>> 
>> shell> hdfs dfs -ls "/hbase/.logs/hb-d11,60020,1341097456894-splitting/hb-d11%2C60020%2C1341097456894.1341421613446"
>> Found 1 items
>> -rw-r--r--   4 hbase supergroup          0 2012-07-04 17:06 /hbase/.logs/hb-d11,60020,1341097456894-splitting/hb-d11%2C60020%2C1341097456894.1341421613446
>> shell> hdfs dfs -cat "/hbase/.logs/hb-d11,60020,1341097456894-splitting/hb-d11%2C60020%2C1341097456894.1341421613446"
>> 12/07/06 09:27:51 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 3 times
>> 12/07/06 09:27:55 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 2 times
>> 12/07/06 09:27:59 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 1 times
>> cat: Could not obtain the last block locations.
>> 
>> I'm using hadoop 2.0 from Cloudera package (CDH4) with hbase 0.92.1
>> 
>> Regards
>> Cyril SCETBON
>> 
>> On Jul 5, 2012, at 11:44 PM, Jean-Daniel Cryans wrote:
>> 
>>> Interesting... Can you read the file? Try a "hadoop dfs -cat" on it
>>> and see if it goes to the end of it.
>>> 
>>> It could also be useful to see a bigger portion of the master log, for
>>> all I know maybe it handles it somehow and there's a problem
>>> elsewhere.
>>> 
>>> Finally, which Hadoop version are you using?
>>> 
>>> Thx,
>>> 
>>> J-D
>>> 
>>> On Thu, Jul 5, 2012 at 1:58 PM, Cyril Scetbon <cy...@free.fr> wrote:
>>>> yes :
>>>> 
>>>> /hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.134143064971
>>>> 
>>>> I did a fsck and here is the report :
>>>> 
>>>> Status: HEALTHY
>>>> Total size:    618827621255 B (Total open files size: 868 B)
>>>> Total dirs:    4801
>>>> Total files:   2825 (Files currently being written: 42)
>>>> Total blocks (validated):      11479 (avg. block size 53909541 B) (Total open file blocks (not validated): 41)
>>>> Minimally replicated blocks:   11479 (100.0 %)
>>>> Over-replicated blocks:        1 (0.008711561 %)
>>>> Under-replicated blocks:       0 (0.0 %)
>>>> Mis-replicated blocks:         0 (0.0 %)
>>>> Default replication factor:    4
>>>> Average block replication:     4.0000873
>>>> Corrupt blocks:                0
>>>> Missing replicas:              0 (0.0 %)
>>>> Number of data-nodes:          12
>>>> Number of racks:               1
>>>> FSCK ended at Thu Jul 05 20:56:35 UTC 2012 in 795 milliseconds
>>>> 
>>>> 
>>>> The filesystem under path '/hbase' is HEALTHY
>>>> 
>>>> Cyril SCETBON
>>>> 
>>>> Cyril SCETBON
>>>> 
>>>> On Jul 5, 2012, at 7:59 PM, Jean-Daniel Cryans wrote:
>>>> 
>>>>> Does this file really exist in HDFS?
>>>>> 
>>>>> hdfs://hb-zk1:54310/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.1341430649711
>>>>> 
>>>>> If so, did you run fsck in HDFS?
>>>>> 
>>>>> It would be weird if HDFS doesn't report anything bad but somehow the
>>>>> clients (like HBase) can't read it.
>>>>> 
>>>>> J-D
>>>>> 
>>>>> On Thu, Jul 5, 2012 at 12:45 AM, Cyril Scetbon <cy...@free.fr> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I can nolonger start my cluster correctly and get messages like http://pastebin.com/T56wrJxE (taken on one region server)
>>>>>> 
>>>>>> I suppose Hbase is not done for being stopped but only for having some nodes going down ??? HDFS is not complaining, it's only HBase that can't start correctly :(
>>>>>> 
>>>>>> I suppose some data has not been flushed and it's not really important for me. Is there a way to fix theses errors even if I will lose data ?
>>>>>> 
>>>>>> thanks
>>>>>> 
>>>>>> Cyril SCETBON
>>>>>> 
>>>> 
>> 


Re: distributed log splitting aborted

Posted by N Keywal <nk...@gmail.com>.
Hi Cyril,

BTW, have you checked dfs.datanode.max.xcievers and ulimit -n? When
underconfigured they can cause this type of error, even if it doesn't
seem to be the case here...
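
A minimal sketch of both checks, assuming a CDH-style layout under /etc/hadoop/conf (adjust the paths to your installation):

  # effective file-descriptor limit for the user running the HDFS daemons
  sudo -u hdfs bash -c 'ulimit -n'

  # xcievers setting in the DataNode configuration
  grep -A1 dfs.datanode.max.xcievers /etc/hadoop/conf/hdfs-site.xml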

Cheers,

N.

On Fri, Jul 6, 2012 at 11:31 AM, Cyril Scetbon <cy...@free.fr> wrote:
> The file is now missing but I have tried with another one and you can see the error :
>
> shell> hdfs dfs -ls "/hbase/.logs/hb-d11,60020,1341097456894-splitting/hb-d11%2C60020%2C1341097456894.1341421613446"
> Found 1 items
> -rw-r--r--   4 hbase supergroup          0 2012-07-04 17:06 /hbase/.logs/hb-d11,60020,1341097456894-splitting/hb-d11%2C60020%2C1341097456894.1341421613446
> shell> hdfs dfs -cat "/hbase/.logs/hb-d11,60020,1341097456894-splitting/hb-d11%2C60020%2C1341097456894.1341421613446"
> 12/07/06 09:27:51 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 3 times
> 12/07/06 09:27:55 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 2 times
> 12/07/06 09:27:59 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 1 times
> cat: Could not obtain the last block locations.
>
> I'm using hadoop 2.0 from Cloudera package (CDH4) with hbase 0.92.1
>
> Regards
> Cyril SCETBON
>
> On Jul 5, 2012, at 11:44 PM, Jean-Daniel Cryans wrote:
>
>> Interesting... Can you read the file? Try a "hadoop dfs -cat" on it
>> and see if it goes to the end of it.
>>
>> It could also be useful to see a bigger portion of the master log, for
>> all I know maybe it handles it somehow and there's a problem
>> elsewhere.
>>
>> Finally, which Hadoop version are you using?
>>
>> Thx,
>>
>> J-D
>>
>> On Thu, Jul 5, 2012 at 1:58 PM, Cyril Scetbon <cy...@free.fr> wrote:
>>> yes :
>>>
>>> /hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.134143064971
>>>
>>> I did a fsck and here is the report :
>>>
>>> Status: HEALTHY
>>> Total size:    618827621255 B (Total open files size: 868 B)
>>> Total dirs:    4801
>>> Total files:   2825 (Files currently being written: 42)
>>> Total blocks (validated):      11479 (avg. block size 53909541 B) (Total open file blocks (not validated): 41)
>>> Minimally replicated blocks:   11479 (100.0 %)
>>> Over-replicated blocks:        1 (0.008711561 %)
>>> Under-replicated blocks:       0 (0.0 %)
>>> Mis-replicated blocks:         0 (0.0 %)
>>> Default replication factor:    4
>>> Average block replication:     4.0000873
>>> Corrupt blocks:                0
>>> Missing replicas:              0 (0.0 %)
>>> Number of data-nodes:          12
>>> Number of racks:               1
>>> FSCK ended at Thu Jul 05 20:56:35 UTC 2012 in 795 milliseconds
>>>
>>>
>>> The filesystem under path '/hbase' is HEALTHY
>>>
>>> Cyril SCETBON
>>>
>>> Cyril SCETBON
>>>
>>> On Jul 5, 2012, at 7:59 PM, Jean-Daniel Cryans wrote:
>>>
>>>> Does this file really exist in HDFS?
>>>>
>>>> hdfs://hb-zk1:54310/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.1341430649711
>>>>
>>>> If so, did you run fsck in HDFS?
>>>>
>>>> It would be weird if HDFS doesn't report anything bad but somehow the
>>>> clients (like HBase) can't read it.
>>>>
>>>> J-D
>>>>
>>>> On Thu, Jul 5, 2012 at 12:45 AM, Cyril Scetbon <cy...@free.fr> wrote:
>>>>> Hi,
>>>>>
>>>>> I can nolonger start my cluster correctly and get messages like http://pastebin.com/T56wrJxE (taken on one region server)
>>>>>
>>>>> I suppose Hbase is not done for being stopped but only for having some nodes going down ??? HDFS is not complaining, it's only HBase that can't start correctly :(
>>>>>
>>>>> I suppose some data has not been flushed and it's not really important for me. Is there a way to fix theses errors even if I will lose data ?
>>>>>
>>>>> thanks
>>>>>
>>>>> Cyril SCETBON
>>>>>
>>>
>

Re: distributed log splitting aborted

Posted by Cyril Scetbon <cy...@free.fr>.
The file is now missing, but I have tried with another one and you can see the error:

shell> hdfs dfs -ls "/hbase/.logs/hb-d11,60020,1341097456894-splitting/hb-d11%2C60020%2C1341097456894.1341421613446"
Found 1 items
-rw-r--r--   4 hbase supergroup          0 2012-07-04 17:06 /hbase/.logs/hb-d11,60020,1341097456894-splitting/hb-d11%2C60020%2C1341097456894.1341421613446
shell> hdfs dfs -cat "/hbase/.logs/hb-d11,60020,1341097456894-splitting/hb-d11%2C60020%2C1341097456894.1341421613446"
12/07/06 09:27:51 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 3 times
12/07/06 09:27:55 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 2 times
12/07/06 09:27:59 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 1 times
cat: Could not obtain the last block locations.

I'm using Hadoop 2.0 from the Cloudera package (CDH4) with HBase 0.92.1.
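
A minimal sketch of how to get more detail on that zero-length file: fsck against a single path with these flags reports the state of its blocks, including files still open for write:

  hdfs fsck '/hbase/.logs/hb-d11,60020,1341097456894-splitting/hb-d11%2C60020%2C1341097456894.1341421613446' \
    -openforwrite -files -blocks -locations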

Regards
Cyril SCETBON

On Jul 5, 2012, at 11:44 PM, Jean-Daniel Cryans wrote:

> Interesting... Can you read the file? Try a "hadoop dfs -cat" on it
> and see if it goes to the end of it.
> 
> It could also be useful to see a bigger portion of the master log, for
> all I know maybe it handles it somehow and there's a problem
> elsewhere.
> 
> Finally, which Hadoop version are you using?
> 
> Thx,
> 
> J-D
> 
> On Thu, Jul 5, 2012 at 1:58 PM, Cyril Scetbon <cy...@free.fr> wrote:
>> yes :
>> 
>> /hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.134143064971
>> 
>> I did a fsck and here is the report :
>> 
>> Status: HEALTHY
>> Total size:    618827621255 B (Total open files size: 868 B)
>> Total dirs:    4801
>> Total files:   2825 (Files currently being written: 42)
>> Total blocks (validated):      11479 (avg. block size 53909541 B) (Total open file blocks (not validated): 41)
>> Minimally replicated blocks:   11479 (100.0 %)
>> Over-replicated blocks:        1 (0.008711561 %)
>> Under-replicated blocks:       0 (0.0 %)
>> Mis-replicated blocks:         0 (0.0 %)
>> Default replication factor:    4
>> Average block replication:     4.0000873
>> Corrupt blocks:                0
>> Missing replicas:              0 (0.0 %)
>> Number of data-nodes:          12
>> Number of racks:               1
>> FSCK ended at Thu Jul 05 20:56:35 UTC 2012 in 795 milliseconds
>> 
>> 
>> The filesystem under path '/hbase' is HEALTHY
>> 
>> Cyril SCETBON
>> 
>> Cyril SCETBON
>> 
>> On Jul 5, 2012, at 7:59 PM, Jean-Daniel Cryans wrote:
>> 
>>> Does this file really exist in HDFS?
>>> 
>>> hdfs://hb-zk1:54310/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.1341430649711
>>> 
>>> If so, did you run fsck in HDFS?
>>> 
>>> It would be weird if HDFS doesn't report anything bad but somehow the
>>> clients (like HBase) can't read it.
>>> 
>>> J-D
>>> 
>>> On Thu, Jul 5, 2012 at 12:45 AM, Cyril Scetbon <cy...@free.fr> wrote:
>>>> Hi,
>>>> 
>>>> I can nolonger start my cluster correctly and get messages like http://pastebin.com/T56wrJxE (taken on one region server)
>>>> 
>>>> I suppose Hbase is not done for being stopped but only for having some nodes going down ??? HDFS is not complaining, it's only HBase that can't start correctly :(
>>>> 
>>>> I suppose some data has not been flushed and it's not really important for me. Is there a way to fix theses errors even if I will lose data ?
>>>> 
>>>> thanks
>>>> 
>>>> Cyril SCETBON
>>>> 
>> 


Re: distributed log splitting aborted

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Interesting... Can you read the file? Try a "hadoop dfs -cat" on it
and see if it goes to the end of it.

It could also be useful to see a bigger portion of the master log, for
all I know maybe it handles it somehow and there's a problem
elsewhere.

Finally, which Hadoop version are you using?

Thx,

J-D

On Thu, Jul 5, 2012 at 1:58 PM, Cyril Scetbon <cy...@free.fr> wrote:
> yes :
>
> /hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.134143064971
>
> I did a fsck and here is the report :
>
> Status: HEALTHY
>  Total size:    618827621255 B (Total open files size: 868 B)
>  Total dirs:    4801
>  Total files:   2825 (Files currently being written: 42)
>  Total blocks (validated):      11479 (avg. block size 53909541 B) (Total open file blocks (not validated): 41)
>  Minimally replicated blocks:   11479 (100.0 %)
>  Over-replicated blocks:        1 (0.008711561 %)
>  Under-replicated blocks:       0 (0.0 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    4
>  Average block replication:     4.0000873
>  Corrupt blocks:                0
>  Missing replicas:              0 (0.0 %)
>  Number of data-nodes:          12
>  Number of racks:               1
> FSCK ended at Thu Jul 05 20:56:35 UTC 2012 in 795 milliseconds
>
>
> The filesystem under path '/hbase' is HEALTHY
>
> Cyril SCETBON
>
> Cyril SCETBON
>
> On Jul 5, 2012, at 7:59 PM, Jean-Daniel Cryans wrote:
>
>> Does this file really exist in HDFS?
>>
>> hdfs://hb-zk1:54310/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.1341430649711
>>
>> If so, did you run fsck in HDFS?
>>
>> It would be weird if HDFS doesn't report anything bad but somehow the
>> clients (like HBase) can't read it.
>>
>> J-D
>>
>> On Thu, Jul 5, 2012 at 12:45 AM, Cyril Scetbon <cy...@free.fr> wrote:
>>> Hi,
>>>
>>> I can nolonger start my cluster correctly and get messages like http://pastebin.com/T56wrJxE (taken on one region server)
>>>
>>> I suppose Hbase is not done for being stopped but only for having some nodes going down ??? HDFS is not complaining, it's only HBase that can't start correctly :(
>>>
>>> I suppose some data has not been flushed and it's not really important for me. Is there a way to fix theses errors even if I will lose data ?
>>>
>>> thanks
>>>
>>> Cyril SCETBON
>>>
>

Re: distributed log splitting aborted

Posted by Cyril Scetbon <cy...@free.fr>.
yes :

/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.134143064971

I did a fsck and here is the report :

Status: HEALTHY
 Total size:	618827621255 B (Total open files size: 868 B)
 Total dirs:	4801
 Total files:	2825 (Files currently being written: 42)
 Total blocks (validated):	11479 (avg. block size 53909541 B) (Total open file blocks (not validated): 41)
 Minimally replicated blocks:	11479 (100.0 %)
 Over-replicated blocks:	1 (0.008711561 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	4
 Average block replication:	4.0000873
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		12
 Number of racks:		1
FSCK ended at Thu Jul 05 20:56:35 UTC 2012 in 795 milliseconds


The filesystem under path '/hbase' is HEALTHY

Cyril SCETBON

On Jul 5, 2012, at 7:59 PM, Jean-Daniel Cryans wrote:

> Does this file really exist in HDFS?
> 
> hdfs://hb-zk1:54310/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.1341430649711
> 
> If so, did you run fsck in HDFS?
> 
> It would be weird if HDFS doesn't report anything bad but somehow the
> clients (like HBase) can't read it.
> 
> J-D
> 
> On Thu, Jul 5, 2012 at 12:45 AM, Cyril Scetbon <cy...@free.fr> wrote:
>> Hi,
>> 
>> I can nolonger start my cluster correctly and get messages like http://pastebin.com/T56wrJxE (taken on one region server)
>> 
>> I suppose Hbase is not done for being stopped but only for having some nodes going down ??? HDFS is not complaining, it's only HBase that can't start correctly :(
>> 
>> I suppose some data has not been flushed and it's not really important for me. Is there a way to fix theses errors even if I will lose data ?
>> 
>> thanks
>> 
>> Cyril SCETBON
>> 


Re: distributed log splitting aborted

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Does this file really exist in HDFS?

hdfs://hb-zk1:54310/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.1341430649711

If so, did you run fsck in HDFS?

It would be weird if HDFS doesn't report anything bad but somehow the
clients (like HBase) can't read it.

J-D

On Thu, Jul 5, 2012 at 12:45 AM, Cyril Scetbon <cy...@free.fr> wrote:
> Hi,
>
> I can nolonger start my cluster correctly and get messages like http://pastebin.com/T56wrJxE (taken on one region server)
>
> I suppose Hbase is not done for being stopped but only for having some nodes going down ??? HDFS is not complaining, it's only HBase that can't start correctly :(
>
> I suppose some data has not been flushed and it's not really important for me. Is there a way to fix theses errors even if I will lose data ?
>
> thanks
>
> Cyril SCETBON
>