Posted to user@hbase.apache.org by Jamie Cockrill <ja...@gmail.com> on 2010/08/02 15:16:19 UTC

Regionserver tanked, can't seem to get master back up fully

Hi All,

I set off a long-running loading job over the weekend and it seems to
have rather destroyed my hbase cluster. Most of the nodes were down
this morning and upon restarting them, I'm now persistently getting
the following message every few ms in the master logs:

DFSClient: Could not complete file
/hbase/.logs/compute17.cluster1.lan,60020,1280518716613/a filename

That file is a zero-byte file on HDFS. The datanodes all look fine and
don't seem to have had any trouble. I'm not especially fussed about
having to rebuild and reload that table, but the trouble now is that I
can't start the cluster properly in order to drop it.

Does anyone know how I can remove the table or fix these errors
manually? As I said, I'm not fussed about data loss.

thanks

Jamie

Re: Regionserver tanked, can't seem to get master back up fully

Posted by Jamie Cockrill <ja...@gmail.com>.
All,

I seem to have resolved this myself. I went and found the
recovered.edits file in the HDFS directory that relates to the broken
region and moved it aside. Once I'd done that, the master stopped
complaining and started fully.

Presumably this means I've lost all the data in that file (thankfully
only a few KB), which isn't too much of a problem as I'll just reload
it in a sec.
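
For anyone hitting the same wall, a rough sketch of the workaround
using hadoop fs commands. The table name and region hash below are
made up for illustration; list your own table directory to find the
region that actually holds the recovered.edits file:

```shell
# List the region directories of the affected table (names here are
# illustrative -- substitute your own table and region hash):
hadoop fs -ls /hbase/mytable

# Check the suspect region for a recovered.edits directory:
hadoop fs -ls /hbase/mytable/1028785192

# Move it aside rather than deleting it outright, so it can be put back
# if this turns out to be the wrong region. HBase will no longer try to
# replay those edits, which is where the data loss comes from:
hadoop fs -mv /hbase/mytable/1028785192/recovered.edits /tmp/recovered.edits.1028785192
```

After moving it, restart the master and it should come up without the
"Could not complete file" loop.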

ta

Jamie

On 2 August 2010 14:16, Jamie Cockrill <ja...@gmail.com> wrote:
> [quoted text snipped]

Re: Regionserver tanked, can't seem to get master back up fully

Posted by Jean-Daniel Cryans <jd...@apache.org>.
We'll know for sure when we see those stack traces (both master and DNs).

J-D

On Tue, Aug 3, 2010 at 6:22 AM, Jamie Cockrill <ja...@gmail.com> wrote:
> [quoted text snipped]

Re: Regionserver tanked, can't seem to get master back up fully

Posted by Jamie Cockrill <ja...@gmail.com>.
PS: yes, that was coming from the master.

On 3 August 2010 14:22, Jamie Cockrill <ja...@gmail.com> wrote:
> [quoted text snipped]

Re: Regionserver tanked, can't seem to get master back up fully

Posted by Jamie Cockrill <ja...@gmail.com>.
Hi JD,

The cluster is on a separate, isolated network; I'll see if any of the
traces remain. As for the ulimit and xcievers bit, those are set up
correctly as per the API doc you mention.

Thanks

Jamie

On 2 August 2010 19:18, Jean-Daniel Cryans <jd...@apache.org> wrote:
> [quoted text snipped]

Re: Regionserver tanked, can't seem to get master back up fully

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Is that coming from the master? If so, it means that it was trying to
write recovered data from a failed region server and wasn't able to do
so. It sounds bad.

- Can we get full stack traces of that error?
- Did you check the datanode logs for any exception? Very often
(strong emphasis on "very"), it's an issue with either ulimit or
xcievers. Is your cluster configured per the last bullet on that page?
http://hbase.apache.org/docs/r0.20.6/api/overview-summary.html#requirements
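
For reference, the two checks can be done roughly like this. The
values shown are the commonly recommended ones from that requirements
page, not something verified against this particular cluster:

```shell
# 1. File-descriptor limit for the user running the datanode and
#    regionserver processes; the docs recommend raising it well above
#    the usual default of 1024 (e.g. to 32768 in /etc/security/limits.conf):
ulimit -n

# 2. The datanode transceiver limit -- note the property name really is
#    spelled "xcievers" in Hadoop. It lives in hdfs-site.xml and should
#    be set to something like:
#      <property>
#        <name>dfs.datanode.max.xcievers</name>
#        <value>4096</value>
#      </property>
grep -A1 'dfs.datanode.max.xcievers' "$HADOOP_CONF_DIR/hdfs-site.xml"
```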

Thx

J-D

On Mon, Aug 2, 2010 at 6:16 AM, Jamie Cockrill <ja...@gmail.com> wrote:
> [quoted text snipped]