Posted to user@hbase.apache.org by Kang Minwoo <mi...@outlook.com> on 2019/05/29 04:39:05 UTC

Disk hot swap for data node while hbase use short-circuit

Hello, Users.

I use JBOD for the DataNodes. Sometimes a disk in a DataNode has a problem.

At first, I would shut down every instance on the affected machine, including the DataNode and the RegionServer.
But that is not a good solution, so I improved the process.

Now, when I detect a disk problem on a server, I just perform a disk hot swap.

But the system administrator complains that some FDs are still open, so they cannot remove the disk.
The RegionServer holds those FDs because I use the short-circuit reads feature. (HBase version 1.2.9)

When we first hit this issue, we force-unmounted the disk and remounted it.
But after that, the kernel reported errors[1].

So now we avoid it by purging the stale FDs.
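(For illustration only, not from the thread: the "deleted but still open" situation the administrator complains about can be reproduced on any Linux box with a minimal sketch, independent of HBase. The inode stays pinned until the last FD is closed, which is exactly why the disk cannot be removed.)

```shell
# Reproduce a deleted-but-still-open file descriptor (Linux /proc only).
tmp=$(mktemp /tmp/fd-demo.XXXXXX)
exec 3<"$tmp"          # open the file on FD 3
rm -f "$tmp"           # unlink it; the open FD keeps the inode alive
ls -l /proc/$$/fd/3    # the symlink target now ends in "(deleted)"
exec 3<&-              # closing the FD finally releases the inode
```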

I think this issue is common, and I would like to know how other HBase users deal with it.

Thank you very much for sharing your experience.

Best regards,
Minwoo Kang

[1]: https://www.thegeekdiary.com/xfs_log_force-error-5-returned-xfs-error-centos-rhel-7/

Re: Disk hot swap for data node while hbase use short-circuit

Posted by Josh Elser <el...@apache.org>.
Reminds me of https://issues.apache.org/jira/browse/HBASE-21915 too. 
Agree with Wei-Chiu that I'd start by ruling out HDFS issues first, and 
then start worrying about HBase issues :)


Re: Disk hot swap for data node while hbase use short-circuit

Posted by Wei-Chiu Chuang <we...@cloudera.com.INVALID>.
I think I found a similar bug report that matches your symptom: HDFS-12204
<https://issues.apache.org/jira/browse/HDFS-12204> (Dfsclient Do not close
file descriptor when using shortcircuit)


Re: Disk hot swap for data node while hbase use short-circuit

Posted by Kang Minwoo <mi...@outlook.com>.
I think these files were opened for reads, because those blocks are finalized.

---
ls -al /proc/regionserver_pid/fd
902 -> /data_path/current/finalized/~/blk_1 (deleted)
946 -> /data_path/current/finalized/~/blk_2 (deleted)
947 -> /data_path/current/finalized/~/blk_3.meta (deleted)
---

I think it is not an HBase bug: the DFSClient only checks for stale FDs when the fetch method is invoked.
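(An editorial aside, not part of the original message: the kind of listing shown above can be narrowed down automatically. The helper below is a hypothetical sketch that filters an `ls -l /proc/<pid>/fd` listing to FDs whose target was deleted under a given data directory; the pid and `/data_path` mount point are placeholders for your RegionServer and the failed disk.)

```shell
# Filter an "ls -l /proc/<pid>/fd" listing down to FDs that point at
# deleted files under a given mount point.
# Usage: ls -l /proc/<regionserver_pid>/fd | list_stale_fds /data_path
list_stale_fds() {
  # Print only lines that mention the mount point and end in "(deleted)".
  awk -v m="$1" 'index($0, m) && /\(deleted\)$/'
}
```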

Best regards,
Minwoo Kang


Re: Disk hot swap for data node while hbase use short-circuit

Posted by Wei-Chiu Chuang <we...@cloudera.com.INVALID>.
Do you have a list of the files that were being opened? I'd like to know
whether those files were opened for writes or for reads.

If you are on a more recent version of Hadoop (2.8.0 and above),
there's an HDFS command to interrupt ongoing writes to DataNodes (HDFS-9945
<https://issues.apache.org/jira/browse/HDFS-9945>)

https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#dfsadmin
hdfs dfsadmin -evictWriters

Looking at the HDFS hot-swap implementation, it looks like the DataNode doesn't
interrupt writers when a volume is removed. That sounds like a bug.
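(Editorial note: putting the commands mentioned above together, a hot-swap sequence on Hadoop 2.8+ might look like the sketch below. The `dfsadmin -evictWriters` and `dfsadmin -reconfig` subcommands exist in those releases, but the host:port (the DataNode IPC port, 50020 by default in Hadoop 2.x) and the overall procedure are assumptions to validate on your own cluster, not a tested recipe.)

```shell
# 1. Interrupt any writers still pinning the DataNode's volumes.
hdfs dfsadmin -evictWriters datanode-host:50020

# 2. Remove the failed disk from dfs.datanode.data.dir in the DataNode's
#    hdfs-site.xml, then ask the DataNode to reload its volume list
#    (hot-swap reconfiguration) and poll until it finishes.
hdfs dfsadmin -reconfig datanode datanode-host:50020 start
hdfs dfsadmin -reconfig datanode datanode-host:50020 status
```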
