Posted to issues@hbase.apache.org by "Michael Stack (Jira)" <ji...@apache.org> on 2020/05/22 22:44:00 UTC
[jira] [Comment Edited] (HBASE-22041) [k8s] The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.

    [ https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114435#comment-17114435 ] 

Michael Stack edited comment on HBASE-22041 at 5/22/20, 10:43 PM:
------------------------------------------------------------------

I wonder whether we should change the AbstractRPCChannel so that, rather than caching an address (InetSocketAddress), it caches just the remote ServerName, as per https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/. The Pod will come back with the same 'name' but may have a different IP ("This is why it is important not to configure other applications to connect to Pods in a StatefulSet by IP address."). I think connecting by IP is what we are doing when we cache an ISA. Let me see... (Thanks for the logs, [~timoha].)
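
To make the difference concrete, here is a minimal sketch of the "keep the name, resolve at connect time" idea. It is not HBase's RPC client code: ResolveAtConnect, RemoteEndpoint, unresolved() and resolveNow() are hypothetical names made up for illustration, and "localhost"/16020 are placeholder values. It relies only on the JDK behavior that new InetSocketAddress(host, port) attempts a lookup when constructed, while InetSocketAddress.createUnresolved(host, port) does not.

```java
import java.net.InetSocketAddress;

// Hypothetical illustration; none of these names are HBase classes.
final class ResolveAtConnect {

    // Keeps only hostname and port, i.e. the part of a StatefulSet Pod's
    // identity that survives a restart. No IP is stored here.
    static final class RemoteEndpoint {
        private final String hostname;
        private final int port;

        RemoteEndpoint(String hostname, int port) {
            this.hostname = hostname;
            this.port = port;
        }

        // Form that is safe to cache: createUnresolved() performs no DNS lookup.
        InetSocketAddress unresolved() {
            return InetSocketAddress.createUnresolved(hostname, port);
        }

        // Call right before each connect attempt: this constructor attempts a
        // lookup every time it runs (subject to the JVM's own DNS cache).
        InetSocketAddress resolveNow() {
            return new InetSocketAddress(hostname, port);
        }
    }

    public static void main(String[] args) {
        RemoteEndpoint rs = new RemoteEndpoint("localhost", 16020);

        InetSocketAddress stored = rs.unresolved();
        System.out.println("stored:  " + stored + " unresolved=" + stored.isUnresolved());

        InetSocketAddress target = rs.resolveNow();
        System.out.println("connect: " + target + " unresolved=" + target.isUnresolved());
    }
}
```

Even with this pattern, the JVM may cache successful lookups itself (the networkaddress.cache.ttl security property), so a restarted Pod's new IP might only be picked up once that cache entry expires.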

Oh, I bet HDFS gets confused too... Let me check logs.



> [k8s] The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-22041
>                 URL: https://issues.apache.org/jira/browse/HBASE-22041
>             Project: HBase
>          Issue Type: Bug
>            Reporter: lujie
>            Priority: Critical
>         Attachments: bug.zip, hbasemaster.log, normal.zip
>
>
> While the master is doing a fresh boot, we crash (kill -9) the RS that holds meta. We find that the master startup fails and prints thousands of log lines like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to hadoop14/172.16.1.131:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=7, retrying...
> 2019-03-13 01:09:55,755 WARN [RSProcedureDispatcher-pool4-t9] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=8, retrying...
> {code}


