Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2020/03/03 07:22:13 UTC

[GitHub] [incubator-dolphinscheduler] nd49782 opened a new issue #2065: As scheduled tasks increase, slow ZooKeeper sync causes socket timeouts, bringing down all Master services

nd49782 opened a new issue #2065: As scheduled tasks increase, slow ZooKeeper sync causes socket timeouts, bringing down all Master services
URL: https://github.com/apache/incubator-dolphinscheduler/issues/2065
 
 
   ZooKeeper version: 3.4.11
   DolphinScheduler version: 1.2.0
   
   DS went into production in early February. At first only a few tasks were deployed and everything ran fine for about 20 days; from the 20th until the end of February we gradually migrated other scheduled jobs into DS. As the number of scheduled tasks grew, ZooKeeper became slow when syncing transaction log data, presumably because the amount of data to sync kept growing. This led to socket timeouts, which in turn brought down all Master services (the other services were fine; only the two Masters crashed).
   
   This has happened twice. The first time we only traced it to a ZooKeeper problem without digging deeper, and restarting the Masters resolved it.
   The second time (on March 1st) we investigated in depth. After tuning ZooKeeper's parameters there was no obvious improvement and the ZooKeeper logs kept reporting the same errors (we adjusted the parameters and restarted ZooKeeper on March 2nd, and on the morning of the 3rd it was still reporting errors). Even so, as of now the Masters are still working normally rather than going straight down as on the previous two occasions.
   But the underlying problem is clearly not solved, and I have no better troubleshooting ideas at the moment. Any advice would be greatly appreciated!
   
   **Log of the primary Master at the time of the failure:**
   
   > [INFO] 2020-03-01 15:21:43.119 org.apache.dolphinscheduler.server.master.MasterServer:[181] - master server is stopping ..., cause : i was judged to death, release resources and stop myself
   [ERROR] 2020-03-01 15:21:43.122 org.apache.dolphinscheduler.common.zk.AbstractZKClient:[543] - MASTER server(192.168.1.251) of myself dead , stopping...
   [INFO] 2020-03-01 15:21:46.120 org.apache.dolphinscheduler.server.master.MasterServer:[198] - heartbeat service stopped
   [INFO] 2020-03-01 15:21:46.121 org.quartz.core.QuartzScheduler:[694] - Scheduler DolphinScheduler_$_bd12.zjzy.cn1582527406125 shutting down.
   [INFO] 2020-03-01 15:21:46.121 org.quartz.core.QuartzScheduler:[613] - Scheduler DolphinScheduler_$_bd12.zjzy.cn1582527406125 paused.
   [INFO] 2020-03-01 15:21:46.124 com.alibaba.druid.pool.DruidDataSource:[1928] - {dataSource-1} closed
   [INFO] 2020-03-01 15:21:46.125 org.quartz.core.QuartzScheduler:[771] - Scheduler DolphinScheduler_$_bd12.zjzy.cn1582527406125 shutdown complete.
   [INFO] 2020-03-01 15:21:46.125 org.apache.dolphinscheduler.server.quartz.QuartzExecutors:[126] - Quartz service stopped, and halt all tasks
   [INFO] 2020-03-01 15:21:46.126 org.apache.dolphinscheduler.server.master.MasterServer:[207] - Quartz service stopped
   [INFO] 2020-03-01 15:21:46.130 org.apache.dolphinscheduler.server.master.MasterServer:[215] - threadpool service stopped
   [INFO] 2020-03-01 15:21:46.130 org.apache.dolphinscheduler.server.master.MasterServer:[223] - master scheduler service stopped
   [INFO] 2020-03-01 15:21:46.136 org.apache.zookeeper.ZooKeeper:[693] - Session: 0x300006b30d0047a closed
   [INFO] 2020-03-01 15:21:46.137 org.apache.dolphinscheduler.common.zk.AbstractZKClient:[138] - zookeeper close ...
   [INFO] 2020-03-01 15:21:46.137 org.apache.dolphinscheduler.server.master.MasterServer:[231] - zookeeper service stopped
   
   **The other Master:**
   
   > [INFO] 2020-03-01 15:21:43.421 org.apache.dolphinscheduler.server.master.MasterServer:[181] - master server is stopping ..., cause : i was judged to death, release resources and stop myself
   [INFO] 2020-03-01 15:21:43.422 org.apache.dolphinscheduler.server.zk.ZKMasterClient:[257] - MASTER node deleted : /dolphinscheduler/masters/192.168.1.251_0000000011
   [INFO] 2020-03-01 15:21:43.427 org.apache.dolphinscheduler.common.zk.AbstractZKClient:[287] - MASTER server dead , and /dolphinscheduler/masters/192.168.1.251_0000000011 added to zk dead server path success
   [INFO] 2020-03-01 15:21:43.445 org.apache.dolphinscheduler.server.zk.ZKMasterClient:[477] - start master failover ...
   [INFO] 2020-03-01 15:21:43.448 org.apache.dolphinscheduler.server.zk.ZKMasterClient:[486] - master failover end
   [ERROR] 2020-03-01 15:21:43.450 org.apache.dolphinscheduler.common.zk.AbstractZKClient:[543] - MASTER server(192.168.1.252) of myself dead , stopping...
   [INFO] 2020-03-01 15:21:46.422 org.apache.dolphinscheduler.server.master.MasterServer:[198] - heartbeat service stopped
   [INFO] 2020-03-01 15:21:46.422 org.quartz.core.QuartzScheduler:[694] - Scheduler DolphinScheduler_$_bd13.zjzy.cn1582527403002 shutting down.
   [INFO] 2020-03-01 15:21:46.422 org.quartz.core.QuartzScheduler:[613] - Scheduler DolphinScheduler_$_bd13.zjzy.cn1582527403002 paused.
   [INFO] 2020-03-01 15:21:46.425 com.alibaba.druid.pool.DruidDataSource:[1928] - {dataSource-2} closed
   [INFO] 2020-03-01 15:21:46.426 org.quartz.core.QuartzScheduler:[771] - Scheduler DolphinScheduler_$_bd13.zjzy.cn1582527403002 shutdown complete.
   [INFO] 2020-03-01 15:21:46.426 org.apache.dolphinscheduler.server.quartz.QuartzExecutors:[126] - Quartz service stopped, and halt all tasks
   [INFO] 2020-03-01 15:21:46.426 org.apache.dolphinscheduler.server.master.MasterServer:[207] - Quartz service stopped
   [INFO] 2020-03-01 15:21:46.430 org.apache.dolphinscheduler.server.master.MasterServer:[215] - threadpool service stopped
   [INFO] 2020-03-01 15:21:46.430 org.apache.dolphinscheduler.server.master.MasterServer:[223] - master scheduler service stopped
   [INFO] 2020-03-01 15:21:46.433 org.apache.zookeeper.ZooKeeper:[693] - Session: 0x400006ceb1d045e closed
   [INFO] 2020-03-01 15:21:46.435 org.apache.dolphinscheduler.common.zk.AbstractZKClient:[138] - zookeeper close ...
   [INFO] 2020-03-01 15:21:46.435 org.apache.dolphinscheduler.server.master.MasterServer:[231] - zookeeper service stopped
   
   **ZooKeeper leader log at the time of the failure (newest first):**
   
   > Sun Mar 01 15:23:09 CST 2020 WARN NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn - Unable to read additional data from client sessionid 0x300006b30d003b0, likely client has closed socket
   Sun Mar 01 15:22:22 CST 2020 WARN NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn - Unable to read additional data from client sessionid 0x400006ceb1d03c8, likely client has closed socket
   Sun Mar 01 15:21:57 CST 2020 WARN LearnerHandler-/192.168.1.240:60090:LearnerHandler - ******* GOODBYE /192.168.1.240:60090 ********
   Sun Mar 01 15:21:57 CST 2020 ERROR LearnerHandler-/192.168.1.240:60090:LearnerHandler - Unexpected exception causing shutdown while sock still open
   java.net.SocketTimeoutException: Read timed out
   	at java.net.SocketInputStream.socketRead0(Native Method)
   	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
   	at java.net.SocketInputStream.read(SocketInputStream.java:171)
   	at java.net.SocketInputStream.read(SocketInputStream.java:141)
   	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
   	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
   	at java.io.DataInputStream.readInt(DataInputStream.java:387)
   	at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
   	at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
   	at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
   	at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
   Sun Mar 01 07:21:45 CST 2020 WARN SyncThread:4:FileTxnLog - fsync-ing the write ahead log in SyncThread:4 took 1787ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
   Sun Mar 01 07:13:20 CST 2020 WARN SyncThread:4:FileTxnLog - fsync-ing the write ahead log in SyncThread:4 took 1018ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
   Sun Mar 01 07:04:28 CST 2020 WARN SyncThread:4:FileTxnLog - fsync-ing the write ahead log in SyncThread:4 took 1222ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
   Sun Mar 01 06:34:02 CST 2020 WARN SyncThread:4:FileTxnLog - fsync-ing the write ahead log in SyncThread:4 took 1137ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
   
   Before the SocketTimeoutException: Read timed out was triggered there were large numbers of
   `fsync-ing the write ahead log in SyncThread:4 took 1137ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide`
   messages, and ZooKeeper's other followers were also full of the same line. (If I understand correctly, this warning is emitted whenever an fsync takes longer than fsync.warningthresholdms, which defaults to 1000 ms, so every one of those lines means a transaction-log flush of more than a second.)
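   
   For anyone comparing notes: a quick way to see how each ensemble member is doing is the `mntr` four-letter-word command (assuming it is not restricted by 4lw.commands.whitelist). The snippet below is only a minimal sketch of such a probe, not anything shipped with DS or ZooKeeper, and the host list is a placeholder:
   
   ```python
   import socket
   
   # Placeholder hosts; substitute your own ZooKeeper ensemble members.
   ZK_NODES = ["zk1.example.com", "zk2.example.com", "zk3.example.com"]
   
   def four_letter(host, cmd, port=2181, timeout=5):
       """Send a ZooKeeper four-letter-word command and return the raw reply."""
       with socket.create_connection((host, port), timeout=timeout) as sock:
           sock.sendall(cmd.encode())
           chunks = []
           while True:
               data = sock.recv(4096)
               if not data:
                   break
               chunks.append(data)
       return b"".join(chunks).decode()
   
   for host in ZK_NODES:
       # mntr returns tab-separated "key<TAB>value" lines
       stats = dict(line.split("\t", 1)
                    for line in four_letter(host, "mntr").splitlines()
                    if "\t" in line)
       print(host,
             "state:", stats.get("zk_server_state"),
             "avg_latency:", stats.get("zk_avg_latency"),
             "max_latency:", stats.get("zk_max_latency"),
             "outstanding:", stats.get("zk_outstanding_requests"))
   ```
   
   If one member consistently shows much higher latency than the others, that points at disk or network on that host rather than at DS.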
   
   The cause can basically be pinned down as follows:
   it is mainly the forceSync mechanism;
   as the workload grows, the transaction data to be synced gets larger, writing the transaction log becomes too slow, and the operation times out.
   
   So I tuned the following ZooKeeper parameters.
   Increase the ZooKeeper sync time settings:
   > tickTime=5000
   initLimit=20
   syncLimit=10
   
   Set the server-side session timeout bounds:
   > maxSessionTimeout=300000
   minSessionTimeout=10000
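   
   If I read the ZooKeeper admin guide correctly, these settings combine roughly as in the quick check below; the numbers are just what the values above work out to, so treat it as a sketch rather than anything authoritative:
   
   ```python
   # Back-of-the-envelope check of what the tuned values above allow
   # (my understanding of how ZooKeeper combines them; not authoritative).
   tick_time_ms = 5000
   init_limit_ticks = 20
   sync_limit_ticks = 10
   max_session_timeout_ms = 300000
   min_session_timeout_ms = 10000
   
   # A follower may take up to initLimit ticks to connect to and sync with the leader.
   print("initial sync window :", init_limit_ticks * tick_time_ms / 1000, "s")   # 100 s
   # A follower may lag the leader by up to syncLimit ticks before it is dropped.
   print("follower sync window:", sync_limit_ticks * tick_time_ms / 1000, "s")   # 50 s
   # Client-requested session timeouts are clamped into [minSessionTimeout, maxSessionTimeout];
   # without the explicit settings the defaults would be 2*tickTime .. 20*tickTime.
   print("session timeout range:", min_session_timeout_ms / 1000, "-",
         max_session_timeout_ms / 1000, "s")                                     # 10 - 300 s
   ```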
   
   I see that DS's zkSessionTimeout and zkConnectionTimeout settings both default to 300 s.
   
   After making the adjustments above, this morning the ZooKeeper leader still reported SocketTimeoutException: Read timed out. The good news is that the DS Masters did not crash outright this time, and no anomalies have shown up in their logs so far.
   But the problem is not fundamentally solved and the risk is still considerable; the observation window has been short, so we are still watching it.
   And I have run out of better troubleshooting ideas and methods.
   
   From what I have read, the last-resort workaround is to set forceSync=no in the ZooKeeper configuration.
   But many external services also depend on our ZooKeeper cluster, and with forced sync disabled its reliability can no longer be guaranteed. The risk is too high, so that option is not really feasible for us.
   
   Asking for help: is there any troubleshooting approach or better solution for this problem, and is there any configuration tuning that should be done on the DS side?
   If the Masters go down again next time, we currently have no remedy left...
   
   Many thanks!!

[GitHub] [incubator-dolphinscheduler] nd49782 closed issue #2065: As scheduled tasks increase, slow ZooKeeper sync causes socket timeouts, bringing down all Master services

Posted by GitBox <gi...@apache.org>.
nd49782 closed issue #2065: As scheduled tasks increase, slow ZooKeeper sync causes socket timeouts, bringing down all Master services
URL: https://github.com/apache/incubator-dolphinscheduler/issues/2065
 
 
   


[GitHub] [incubator-dolphinscheduler] CESAEREE commented on issue #2065: As scheduled tasks increase, slow ZooKeeper sync causes socket timeouts, bringing down all Master services

Posted by GitBox <gi...@apache.org>.
CESAEREE commented on issue #2065: As scheduled tasks increase, slow ZooKeeper sync causes socket timeouts, bringing down all Master services
URL: https://github.com/apache/incubator-dolphinscheduler/issues/2065#issuecomment-593818738
 
 
   I am running into a similar problem as well; hoping for a fix!


[GitHub] [incubator-dolphinscheduler] nd49782 commented on issue #2065: As scheduled tasks increase, slow ZooKeeper sync causes socket timeouts, bringing down all Master services

Posted by GitBox <gi...@apache.org>.
nd49782 commented on issue #2065: As scheduled tasks increase, slow ZooKeeper sync causes socket timeouts, bringing down all Master services
URL: https://github.com/apache/incubator-dolphinscheduler/issues/2065#issuecomment-595626543
 
 
   With the help of developer 佳竹, we found that one ZooKeeper node had very high network latency, which appears to have been caused by an unstable network.
   I also adjusted the following DS parameters:
   > zkSessionTimeout
   zkConnectionTimeout
   zkRetrySleep
   zkRetryMaxtime
   
   After monitoring for several days: in the first few days ZooKeeper still reported SocketTimeoutException, but the Masters have not crashed since.
   Over the most recent days of monitoring, ZooKeeper has been completely normal, with no exceptions reported, and the node that previously had high latency has also returned to normal.
   
   As for monitoring the network latency itself, I have contacted the ops team.
   I will keep watching this issue and report back to the community if there is further progress. Thanks!
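   
   In case it helps anyone watching for a node like that: a rough way is to sample TCP connect times to each ensemble member. This is only an illustrative sketch with placeholder hosts, complementing the `mntr` probe above, not anything from DS itself:
   
   ```python
   import socket
   import statistics
   import time
   
   # Placeholder hosts; substitute your own ZooKeeper ensemble members.
   ZK_NODES = ["zk1.example.com", "zk2.example.com", "zk3.example.com"]
   
   def connect_latency_ms(host, port=2181, samples=10, pause=0.5):
       """Sample TCP connect round trips to one node as a rough network-latency proxy."""
       times = []
       for _ in range(samples):
           start = time.monotonic()
           with socket.create_connection((host, port), timeout=5):
               pass
           times.append((time.monotonic() - start) * 1000.0)
           time.sleep(pause)
       return statistics.mean(times), max(times)
   
   for host in ZK_NODES:
       avg_ms, worst_ms = connect_latency_ms(host)
       print(f"{host}: avg {avg_ms:.1f} ms, worst {worst_ms:.1f} ms")
   ```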


[GitHub] [incubator-dolphinscheduler] fanghj commented on issue #2065: As scheduled tasks increase, slow ZooKeeper sync causes socket timeouts, bringing down all Master services

Posted by GitBox <gi...@apache.org>.
fanghj commented on issue #2065:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/2065#issuecomment-742937515


   > With the help of developer 佳竹, we found that one ZooKeeper node had very high network latency, which appears to have been caused by an unstable network.
   > I also adjusted the following DS parameters:
   > 
   > > zkSessionTimeout
   > > zkConnectionTimeout
   > > zkRetrySleep
   > > zkRetryMaxtime
   > 
   > After monitoring for several days: in the first few days ZooKeeper still reported SocketTimeoutException, but the Masters have not crashed since.
   > Over the most recent days of monitoring, ZooKeeper has been completely normal, with no exceptions reported, and the node that previously had high latency has also returned to normal.
   > 
   > As for monitoring the network latency itself, I have contacted the ops team.
   > I will keep watching this issue and report back to the community if there is further progress. Thanks!
   
   Hi, did this problem come back later on? I am running into it here as well, DS 1.3.3.





[GitHub] [incubator-dolphinscheduler] nd49782 commented on issue #2065: As scheduled tasks increase, slow ZooKeeper sync causes socket timeouts, bringing down all Master services

Posted by GitBox <gi...@apache.org>.
nd49782 commented on issue #2065:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/2065#issuecomment-742952192


   > > With the help of developer 佳竹, we found that one ZooKeeper node had very high network latency, which appears to have been caused by an unstable network.
   > > I also adjusted the following DS parameters:
   > > > zkSessionTimeout
   > > > zkConnectionTimeout
   > > > zkRetrySleep
   > > > zkRetryMaxtime
   > > 
   > > 
   > > After monitoring for several days: in the first few days ZooKeeper still reported SocketTimeoutException, but the Masters have not crashed since.
   > > Over the most recent days of monitoring, ZooKeeper has been completely normal, with no exceptions reported, and the node that previously had high latency has also returned to normal.
   > > As for monitoring the network latency itself, I have contacted the ops team.
   > > I will keep watching this issue and report back to the community if there is further progress. Thanks!
   > 
   > Hi, did this problem come back later on? I am running into it here as well, DS 1.3.3.
   
   It has not come back.
   We later suspected that one ZooKeeper node's VM was not on the same physical host as most of the other ZooKeeper node VMs (network jitter), so we rearranged the ZooKeeper nodes.
   With the network stable, the problem has not recurred.

