You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Susheel Kumar <su...@gmail.com> on 2018/10/22 21:31:19 UTC

ZookeeperServer not running/Client Session timed out

Hello,

I am seeing "ZookeeperServer not running" WARM messages in zookeeper logs
which is causing the Solr client connections to timeout...

What could be the problem?

ZK: 3.4.10

Zookeeper.out
==
2018-10-22 06:04:51,071 [myid:2] - INFO
[WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0xf00000461 (n.zxid), 0x10 (n.round),
FOLLOWING (n.state), 4 (n.sid), 0xf (n.peerEpoch) LOOKING (my state)
2018-10-22 06:04:51,093 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxnFactory@192] - Accepted socket connection
from /192.72.25.177:39514
2018-10-22 06:04:51,094 [myid:2] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@373] - Exception causing close of
session 0x0 due to java.io.IOException: ZooKeeperServer not running
2018-10-22 06:04:51,094 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@1044] - Closed socket connection for
client /192.72.25.177:39514 (no session established for client)
2018-10-22 06:04:51,138 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxnFactory@192] - Accepted socket connection
from /192.3.101.219:56298
2018-10-22 06:04:51,138 [myid:2] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@373] - Exception causing close of
session 0x0 due to java.io.IOException: ZooKeeperServer not running
2018-10-22 06:04:51,139 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@1044] - Closed socket connection for
client /192.3.101.219:56298 (no session established for client)
2018-10-22 06:04:51,250 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxnFactory@192] - Accepted socket connection
from /192.72.27.181:46414
2018-10-22 06:04:51,250 [myid:2] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@373] - Exception causing close of
session 0x0 due to java.io.IOException: ZooKeeperServer not running
2018-10-22 06:04:51,250 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@1044] - Closed socket connection for
client /192.72.27.181:46414 (no session established for client)
2018-10-22 06:04:51,275 [myid:2] - INFO
[WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
format version), 4 (n.leader), 0xf00000461 (n.zxid), 0x192 (n.round),
LOOKING (n.state), 4 (n.sid), 0xf (n.peerEpoch) LOOKING (my state)
2018-10-22 06:04:51,275 [myid:2] - INFO
[WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
format version), 4 (n.leader), 0xf00000461 (n.zxid), 0x192 (n.round),
LOOKING (n.state), 2 (n.sid), 0xf (n.peerEpoch) LOOKING (my state)
2018-10-22 06:04:51,275 [myid:2] - INFO
[WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
format version), 4 (n.leader), 0xf00000461 (n.zxid), 0x192 (n.round),
LOOKING (n.state), 1 (n.sid), 0xf (n.peerEpoch) LOOKING (my state)
2018-10-22 06:04:51,309 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxnFactory@192] - Accepted socket connection
from /192.72.5.212:38944
2018-10-22 06:04:51,309 [myid:2] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@373] - Exception causing close of
session 0x0 due to java.io.IOException: ZooKeeperServer not running
2018-10-22 06:04:51,309 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@1044] - Closed socket connection for
client /192.72.5.212:38944 (no session established for client)
2018-10-22 06:04:51,356 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxnFactory@192] - Accepted socket connection
from /192.72.7.201:59310
2018-10-22 06:04:51,356 [myid:2] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@373] - Exception causing close of
session 0x0 due to java.io.IOException: ZooKeeperServer not running
2018-10-22 06:04:51,356 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@1044] - Closed socket connection for
client /192.72.7.201:59310 (no session established for client)
2018-10-22 06:04:51,402 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxnFactory@192] - Accepted socket connection
from /192.3.101.219:56302
2018-10-22 06:04:51,402 [myid:2] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@373] - Exception causing close of
session 0x0 due to java.io.IOException: ZooKeeperServer not running
2018-10-22 06:04:51,402 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@1044] - Closed socket connection for
client /192.3.101.219:56302 (no session established for client)
2018-10-22 06:04:51,467 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxnFactory@192] - Accepted socket connection
from /192.72.7.205:46694
2018-10-22 06:04:51,468 [myid:2] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@373] - Exception causing close of
session 0x0 due to java.io.IOException: ZooKeeperServer not running
2018-10-22 06:04:51,468 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@1044] - Closed socket connection for
client /192.72.7.205:46694 (no session established for client)
2018-10-22 06:04:51,476 [myid:2] - INFO
[QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2182:QuorumPeer@935] - FOLLOWING
2018-10-22 06:04:51,476 [myid:2] - INFO
[QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2182:ZooKeeperServer@173] - Created
server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
datadir /app/zookeeper-3.4.10/log/version-2 snapdir
/app/zookeeper-3.4.10/data/version-2
2018-10-22 06:04:51,476 [myid:2] - INFO
[QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2182:Follower@64] - FOLLOWING - LEADER
ELECTION TOOK - 1008
2018-10-22 06:04:51,477 [myid:2] - INFO
[QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2182:QuorumPeer$QuorumServer@167] -
Resolved hostname: srch0120.whc.dc02.us.adp to address: srch0120/
192.3.35.243

solr.log
2018-10-22 10:02:21.466 WARN  (main-SendThread(srch0118:2182)) [   ]
o.a.z.ClientCnxn Client session timed out, have not heard from server in
26675ms for sessionid 0x5665c67cb0d000b
2018-10-22 10:02:48.891 WARN  (main-SendThread(srch0119:2182)) [   ]
o.a.z.ClientCnxn Client session timed out, have not heard from server in
26667ms for sessionid 0x5665c67cb0d000b
2018-10-22 10:02:57.720 WARN  (main-SendThread(srch0117:2182)) [   ]
o.a.z.ClientCnxn Client session timed out, have not heard from server in
8007ms for sessionid 0x5665c67cb0d000b
2018-10-22 10:03:06.776 WARN  (main-SendThread(srch0120:2182)) [   ]
o.a.z.ClientCnxn Client session timed out, have not heard from server in
8006ms for sessionid 0x5665c67cb0d000b
2018-10-22 10:03:15.734 WARN  (main-SendThread(srch0121:2182)) [   ]
o.a.z.ClientCnxn Client session timed out, have not heard from server in
8004ms for sessionid 0x5665c67cb0d000b
2018-10-22 10:03:24.568 WARN  (main-SendThread(srch0118:2182)) [   ]
o.a.z.ClientCnxn Client session timed out, have not heard from server in
8003ms for sessionid 0x5665c67cb0d000b
2018-10-22 10:03:33.844 WARN  (main-SendThread(srch0119:2182)) [   ]
o.a.z.ClientCnxn Client session timed out, have not heard from server in
8001ms for sessionid 0x5665c67cb0d000b
2018-10-22 10:03:42.377 WARN  (main-SendThread(srch0117:2182)) [   ]
o.a.z.ClientCnxn Client session timed out, have not heard from server in
8008ms for sessionid 0x5665c67cb0d000b
2018-10-22 10:03:51.138 WARN  (main-SendThread(srch0120:2182)) [   ]
o.a.z.ClientCnxn Client session timed out, have not heard from server in
8004ms for sessionid 0x5665c67cb0d000b
2018-10-22 10:04:10.229 WARN  (main-SendThread(srch0119:2182)) [   ]
o.a.z.ClientCnxn Client session timed out, have not heard from server in
8008ms for sessionid 0x5665c67cb0d000b
2018-10-22 10:04:20.669 WARN  (main-SendThread(srch0119:2182)) [   ]
o.a.z.ClientCnxn Client session timed out, have not heard from server in
9272ms for sessionid 0x5665c67cb0d000b
2018-10-22 10:04:34.076 WARN  (main-SendThread(srch0119:2182)) [   ]
o.a.z.ClientCnxn Client session timed out, have not heard from server in
9873ms for sessionid 0x5665c67cb0d000b

Re: ZookeeperServer not running/Client Session timed out

Posted by Susheel Kumar <su...@gmail.com>.
Hi Shawn,

Thanks for pointing out that it may be due to network/VM issue. I looked
the ZK logs in detail and i see below Socket timeout issue after which ZK
shutdown is called.

Is that good enough to confirm some VM/network issue not any ZK/Solr issue.
I am also including dmesg output during the timestamps we had issues

...
...
2018-10-22 06:03:56,022 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:NIOServerCnxn@1044] - Closed socket connection for
client /192.3.101.219:55704 which had sessionid 0x5665c67cb0d0000
2018-10-22 06:03:56,022 [myid:2] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2182:Learner@108] - Revalidating client: 0x5665c67cb0d0000
2018-10-22 06:03:56,265 [myid:2] - WARN
[QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2182:Follower@89] - Exception when
following the leader
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
        at
org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
        at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
        at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
2018-10-22 06:03:56,266 [myid:2] - INFO
[QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2182:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
        at
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
2018-10-22 06:03:56,266 [myid:2] - INFO
[QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2182:NIOServerCnxn@1044] - Closed
socket connection for client /192.72.5.213:57834 which had sessionid
0x46591d67d0c0024
2018-10-22 06:03:56,266 [myid:2] - INFO
[QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2182:NIOServerCnxn@1044] - Closed
socket connection for client /192.3.95.181:55192 which had sessionid
0x3665c676caf0004
2018-10-22 06:03:56,266 [myid:2] - INFO
[QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2182:NIOServerCnxn@1044] - Closed
socket connection for client /192.3.224.15:38712 which had sessionid
0x2668d42319e0012
...
...

dmesg

srch0117
[Mon Oct 22 06:04:37 2018] mptscsih: ioc0: attempting task abort!
(sc=ffff88081bdf90c0)
[Mon Oct 22 06:04:37 2018] sd 0:0:1:0: [sdb] tag#0 CDB: Write(10) 2a 00 00
50 43 df 00 00 10 00
[Mon Oct 22 06:04:41 2018] mptscsih: ioc0: task abort: SUCCESS (rv=2002)
(sc=ffff88081bdf90c0)

srch0118
[Mon Oct 22 06:04:41 2018] mptscsih: ioc0: attempting task abort!
(sc=ffff8807b7c7b200)
[Mon Oct 22 06:04:41 2018] sd 0:0:1:0: [sdb] tag#3 CDB: Write(10) 2a 00 00
33 da 80 00 00 08 00
[Mon Oct 22 06:04:49 2018] mptscsih: ioc0: task abort: SUCCESS (rv=2002)
(sc=ffff8807b7c7b200)
[Mon Oct 22 06:04:49 2018] mptscsih: ioc0: attempting task abort!
(sc=ffff88081c09a680)
[Mon Oct 22 06:04:49 2018] sd 0:0:1:0: [sdb] tag#4 CDB: Write(10) 2a 00 00
50 13 e8 00 00 0f 00
[Mon Oct 22 06:04:50 2018] mptscsih: ioc0: task abort: SUCCESS (rv=2002)
(sc=ffff88081c09a680)

srch0119
[Mon Oct 22 06:04:30 2018] mptscsih: ioc0: attempting task abort!
(sc=ffff8807777e63c0)
[Mon Oct 22 06:04:30 2018] sd 0:0:1:0: [sdb] tag#0 CDB: Write(10) 2a 00 00
38 06 b0 00 00 18 00
[Mon Oct 22 06:04:38 2018] mptscsih: ioc0: task abort: SUCCESS (rv=2002)
(sc=ffff8807777e63c0)

srch0120
Nothing around 6

srch0121
[Mon Oct 22 06:00:01 2018] BTRFS info (device sda1): relocating block group
1273285836800 flags 1
[Mon Oct 22 06:00:02 2018] BTRFS info (device sda1): found 8 extents
[Mon Oct 22 06:00:05 2018] BTRFS info (device sda1): found 8 extents
[Mon Oct 22 06:00:05 2018] BTRFS info (device sda1): relocating block group
1274527350784 flags 1
[Mon Oct 22 06:00:05 2018] BTRFS info (device sda1): found 8 extents
[Mon Oct 22 06:00:07 2018] BTRFS info (device sda1): found 8 extents
[Mon Oct 22 06:00:07 2018] BTRFS info (device sda1): relocating block group
1275601092608 flags 1
[Mon Oct 22 06:00:08 2018] BTRFS info (device sda1): found 8 extents
[Mon Oct 22 06:00:10 2018] BTRFS info (device sda1): found 8 extents
[Mon Oct 22 06:00:10 2018] BTRFS info (device sda1): relocating block group
1274493796352 flags 34
[Mon Oct 22 06:00:10 2018] BTRFS info (device sda1): relocating block group
1277748576256 flags 34
[Mon Oct 22 06:00:10 2018] BTRFS info (device sda1): relocating block group
1277782130688 flags 34
[Mon Oct 22 06:00:10 2018] BTRFS info (device sda1): relocating block group
1277815685120 flags 34
[Mon Oct 22 06:00:10 2018] BTRFS info (device sda1): relocating block group
1277849239552 flags 34

[Fri Oct 19 12:58:00 2018] mptscsih: ioc0: attempting task abort!
(sc=ffff88081c17ce00)
[Fri Oct 19 12:58:00 2018] sd 0:0:0:0: [sda] tag#2 CDB: Write(10) 2a 00 00
4f 50 00 00 00 80 00
[Fri Oct 19 12:58:11 2018] mptscsih: ioc0: WARNING - Issuing Reset from
mptscsih_IssueTaskMgmt!! doorbell=0x24000000
[Fri Oct 19 12:58:37 2018] NMI watchdog: BUG: soft lockup - CPU#5 stuck for
22s! [splunkd:2753]
[Fri Oct 19 12:58:37 2018] Modules linked in: tcp_diag udp_diag inet_diag
unix_diag af_packet_diag netlink_diag falcon_lsm_serviceable(PEN)
falcon_nf_netcontain(PEN) falcon_lsm_pinned_5607(EN) binfmt_misc af_packet
iscsi_ibft iscsi_boot_sysfs vmw_vsock_vmci_transport vsock
falcon_lsm_pinned_5305(EN) xfs libcrc32c sb_edac edac_core coretemp
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drbg ansi_cprng
aesni_intel aes_x86_64 ppdev joydev pcspkr vmxnet3 vmw_balloon lrw gf128mul
glue_helper i2c_piix4 ablk_helper cryptd vmw_vmci parport_pc shpchp mptctl
parport fjes processor ac arc4 ecb ppp_mppe btrfs sr_mod cdrom ata_generic
xor hid_generic usbhid raid6_pq sd_mod ata_piix ahci libahci vmwgfx
crc32c_intel serio_raw drm_kms_helper uhci_hcd syscopyarea sysfillrect
ehci_pci sysimgblt fb_sys_fops ehci_hcd ttm mptspi
[Fri Oct 19 12:58:37 2018]  libata usbcore scsi_transport_spi drm mptscsih
mptbase usb_common button sg ppp_generic slhc dm_multipath dm_mod
scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod ecryptfs autofs4 [last
unloaded: falcon_lsm_serviceable]
[Fri Oct 19 12:58:37 2018] Supported: No, Proprietary and Unsupported
modules are loaded
[Fri Oct 19 12:58:37 2018] CPU: 5 PID: 2753 Comm: splunkd Tainted: P
    E   N  4.4.132-94.33-default #1
[Fri Oct 19 12:58:37 2018] Hardware name: VMware, Inc. VMware Virtual
Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[Fri Oct 19 12:58:37 2018] task: ffff88081b3b56c0 ti: ffff8808017f0000
task.ti: ffff8808017f0000
[Fri Oct 19 12:58:37 2018] RIP: 0010:[<ffffffff8110acda>]
[<ffffffff8110acda>] smp_call_function_single+0xba/0x110
[Fri Oct 19 12:58:37 2018] RSP: 0018:ffff8808017f3d28  EFLAGS: 00000202
[Fri Oct 19 12:58:37 2018] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffffffff81e3ca00
[Fri Oct 19 12:58:37 2018] RDX: 0000000000000001 RSI: 00000000000000fb RDI:
0000000000000286
[Fri Oct 19 12:58:37 2018] RBP: ffffffff8106e9d0 R08: ffffffff81f639a0 R09:
0000000000000000
[Fri Oct 19 12:58:37 2018] R10: 00007f8112149d50 R11: ffff8800bbae3e20 R12:
00007f8111f4c000
[Fri Oct 19 12:58:37 2018] R13: ffff88081dbb3800 R14: ffff88081dbb3b70 R15:
ffff88081e575400
[Fri Oct 19 12:58:37 2018] FS:  00007f8102dff700(0000)
GS:ffff88083fd40000(0000) knlGS:0000000000000000
[Fri Oct 19 12:58:37 2018] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Oct 19 12:58:37 2018] CR2: 00007f810239a000 CR3: 00000000bb0e8000 CR4:
0000000000040670
[Fri Oct 19 12:58:37 2018] Stack:
[Fri Oct 19 12:58:37 2018]  00000001bf1ea867 0000000000000005
0000000000000000 ffffffff8106e9d0
[Fri Oct 19 12:58:37 2018]  ffff8808017f3d70 0000000000000003
00007f8111f4d000 00007f8111f4d000
[Fri Oct 19 12:58:37 2018]  ffffffff8106ee8b ffff88081dbb3800
00007f8111f4c000 00007f8111f4d000
[Fri Oct 19 12:58:37 2018] Call Trace:
[Fri Oct 19 12:58:37 2018]  [<ffffffff8106ee8b>]
native_flush_tlb_others+0x10b/0x190
[Fri Oct 19 12:58:37 2018]  [<ffffffff8106efab>]
flush_tlb_mm_range+0x9b/0x170
[Fri Oct 19 12:58:37 2018]  [<ffffffff811be766>]
tlb_flush_mmu_tlbonly+0x66/0xc0
[Fri Oct 19 12:58:37 2018]  [<ffffffff811bf59f>] tlb_finish_mmu+0xf/0x40
[Fri Oct 19 12:58:37 2018]  [<ffffffff811c6ce3>] unmap_region+0xc3/0xf0
[Fri Oct 19 12:58:37 2018]  [<ffffffff811c8dbb>] do_munmap+0x20b/0x390
[Fri Oct 19 12:58:37 2018]  [<ffffffff811c8f7b>] vm_munmap+0x3b/0x50
[Fri Oct 19 12:58:37 2018]  [<ffffffff811c9fdd>] SyS_munmap+0x1d/0x30
[Fri Oct 19 12:58:37 2018]  [<ffffffff816173c3>]
entry_SYSCALL_64_fastpath+0x1e/0xca
[Fri Oct 19 12:58:37 2018] DWARF2 unwinder stuck at
entry_SYSCALL_64_fastpath+0x1e/0xca

[Fri Oct 19 12:58:37 2018] Leftover inexact backtrace:

[Fri Oct 19 12:58:37 2018] Code: 89 ea e8 5a fe ff ff 48 83 c4 30 5b 5d c3
48 8d 74 24 10 48 89 d1 89 df 48 89 ea e8 41 fe ff ff 8b 54 24 28 83 e2 01
74 0b f3 90 <8b> 54 24 28 83 e2 01 75 f5 48 83 c4 30 5b 5d c3 8b 05 08 39 13
[Fri Oct 19 12:58:57 2018] mptscsih: ioc0: task abort: SUCCESS (rv=2002)
(sc=ffff88081c17ce00)
[Fri Oct 19 12:58:57 2018] mptscsih: ioc0: attempting task abort!
(sc=ffff88081c17cc80)
[Fri Oct 19 12:58:57 2018] sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00
5c d7 78 00 02 00 00
[Fri Oct 19 12:58:57 2018] mptscsih: ioc0: task abort: SUCCESS (rv=2002)
(sc=ffff88081c17cc80)
[Fri Oct 19 12:58:57 2018] mptscsih: ioc0: attempting task abort!
(sc=ffff88081d44e940)
[Fri Oct 19 12:58:57 2018] sd 0:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 00
54 8f 70 00 02 00 00
[Fri Oct 19 12:58:57 2018] mptscsih: ioc0: task abort: SUCCESS (rv=2002)
(sc=ffff88081d44e940)
[Fri Oct 19 12:58:57 2018] scsi target0:0:0: Beginning Domain Validation
[Fri Oct 19 12:58:57 2018] scsi target0:0:0: Domain Validation skipping
write tests
[Fri Oct 19 12:58:57 2018] scsi target0:0:0: Ending Domain Validation
[Fri Oct 19 12:58:57 2018] scsi target0:0:0: FAST-40 WIDE SCSI 80.0 MB/s ST
(25 ns, offset 127)
[Fri Oct 19 12:58:58 2018] scsi target0:0:1: Beginning Domain Validation
[Fri Oct 19 12:58:58 2018] scsi target0:0:1: Domain Validation skipping
write tests
[Fri Oct 19 12:58:58 2018] scsi target0:0:1: Ending Domain Validation
[Fri Oct 19 12:58:58 2018] scsi target0:0:1: FAST-40 WIDE SCSI 80.0 MB/s ST
(25 ns, offset 127)
[Fri Oct 19 12:58:58 2018] scsi target0:0:2: Beginning Domain Validation
[Fri Oct 19 12:58:58 2018] scsi target0:0:2: Domain Validation skipping
write tests
[Fri Oct 19 12:58:58 2018] scsi target0:0:2: Ending Domain Validation
[Fri Oct 19 12:58:58 2018] scsi target0:0:2: FAST-40 WIDE SCSI 80.0 MB/s ST
(25 ns, offset 127)
[Fri Oct 19 12:58:58 2018] scsi target0:0:3: Beginning Domain Validation
[Fri Oct 19 12:58:58 2018] scsi target0:0:3: Domain Validation skipping
write tests
[Fri Oct 19 12:58:58 2018] scsi target0:0:3: Ending Domain Validation
[Fri Oct 19 12:58:58 2018] scsi target0:0:3: FAST-40 WIDE SCSI 80.0 MB/s ST
(25 ns, offset 127)


On Mon, Oct 22, 2018 at 10:15 PM Susheel Kumar <su...@gmail.com>
wrote:

> Hi Shawn,
>
> Here is the link for Solr GC log and it doesn't look Solr GC problem. The
> total GC is 12 GB.  The GC log is from yesterday and the issue happened
> this morning i.e. 10/22.
>
>
> https://www.dropbox.com/s/zdlu9sk8kc469ls/Screen%20Shot%202018-10-22%20at%2010.08.37%20PM.png?dl=0
>
>
> It may be network issue but just looking the message "ZookeeperSolr server
> not running" and later it instantiate doesn't give any clue.
>
> Thnx
>
> On Mon, Oct 22, 2018 at 9:54 PM Shawn Heisey <ap...@elyograg.org> wrote:
>
>> On 10/22/2018 7:32 PM, Susheel Kumar wrote:
>> > Hi Shawn, you meant ZK GC log correct?
>>
>> There was another potential cause I was thinking of, but when I got to
>> where I was going to list them in the previous message, I could not for
>> the life of me remember what the other one was.
>>
>> I just remembered:  This problem could be caused by severe network
>> connectivity issues between your servers.  A few dropped packets
>> probably isn't enough ... I think it would have to be a severe problem.
>>
>> Thanks,
>> Shawn
>>
>>

Re: ZookeeperServer not running/Client Session timed out

Posted by Susheel Kumar <su...@gmail.com>.
Hi Shawn,

Here is the link for Solr GC log and it doesn't look Solr GC problem. The
total GC is 12 GB.  The GC log is from yesterday and the issue happened
this morning i.e. 10/22.

https://www.dropbox.com/s/zdlu9sk8kc469ls/Screen%20Shot%202018-10-22%20at%2010.08.37%20PM.png?dl=0


It may be network issue but just looking the message "ZookeeperSolr server
not running" and later it instantiate doesn't give any clue.

Thnx

On Mon, Oct 22, 2018 at 9:54 PM Shawn Heisey <ap...@elyograg.org> wrote:

> On 10/22/2018 7:32 PM, Susheel Kumar wrote:
> > Hi Shawn, you meant ZK GC log correct?
>
> There was another potential cause I was thinking of, but when I got to
> where I was going to list them in the previous message, I could not for
> the life of me remember what the other one was.
>
> I just remembered:  This problem could be caused by severe network
> connectivity issues between your servers.  A few dropped packets
> probably isn't enough ... I think it would have to be a severe problem.
>
> Thanks,
> Shawn
>
>

Re: ZookeeperServer not running/Client Session timed out

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/22/2018 7:32 PM, Susheel Kumar wrote:
> Hi Shawn, you meant ZK GC log correct?

There was another potential cause I was thinking of, but when I got to 
where I was going to list them in the previous message, I could not for 
the life of me remember what the other one was.

I just remembered:  This problem could be caused by severe network 
connectivity issues between your servers.  A few dropped packets 
probably isn't enough ... I think it would have to be a severe problem.

Thanks,
Shawn


Re: ZookeeperServer not running/Client Session timed out

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/22/2018 7:32 PM, Susheel Kumar wrote:
> Hi Shawn, you meant ZK GC log correct?

No, the GC log from Solr.  A heap that's too small could happen to ZK as 
well, but I would expect that problem more on the Solr side.

You could try increasing the heap size to see if that makes any 
difference.  The GC log would tell me for sure whether that's a problem.

Thanks,
Shawn


Re: ZookeeperServer not running/Client Session timed out

Posted by Susheel Kumar <su...@gmail.com>.
Hi Shawn, you meant ZK GC log correct?

Thnx

On Mon, Oct 22, 2018 at 7:03 PM Shawn Heisey <ap...@elyograg.org> wrote:

> On 10/22/2018 3:31 PM, Susheel Kumar wrote:
> > Hello,
> >
> > I am seeing "ZookeeperServer not running" WARM messages in zookeeper logs
> > which is causing the Solr client connections to timeout...
> >
> > What could be the problem?
> >
> > ZK: 3.4.10
> >
> > Zookeeper.out
> > ==
>
> For help with the ZK server log, you'll need to consult the ZooKeeper
> project.  The language in their log entries seems plain enough, but they
> will be able to tell you precisely what it means.
>
> > solr.log
> > 2018-10-22 10:02:21.466 WARN  (main-SendThread(srch0118:2182)) [   ]
> > o.a.z.ClientCnxn Client session timed out, have not heard from server in
> > 26675ms for sessionid 0x5665c67cb0d000b
>
> The ZK client in Solr hasn't heard from the ZK server in over 26
> seconds, so it considers that connection to have timed out, and will
> throw the connection away.  It should try again to establish a new
> connection ... but whatever's causing the problem will probably also
> affect the new connection.
>
> It's a particularly bad sign for the ZK connection to time out,
> especially on an interval like 26 seconds.  That's a REALLY long time
> for software like Solr and ZK.
>
> One of the things that can cause problems like this is having a heap
> that's too small, so Java must spend the majority of its time doing
> garbage collection, rather than running the program it's been asked to
> run.  There are sometimes other causes, but that is a very common cause.
>
> Can you share a garbage collection log from a time when these errors
> happen?  Solr should set up Java so that it creates a GC log.  You'll
> need to use a file sharing site (like Dropbox) -- email attachments
> almost never make it to the list.
>
> Thanks,
> Shawn
>
>

Re: ZookeeperServer not running/Client Session timed out

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/22/2018 3:31 PM, Susheel Kumar wrote:
> Hello,
>
> I am seeing "ZookeeperServer not running" WARM messages in zookeeper logs
> which is causing the Solr client connections to timeout...
>
> What could be the problem?
>
> ZK: 3.4.10
>
> Zookeeper.out
> ==

For help with the ZK server log, you'll need to consult the ZooKeeper 
project.  The language in their log entries seems plain enough, but they 
will be able to tell you precisely what it means.

> solr.log
> 2018-10-22 10:02:21.466 WARN  (main-SendThread(srch0118:2182)) [   ]
> o.a.z.ClientCnxn Client session timed out, have not heard from server in
> 26675ms for sessionid 0x5665c67cb0d000b

The ZK client in Solr hasn't heard from the ZK server in over 26 
seconds, so it considers that connection to have timed out, and will 
throw the connection away.  It should try again to establish a new 
connection ... but whatever's causing the problem will probably also 
affect the new connection.

It's a particularly bad sign for the ZK connection to time out, 
especially on an interval like 26 seconds.  That's a REALLY long time 
for software like Solr and ZK.

One of the things that can cause problems like this is having a heap 
that's too small, so Java must spend the majority of its time doing 
garbage collection, rather than running the program it's been asked to 
run.  There are sometimes other causes, but that is a very common cause.

Can you share a garbage collection log from a time when these errors 
happen?  Solr should set up Java so that it creates a GC log.  You'll 
need to use a file sharing site (like Dropbox) -- email attachments 
almost never make it to the list.

Thanks,
Shawn