Posted to solr-user@lucene.apache.org by heaven <ah...@gmail.com> on 2013/05/09 13:02:50 UTC

SolrCloud: IOException occured when talking to server at

Hi, I am observing lots of these errors with SolrCloud.

Here are the instructions I use to run the services:
zookeeper:
  1: cd /opt/zookeeper/
  2: sudo bin/zkServer.sh start zoo1.cfg
  3: sudo bin/zkServer.sh start zoo2.cfg
  4: sudo bin/zkServer.sh start zoo3.cfg

shards:
  1: cd /opt/solr-cluster/shard1/
     sudo su solr -c "java -Xmx4096M \
       -DzkHost=localhost:2181,localhost:2182,localhost:2183 \
       -Dbootstrap_confdir=./solr/conf -Dcollection.configName=Carmen -DnumShards=2 \
       -jar start.jar etc/jetty.xml etc/jetty-logging.xml &"
  2: cd ../shard2/
     sudo su solr -c "java -Xmx4096M \
       -DzkHost=localhost:2181,localhost:2182,localhost:2183 \
       -jar start.jar etc/jetty.xml etc/jetty-logging.xml &"

replicas:
  1: cd ../replica1/
     sudo su solr -c "java -Xmx4096M \
       -DzkHost=localhost:2181,localhost:2182,localhost:2183 \
       -jar start.jar etc/jetty.xml etc/jetty-logging.xml &"
  2: cd ../replica2/
     sudo su solr -c "java -Xmx4096M \
       -DzkHost=localhost:2181,localhost:2182,localhost:2183 \
       -jar start.jar etc/jetty.xml etc/jetty-logging.xml &"

zoo1.cfg:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial 
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just 
# example sakes.
dataDir=/opt/zookeeper/data/1
# the port at which the clients will connect
clientPort=2181

server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890

zoo2.cfg and zoo3.cfg are the same except for dataDir and clientPort.
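For example, zoo2.cfg ends up looking roughly like this (only dataDir and
clientPort change; the data/2 path here is indicative):

  tickTime=2000
  initLimit=10
  syncLimit=5
  dataDir=/opt/zookeeper/data/2
  clientPort=2182

  server.1=localhost:2888:3888
  server.2=localhost:2889:3889
  server.3=localhost:2890:3890

Each dataDir also needs its own myid file (containing 1, 2 or 3) so the instance
knows which server.N entry it is.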

I also very often see org.apache.solr.common.SolrException: No registered
leader was found, plus lots of other errors. I just updated jetty.xml, set
org.eclipse.jetty.server.Request.maxFormContentSize to 10 MB, and restarted the
cluster; half of the errors are gone, but the IOException one is still here.
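For reference, the jetty.xml change is roughly the following (10 MB = 10485760
bytes; the exact surrounding markup may differ with the Jetty version bundled
with Solr):

  <!-- etc/jetty.xml, inside the Server Configure block -->
  <Call name="setAttribute">
    <Arg>org.eclipse.jetty.server.Request.maxFormContentSize</Arg>
    <Arg>10485760</Arg>
  </Call>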

I am re-indexing a few models (a Rails application); they have from 1,000,000
to 20,000,000 records each. For indexing I have a queue (MongoDB) and a few
workers that process it in batches of 200-500 records.

All Solr and ZooKeeper instances run on the same server: 2 Intel Xeon
processors, 8 total cores, 32 GB of memory, and fast RAID storage.

Please help me figure out what could be causing these errors and how I can fix
them. Please tell me if I can provide more information about the server setup,
logs, errors, etc.

Best,
Alex

<http://lucene.472066.n3.nabble.com/file/n4061831/Topology.png> 

Shard 1:
<http://lucene.472066.n3.nabble.com/file/n4061831/Shard1.png> 
Replica 1:
<http://lucene.472066.n3.nabble.com/file/n4061831/Replica1.png> 
Shard 2:
<http://lucene.472066.n3.nabble.com/file/n4061831/Shard2.png> 
Replica 2:
<http://lucene.472066.n3.nabble.com/file/n4061831/Replica2.png> 




Re: SolrCloud: IOException occured when talking to server at

Posted by heaven <ah...@gmail.com>.
Hi, thanks for the links and for your help. The server has now been running for
a third day in a row with no issues. What was done:
1. Applied these GC tuning options: -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=80 (the full start command is sketched after
this list).
2. Optimized the schema and index size (the index shrank at least 8 times).
3. Updated the application code. Previously we indexed lots of data that we
shouldn't have, including lots of HTML content that is now stripped before
indexing. In combination with NGramFilterFactory with gram sizes 1..20, all of
this made for an explosive mixture: it caused high load during indexing and
resulted in a big, redundant index.
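For reference, the Solr start commands now look roughly like this (the same
command as before, just with the two GC flags added):

  sudo su solr -c "java -Xmx4096M \
    -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=80 \
    -DzkHost=localhost:2181,localhost:2182,localhost:2183 \
    -jar start.jar etc/jetty.xml etc/jetty-logging.xml &"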

But my earlier point about stability still stands. I observed that when a shard
went down (both the shard leader and all of its replicas), some docs ended up
missing from the index. So if the shards were on separate physical servers and
one of those servers went down (for any reason), that could cause trouble.

Best,
Alex




Re: SolrCloud: IOException occured when talking to server at

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/10/2013 1:06 PM, heaven wrote:
> Again: I just finished reindexing, server utilization was about 5-10%, and I
> started index optimization. As a result I have now lost the entire index (again)
> and got a lot of errors; they appear so fast and contain zero useful information.
>
> <http://lucene.472066.n3.nabble.com/file/n4062434/Screenshot_546.png>
> <http://lucene.472066.n3.nabble.com/file/n4062434/Screenshot_547.png>
>
> You can see that server is not loaded at all, and load was the same when I
> started the optimization process.
>
> BTW, it seems like an infinite loop: the picture does not change, the replica is
> down, and the shard is recovering.
>
> In the shard log I see:
> org.apache.solr.common.SolrException: No registered leader was found,​
> collection:crm-prod slice:shard1
>
> And the same in replica + sometimes:
> Error getting leader from zk
> org.apache.solr.common.SolrException: No registered leader was found,
> collection:crm-test slice:shard1

Based on the screenshot of your processes, I don't think you have enough 
RAM for what this machine is doing.  The performance issues are causing 
zookeeper communication problems, which results in SolrCloud going a 
little crazy because it can't tell what's really going on.

Each of your four Solr processes has a 23GB virtual memory size.  Since 
you have said that your Solr JVMs have a max heap of 4GB, this suggests 
that each of those Solr processes has an index in the neighborhood of 
19GB.  That's 76GB of index.

You've got 32GB of RAM.  Four Solr servers that each have a max heap of 
4GB leaves 16GB of free memory.  With 76GB of index, you'll want between 
40 and 80GB of free memory so that your index is well-cached.  You're 
looking at needing a total memory size of between 64 and 96GB just for Solr.

You also have a MongoDB process that is using 4GB of real memory, with a 
56GB virtual memory size.  That probably means that your mongodb 
database is in the neighborhood of 50GB, though I could be off on that 
estimate.  MongoDB uses free memory for caching in the same way that 
Solr does, so add on at least half the size of your MongoDB database to 
your memory requirement.

Basically, an ideal machine for what you're trying to do will have at 
least 128GB of RAM.  With 96GB, you might still be OK.

The problem is likely further complicated by long GC pauses during heavy 
indexing.  You'll want to tune your garbage collection.

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn


Re: SolrCloud: IOException occured when talking to server at

Posted by heaven <ah...@gmail.com>.
Again: I just finished reindexing, server utilization was about 5-10%, and I
started index optimization. As a result I have now lost the entire index (again)
and got a lot of errors; they appear so fast and contain zero useful information.

<http://lucene.472066.n3.nabble.com/file/n4062434/Screenshot_546.png> 
<http://lucene.472066.n3.nabble.com/file/n4062434/Screenshot_547.png> 

You can see that the server is not loaded at all, and the load was the same when
I started the optimization process.

BTW, it seems like an infinite loop: the picture does not change, the replica is
down, and the shard is recovering.

In the shard log I see:
org.apache.solr.common.SolrException: No registered leader was found,
collection:crm-prod slice:shard1

And the same in replica + sometimes:
Error getting leader from zk
org.apache.solr.common.SolrException: No registered leader was found,
collection:crm-test slice:shard1
	at
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:484)
	at
org.apache.solr.common.cloud.ZkStateReader.getLeaderUrl(ZkStateReader.java:458)
	at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:843)
	at org.apache.solr.cloud.ZkController.register(ZkController.java:776)
	at org.apache.solr.cloud.ZkController$1.command(ZkController.java:216)
	at
org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:117)
	at
org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:46)
	at
org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:91)
	at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)

in zookeeper log:
2013-05-10 06:12:21,788 [myid:3] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2183:NIOServerCnxn@349] - caught end
of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x33e89eb26e40006, likely client has closed socket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:679)
2013-05-10 06:13:16,346 [myid:3] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2183:NIOServerCnxn@349] - caught end
of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x23e89eb1ab20004, likely client has closed socket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:679)
2013-05-10 07:57:50,677 [myid:3] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2183:NIOServerCnxn@349] - caught end
of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x33e89eb26e40004, likely client has closed socket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:679)
2013-05-10 07:58:05,956 [myid:3] - ERROR
[CommitProcessor:3:NIOServerCnxn@180] - Unexpected Exception: 
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at
org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
        at
org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
        at
org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1113)
        at
org.apache.zookeeper.server.DataTree.setWatches(DataTree.java:1293)
        at
org.apache.zookeeper.server.ZKDatabase.setWatches(ZKDatabase.java:384)
        at
org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:304)
        at
org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
2013-05-10 07:58:05,957 [myid:3] - ERROR
[CommitProcessor:3:NIOServerCnxn@180] - Unexpected Exception: 
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at org.apache.zookeeper.server.NIOServerCnxn.s2013-05-10
11:07:55,805 [myid:2] - WARN  [SyncThread:2:FileTxnLog@321] - fsync-ing the
write ahead log in SyncThread:2 took 2302ms which will adversely effect
operation latency. See the ZooKeeper troubleshooting guide
2013-05-10 11:10:31,354 [myid:2] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2182:NIOServerCnxn@349] - caught end
of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x33e89eb26e40008, likely client has closed socket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:679)
2013-05-10 11:13:28,610 [myid:2] - ERROR
[LearnerHandler-/127.0.0.1:53815:LearnerHandler@562] - Unexpected exception
causing shutdown while sock still open
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:146)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
        at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:476)
2013-05-10 11:13:28,611 [myid:2] - WARN 
[LearnerHandler-/127.0.0.1:53815:LearnerHandler@575] - ******* GOODBYE
/127.0.0.1:53815 ********
2013-05-10 11:23:28,186 [myid:2] - WARN  [SyncThread:2:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:2 took 1079ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
ocket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:679)
2013-05-10 10:59:06,943 [myid:3] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2183:NIOServerCnxn@349] - caught end
of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x23e89eb1ab20004, likely client has closed socket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:679)
2013-05-10 11:07:55,805 [myid:3] - WARN  [SyncThread:3:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:3 took 2625ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2013-05-10 11:08:31,101 [myid:3] - WARN  [SyncThread:3:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:3 took 3102ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2013-05-10 11:10:45,152 [myid:3] - WARN  [SyncThread:3:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:3 took 1587ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2013-05-09 08:32:34,358 [myid:1] - WARN 
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@118] - Got zxid
0x300000001 expected 0x1
2013-05-10 08:49:35,497 [myid:1] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end
of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x33e89eb26e40008, likely client has closed socket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:679)
2013-05-10 09:58:31,378 [myid:1] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end
of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x13e89eb1aa90005, likely client has closed socket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:679)
2013-05-10 11:08:31,101 [myid:1] - WARN  [SyncThread:1:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:1 took 1548ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2013-05-10 11:13:29,793 [myid:1] - WARN 
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
following the leader
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at
org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:138)
        at org.apache.zookeeper.server.quorum.Learner.ping(Learner.java:465)
        at
org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:112)
        at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:86)
        at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
2013-05-10 11:13:29,794 [myid:1] - ERROR
[FollowerRequestProcessor:1:FollowerRequestProcessor@93] - Unexpected
exception causing exit
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at
org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:138)
        at
org.apache.zookeeper.server.quorum.Learner.request(Learner.java:187)
        at
org.apache.zookeeper.server.quorum.FollowerRequestProcessor.run(FollowerRequestProcessor.java:88)
2013-05-10 11:13:30,076 [myid:1] - WARN 
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@373] - Got zxid 0x300001697
expected 0x1
2013-05-10 11:13:30,648 [myid:1] - WARN 
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@118] - Got zxid
0x30000169d expected 0x1

The load wasn't high at all; there were at most 10 requests per minute.

Best,
Alex




Re: Re: SolrCloud: IOException occured when talking to server at

Posted by heaven <ah...@gmail.com>.
CPU utilization is 90-100%, IO wait is 0.5-3%. The server is:

CPU: 2 x Xeon E5606 2.13GHz, 8MB cache, 4.8GT/sec (8 total cores)
Memory: 32 GB RAM
Disk Space: 10 x 300GB 15k RPM Seagate SAS
Raid Card(s):
OS: Scientific Linux 6.3


This high CPU usage occurs only during indexing, since we have to recreate the
whole index. We also do not have a very high query volume; the app is intended
for internal company use only, so there aren't thousands of users. But some of
our background services use Solr to analyze certain kinds of data, e.g. to
perform keyword matching and tag content.

The current index size is 64GB, but that is only part of the data. After a
complete re-indexing its size grows to 160-180GB and keeps growing. It is not
well optimized, though: we're using EdgeNGramFilterFactory with minGramSize="1"
both where it is necessary and where it is not, so I think the size could be
reduced by 30-50% after optimization.
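For illustration, the field type in question is along these lines (a simplified
sketch; the type name and tokenizer are placeholders):

  <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- minGramSize="1" emits a gram for every leading character of every token,
           which is what bloats the index; a larger minGramSize keeps it smaller -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="20"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>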

Best,
Alex






Re: SolrCloud: IOException occured when talking to server at

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/10/2013 2:00 AM, heaven wrote:
> UPD:
> Forgot to confirm: we're using Solr 4.2.1 and will wait for Solr 4.3.1 or
> 4.4 as you advised.

Solr 4.2.1 should be pretty stable.

You mentioned on the previous message that your server load is below 16.
 This is very high.  What is your CPU utilization, and most importantly,
what is your iowait percentage?  If you are looking at top, this is the
"%wa" value.

Some other things to check:

- How big are your indexes?  Your server is hosting six of them - three
collections, each with two shards.  Each instance of Solr has a max heap
of 4GB.  The OS and zookeeper probably use up another 1GB or so.  If
nothing else is running on the server, then that means you have about
23GB of free memory left for caching.  With that much free memory, you
could probably handle 40GB of index with good performance, unless your
query volume is very very high.  If there are other programs running on
the server, then your free memory will go down.

- What kind of RAID are you using?  If it's RAID5 or RAID6, then
sustained write performance (indexing) will not be very good.  With
standard hard drives, RAID10 is best.  You could take the plunge and get
SSD instead, if there's enough money in the budget for it.

Comparing your server load with mine: My production Solr servers have 8
CPU cores and the load average is rarely above 1.5 even during busy
times.  Overall CPU utilization normally peaks at about 15 percent.

Thanks,
Shawn


Re: SolrCloud: IOException occured when talking to server at

Posted by heaven <ah...@gmail.com>.
UPD:
Forgot to confirm: we're using Solr 4.2.1 and will wait for Solr 4.3.1 or
4.4 as you advised.




Re: SolrCloud: IOException occured when talking to server at

Posted by heaven <ah...@gmail.com>.
Hi Shawn, thank you for the reply and for your advice; I will try all of it
today. Some of it is already applied, i.e. "Stop other software" and
"zkClientTimeout". The timeout is set to 60 seconds; I also reduced the
autowarm counts and increased the autoCommit interval to 5 minutes.
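For reference, the timeout is passed as a system property on startup, roughly
like this (assuming the stock solr.xml, where the <cores> element has
zkClientTimeout="${zkClientTimeout:15000}" and therefore picks the value up
from that property):

  sudo su solr -c "java -Xmx4096M -DzkClientTimeout=60000 \
    -DzkHost=localhost:2181,localhost:2182,localhost:2183 \
    -jar start.jar etc/jetty.xml etc/jetty-logging.xml &"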

The situation has improved and the number of errors has decreased; only a few
errors since yesterday: a few of
  "forwarding update to http://208.85.150.171:8090/solr/crm-prod/ failed - retrying ..."
and a few of
  "IOException occured when talking to server at: http://208.85.150.171:8081/solr/crm-prod"
on both shards, and only warnings on the replicas about
  "too many updates received since start - startingUpdates no longer overlaps with our currentUpdates"
and
  "Starting/stopping log replay".

But the load has also decreased.

When I talked about production-readiness I meant document loss: even though
SolrCloud uses a transaction log, that does not appear to guarantee the data
will be indexed. How did it happen that the log was replayed and truncated but
the docs aren't in the index? My point of view is that if Solr accepted a
request, the docs should appear in the index; if there are fewer resources than
required it can work slowly, crash, or stop working, but data should not be
lost, and the log should be in place and replayed after restart (as it is
supposed to be). There is no way my indexing queue workers can check for Solr
failures after getting a successful response from it.

I agree, there is too much software for this hardware :) The load average is now
under 16. Previously we had only a single Solr instance on this server and
decided to switch to SolrCloud to improve search speed. It really has become
much faster now, but it also became unreliable under high load. Perhaps it
really is because of some server configuration; I am checking now.

Thank you,
Alex




Re: SolrCloud: IOException occured when talking to server at

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/9/2013 7:31 AM, heaven wrote:
> Can confirm this led to data loss. I have 1,217,427 records in the database and
> only 1,217,216 indexed, which means that Solr gave a successful response and
> then did not add some documents to the index.
> 
> It seems like SolrCloud is not a production-ready solution; it would be good if
> there were a warning about that in the Solr wiki.

You've got some kind of underlying problem here.  Here are my guesses
about what that might be:

- An improperly configured Linux firewall and/or SELinux is enabled.
- The hardware is already overtaxed by other software.
- Your zkClientTimeout value is extremely small.
- Your GC pauses are large.
- You're running into an open file limit.

Here's what you could do to resolve each of these:

- Disable the firewall and selinux, reboot.
- Stop other software.
- The example zkClientTimeout is 15 seconds. Try 30-60.
- See http://wiki.apache.org/solr/SolrPerformanceProblems for some GC ideas.
- Increase the file and process limits.  For most versions of Linux, in
/etc/security/limits.conf:

solr         hard    nproc   6144
solr         soft    nproc   4096
solr         hard    nofile  65536
solr         soft    nofile  49152

These numbers should be sufficient for deployments considerably larger
than yours.
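One way to double-check that the new limits actually apply is to query them as
the solr user after logging in again, for example:

  # soft and hard open-file limits, then soft and hard process limits
  sudo su - solr -c "ulimit -Sn; ulimit -Hn; ulimit -Su; ulimit -Hu"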

SolrCloud is not only production ready, it's being used by many many
people for extremely large indexes.  My own SolrCloud deployment is
fairly small with only 1.5 million docs, but it's extremely stable.  I
also have a somewhat large (77 million docs) non-cloud deployment.

Are you running 4.2.1?  I feel fairly certain based on your screenshots
that you are not running 4.3, but I can't tell which version you are
running.  There are some bugs in the 4.3 release, a 4.3.1 will be
released soon.  If you had planned to upgrade, you should wait for 4.3.1
or 4.4.

NB, and something you might already know: When talking about
production-ready, you can't run everything on the same server.  You need
at least three - two of them can run Solr and zookeeper, and the third
runs zookeeper.  This single-server setup is fine for a proof-of-concept.

Thanks,
Shawn


Re: SolrCloud: IOException occured when talking to server at

Posted by heaven <ah...@gmail.com>.
Can confirm this led to data loss. I have 1,217,427 records in the database and
only 1,217,216 indexed, which means that Solr gave a successful response and
then did not add some documents to the index.

It seems like SolrCloud is not a production-ready solution; it would be good if
there were a warning about that in the Solr wiki.




Re: SolrCloud: IOException occured when talking to server at

Posted by heaven <ah...@gmail.com>.
Zookeeper log:
     1  *2013-05-09 03:03:07,177* [myid:3] - WARN 
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2183:Follower@118] - Got zxid
0x200000001 expected 0x1
     2  *2013-05-09 03:36:52,918* [myid:3] - ERROR
[CommitProcessor:3:NIOServerCnxn@180] - Unexpected Exception: 
     3  java.nio.channels.CancelledKeyException
     4          at
sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
     5          at
sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
     6          at
org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
     7          at
org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
     8          at
org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1113)
     9          at
org.apache.zookeeper.server.DataTree.setWatches(DataTree.java:1327)
    10          at
org.apache.zookeeper.server.ZKDatabase.setWatches(ZKDatabase.java:384)
    11          at
org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:304)
    12          at
org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
    13  *2013-05-09 03:36:52,928* [myid:3] - ERROR
[CommitProcessor:3:NIOServerCnxn@180] - Unexpected Exception: 
    14  java.nio.channels.CancelledKeyException
    15          at
sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
    16          at
sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
    17          at org.apache.zookeeper.server.NIOServerCnxn.s*2013-05-09
04:26:04,790* [myid:2] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2182:NIOServerCnxn@349] - caught end
of stream exception
    18  EndOfStreamException: Unable to read additional data from client
sessionid 0x23e88bdaf800001, likely client has closed socket
    19          at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
    20          at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    21          at java.lang.Thread.run(Thread.java:679)
    22  tionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
    23          at
sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
    24          at
org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
    25          at
org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
    26          at
org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1113)
    27          at
org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:120)
    28          at
org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:92)
    29          at
org.apache.zookeeper.server.DataTree.setData(DataTree.java:620)
    30          at
org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:807)
    31          at
org.apache.zookeeper.server.ZKDatabase.processTxn(ZKDatabase.java:329)
    32          at
org.apache.zookeeper.server.ZooKeeperServer.processTxn(ZooKeeperServer.java:965)
    33          at
org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:116)
    34          at
org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
    35  *2013-05-09 04:27:04,002* [myid:3] - ERROR
[CommitProcessor:3:NIOServerCnxn@180] - Unexpected Exception: 
    36  java.nio.channels.CancelledKeyException
    37          at
sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
    38          at
sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
    39          at
org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
    40          at
org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
    41          at
org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1113)
    42          at
org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:120)
    43          at
org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:92)
    44          at
org.apache.zookeeper.server.DataTree.deleteNode(DataTree.java:591)
    45          at
org.apache.zookeeper.server.DataTree.killSession(DataTree.java:966)
    46          at
org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:818)
    47          at
org.apache.zookeeper.server.ZKDatabase.processTxn(ZKDatabase.java:329)
    48          at
org.apache.zookeeper.server.ZooKeeperServer.processTxn(ZooKeeperServer.java:965)
    49          at
org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:116)
    50          at
org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
    51  *2013-05-09 04:36:00,485* [myid:3] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2183:NIOServerCnxn@349] - caught end
of stream exception
    52  EndOfStreamException: Unable to read additional data from client
sessionid 0x33e88bdc0a60008, likely client has closed socket
    53          at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
    54          at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    55          at java.lang.Thread.run(Thread.java:679)
    56  *2013-05-09 04:36:12,057* [myid:3] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2183:NIOServerCnxn@349] - caught end
of stream exception
    57  EndOfStreamException: Unable to read additional data from client
sessionid 0x33e88bdc0a60004, likely client has closed socket
    58          at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
    59          at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    60          at java.lang.Thread.run(Thread.java:679)
    61  *2013-05-09 03:03:07,176* [myid:1] - WARN 
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@118] - Got zxid
0x200000001 expected 0x1
    62  *2013-05-09 03:32:55,762* [myid:1] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end
of stream exception
    63  EndOfStreamException: Unable to read additional data from client
sessionid 0x13e88bdaf6e0004, likely client has closed socket
    64          at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
    65          at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    66          at java.lang.Thread.run(Thread.java:679)




Re: SolrCloud: IOException occured when talking to server at

Posted by heaven <ah...@gmail.com>.
Forgot to mention: Solr is 4.2 and ZooKeeper is 3.4.5.

I do not do manual commits; I prefer a softCommit every second and an autoCommit
every 3 minutes (roughly as in the sketch below).
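In solrconfig.xml that corresponds roughly to the following (a sketch; these
elements live inside <updateHandler>, and openSearcher=false is the usual
pairing with frequent soft commits):

  <autoCommit>
    <maxTime>180000</maxTime>        <!-- hard commit every 3 minutes -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>          <!-- soft commit every second -->
  </autoSoftCommit>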

The problem happened again: lots of errors in the logs and no description. The
cluster state changed; on shard 2 the replica became the leader and the former
leader went into recovering mode.
The errors happened when:
1. Shard1 tried to forward an update to Shard2, and this was the initial error
from Shard2:
"ClusterState says we are the leader, but locally we don't think so"
2. Shard2 forwarded the update to Replica2 and got:
"org.apache.solr.common.SolrException: Request says it is coming from
leader, but we are the leader"

Please see attachments

Topology:
<http://lucene.472066.n3.nabble.com/file/n4061839/Topology_new.png> 
Shard1:
<http://lucene.472066.n3.nabble.com/file/n4061839/Shard1_new.png> 
Replica1:
<http://lucene.472066.n3.nabble.com/file/n4061839/Replica1_new.png> 
Shard2:
<http://lucene.472066.n3.nabble.com/file/n4061839/Shard2_new.png> 
Replica2:
<http://lucene.472066.n3.nabble.com/file/n4061839/Replica2_new.png> 

All the errors in the screenshots appear whenever the server load gets higher.
As soon as I start a few more queue workers, the load goes up and the cluster
becomes unstable. So I have doubts about reliability. Could any docs be lost
during these errors, or should I just ignore them?

I understand that 4 Solr instances and 3 ZooKeepers could be too many for a
single machine, that there may not be enough resources, etc. But that still
should not cause anything like this. The worst case should be a timeout error
when Solr is not responding; my queue processors could handle that and resend
the request after a while.


