Posted to solr-user@lucene.apache.org by dmarini <da...@gmail.com> on 2013/08/21 06:52:23 UTC

4.3 Cloud looks good on the outside, but lots of errors in the logs

Hi,

I'm running a Solr 4.3 cloud in a 3-machine setup that has the following
configuration:
each machine is running 3 zookeepers on different ports
each machine is running a Jetty instance PER zookeeper.

Essentially, this gives us the ability to host 3 isolated clouds across the
3 machines: 3 shards per collection, with each machine hosting a shard and
replicas of the other 2 shards. The default timeout for the zookeeper
communication is 60 seconds. At any time I can go to any machine/port combo
and go to the "Cloud" view and everything looks peachy. All nodes are green
and each shard of each collection has an active leader (albeit they all
eventually have the SAME leader, which does stump me as to how it gets that
way, but one thing at a time).

Despite everything looking good, looking at the logs on any of the nodes is
enough to make me wonder how the cloud is functioning at all, with errors
like the following:

*Error while trying to recover.
core=MyCollection.shard2.replica:org.apache.solr.client.solrj.SolrServerException:
Timeout occured while waiting response from server at:
http://MYNODE2.MYDOMAIN.LOCAL:8983/solr
* (what's funny about this one is that MYNODE2:8983/solr responds with no
issue and appears healthy (all green), but these errors are coming in 5 to
10 at a time for MYNODE1 and MYNODE3.)

*org.apache.solr.common.SolrException: I was asked to wait on state
recovering for MYNODE3.MYDOMAIN.LOCAL:8983_solr but I still do not see the
requested state. I see state: active live:true* (this is from the leader
node MYNODE2:8983/solr's logs, viewed in the admin UI. Again, all appears OK
and reads/writes to the cloud are working.)

To top it all off, we have monitors that call out to the solr/admin/ping
handler for each node of each cloud. Normally these pings are very quick
(under 100ms), but at various points throughout the day the monitor's
60-second timeout is exceeded and it raises an alarm, only for the very
next ping to come back quickly again.
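
For reference, each monitor check is roughly equivalent to the following
(just a sketch; the core name and the 60-second cutoff here are placeholders
for whatever the monitor actually uses):

curl --max-time 60 "http://MYNODE1.MYDOMAIN.LOCAL:8983/solr/MyCollection/admin/ping?wt=json"

A healthy node answers with a status of "OK" almost immediately; the alarms
fire when even this simple request doesn't come back within the timeout.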

I've checked resource usage on the machines when I see these ping
slowdowns, but I'm not seeing any memory pressure (in terms of free
memory) or CPU thrashing. I'm at a loss for what can cause the system to
be so unstable and would appreciate any thoughts on the messages from the
log, or proposed ideas for the cause of the ping issue.

Also, to confirm: there is currently no way to force a leader election,
correct? With all of our collections inevitably rolling themselves over to
the same leader over time, I feel that performance will suffer, since all
writes will be trying to happen on the same machine when there are other
healthy machines that could lead the other shards and allow a better
distribution of requests.

Desperate for a stable cloud, thanks in advance for any help.

--Dave




Re: 4.3 Cloud looks good on the outside, but lots of errors in the logs

Posted by Shawn Heisey <so...@elyograg.org>.
On 8/21/2013 6:23 PM, dmarini wrote:
> Shawn, thanks for your reply. All of these suggestions look like good ideas
> and I will follow up. We are running Solr via the Jetty process on Windows,
> as well as all of our zookeepers on the same boxes as the clouds. The reason
> for this is that we're on EC2 servers, so it gets ultra expensive to have a
> 6-box setup just to have zookeepers on separate boxes from the Solr instances.

You can have zookeeper on the same host as Solr, that's no problem.  You 
should drop to just three total zookeepers, one per node, and use the 
chroot method to keep things separate.  You can probably run zookeeper 
with a max heap of 256MB, but it likely would never need more than 
512MB.  It doesn't use much memory at all.
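
If you're using the stock ZooKeeper scripts, the max heap can be set in
conf/java.env, which zkEnv.sh reads if it is present (sketch only; on
Windows you'd pass the same -Xmx flag on whatever java command line
launches ZooKeeper):

export JVMFLAGS="-Xmx256m"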

> Each of our Windows boxes has 8GB of RAM, with roughly 35-40% of it still
> seemingly free. Is there a tool or some way we can identify for certain if
> we're running into memory issues? I like your zookeeper idea and I didn't
> know that this was feasible. I will get a test bed set up that way soon. As
> for indexes, each cloud has multiple collections, but we're looking at the
> largest entire cloud (multiple indexes) being about 200MB; each collection
> is between 50 and 100MB and I don't see them getting much bigger than that
> per index (but I do see more indexes being added to the clouds).

With indexes that small, I would run each Jetty/Solr with a max heap of 
1GB.  With three of them per server, that will mean that Solr is using 
3GB of RAM, leaving 5GB for the OS disk cache.  You could probably bump 
that to 1.5 or 2GB and still be OK.
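
With the Jetty that ships in the Solr 4.x example directory, the heap is
just an -Xmx argument on the startup command, something like this (the port
and chroot values are placeholders):

java -Xmx1g -Djetty.port=8983 -DzkHost=server1:2181,server2:2181,server3:2181/cloud1 -jar start.jar

One instance per cloud, each with its own jetty.port and chroot, keeps the
three JVMs within the 3GB total mentioned above.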

> Is there a definitive advantage to running Solr on a Linux box
> over Windows? I need to be able to justify the time and effort it will take
> to get up to speed on an unfamiliar OS if we're going to go that route, but
> if there's a good enough reason I don't see why not.

Linux manages memory better than Windows, and ext4 is a much better 
filesystem than NTFS.  If you are familiar with Windows, there's nothing 
wrong with continuing to use it, except for the fact that you have to 
give Microsoft a few hundred bucks per machine for a server OS when you 
take it into production.  You can run Linux for free.

> -- Would it be helpful to have the zookeeper ensemble on a different disk
> drive than the clouds?
> -- Can the chattiness of all of the replication and zookeeper communication
> for multiple clouds/collections cause any of these issues? (We do have some
> collections that are in constant flux with 1-5 requests each second, which
> we gather up and send to Solr in batches of 250 documents or a 10-second
> flush.)

It never hurts to have things separated so they are on different disks, 
but SolrCloud will put hardly any load on zookeeper, so I don't think it 
matters much.  It is Solr itself that will take that load.

Thanks,
Shawn


Re: 4.3 Cloud looks good on the outside, but lots of errors in the logs

Posted by dmarini <da...@gmail.com>.
Shawn,

Thanks for your reply. All of these suggestions look like good ideas and I
will follow up. We are running Solr via the Jetty process on Windows, as
well as all of our zookeepers on the same boxes as the clouds. The reason
for this is that we're on EC2 servers, so it gets ultra expensive to have a
6-box setup just to have zookeepers on separate boxes from the Solr
instances.

Each of our Windows boxes has 8GB of RAM, with roughly 35-40% of it still
seemingly free. Is there a tool or some way we can identify for certain if
we're running into memory issues? I like your zookeeper idea and I didn't
know that this was feasible. I will get a test bed set up that way soon. As
for indexes, each cloud has multiple collections, but we're looking at the
largest entire cloud (multiple indexes) being about 200MB; each collection
is between 50 and 100MB and I don't see them getting much bigger than that
per index (but I do see more indexes being added to the clouds).

A few more questions:
-- Is there a definitive advantage to running Solr on a Linux box over
Windows? I need to be able to justify the time and effort it will take to
get up to speed on an unfamiliar OS if we're going to go that route, but if
there's a good enough reason I don't see why not.
-- Would it be helpful to have the zookeeper ensemble on a different disk
drive than the clouds?
-- Can the chattiness of all of the replication and zookeeper communication
for multiple clouds/collections cause any of these issues? (We do have some
collections that are in constant flux with 1-5 requests each second, which
we gather up and send to Solr in batches of 250 documents or a 10-second
flush.)

Thanks again for your reply and suggestions, they are much appreciated.

--Dave




Re: 4.3 Cloud looks good on the outside, but lots of errors in the logs

Posted by Shawn Heisey <so...@elyograg.org>.
On 8/20/2013 10:52 PM, dmarini wrote:
> I'm running a solr 4.3 cloud in a 3 machine setup that has the following
> configuration:
> each machine is running 3 zookeepers on different ports
> each machine is running a jetty instance PER zookeeper..
>
> Essentially, this gives us the ability to host 3 isolated clouds across the
> 3 machines. 3 shards per collection with each machine hosting a shard and
> replicas of the other 2 shards. default timeout for the zookeeper
> communication is 60 seconds. At any time I can go to any machine/port combo
> and go to the "Cloud" view and everything looks peachy. All nodes are green
> and each shard of each collection has an active leader (albeit they all
> eventually have the SAME leader, which does stump me as to how it gets that
> way but one thing at a time).
>
> Despite everything looking good, looking at the logs on any of the nodes is
> enough to make me wonder how the cloud is functioning at all, with errors
> like the following:
>
> *Error while trying to recover.
> core=MyCollection.shard2.replica:org.apache.solr.client.solrj.SolrServerException:
> Timeout occured while waiting response from server at:
> http://MYNODE2.MYDOMAIN.LOCAL:8983/solr
> * (what's funny about this one is that MYNODE2:8983/solr responds with no
> issue and appears healthy (all green), but these errors are coming in 5 to
> 10 at a time for MYNODE1 and MYNODE3.)
>
> *org.apache.solr.common.SolrException: I was asked to wait on state
> recovering for MYNODE3.MYDOMAIN.LOCAL:8983_solr but I still do not see the
> requested state. I see state: active live:true* (this is from the leader
> node MYNODE2:8983/solr's logs, viewed in the admin UI. Again, all appears OK
> and reads/writes to the cloud are working.)
>
> To top it all off, we have monitors that call out to the solr/admin/ping
> handler for each node of each cloud. Normally these pings are very quick
> (under 100ms), but at various points throughout the day the monitor's
> 60-second timeout is exceeded and it raises an alarm, only for the very
> next ping to come back quickly again.
>
> I've checked resource usage on the machines when I see these ping
> slowdowns, but I'm not seeing any memory pressure (in terms of free
> memory) or CPU thrashing. I'm at a loss for what can cause the system to
> be so unstable and would appreciate any thoughts on the messages from the
> log, or proposed ideas for the cause of the ping issue.
>
> Also, to confirm: there is currently no way to force a leader election,
> correct? With all of our collections inevitably rolling themselves over to
> the same leader over time, I feel that performance will suffer, since all
> writes will be trying to happen on the same machine when there are other
> healthy machines that could lead the other shards and allow a better
> distribution of requests.

I am guessing that you are running into resource starvation, mostly
memory.  You've probably got a lot of slow garbage collections, and you
might even be going to swap (UNIX) or the pagefile (Windows) from
allocating too much memory to Solr instances.  You may find that you
need to add memory to the machines.  I wouldn't try what you are doing
without at least 16GB per server, and depending on how big those indexes
are, I might want 32 or 64GB.

The first thing I recommend is getting rid of all those extra
zookeepers.  You can run many clouds on one three-node zookeeper
ensemble.  You just need to have zkHost parameters like the following,
where "/test1" gets replaced by a different chroot value for each cloud.
You do not need the chroot on every server in the list, just once at
the end:

-DzkHost=server1:2181,server2:2181,server3:2181/test1
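
One caveat, if I remember right: the chroot znode may need to exist before
Solr will connect with it. The zkcli script that ships in
example/cloud-scripts can create it, along these lines (hostnames and the
/test1 path are placeholders):

zkcli.sh -zkhost server1:2181,server2:2181,server3:2181 -cmd makepath /test1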

The next thing is to size the max heap appropriately for each of your
Solr instances.  The total amount of RAM allocated to all the JVMs -
zookeeper and Solr - must not exceed the total memory in the server, and
you should have RAM left over for OS disk caching as well.  Unless your
max heap is below 1GB, you'll also want to tune your garbage collection.

Included in the following wiki page are some good tips on memory and
garbage collection tuning:

http://wiki.apache.org/solr/SolrPerformanceProblems
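
A quick way to see whether GC pauses line up with those ping timeouts is to
turn on GC logging when you start each Solr instance, for example (these
flags are a starting point, not a tuned configuration):

java -Xmx1g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gclog.txt -jar start.jar

Long pauses in that log at the times the monitor alarms would point at
heap/GC tuning rather than at Solr itself.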

Thanks,
Shawn