Posted to solr-user@lucene.apache.org by Joe Gresock <jg...@gmail.com> on 2014/05/31 15:34:46 UTC

Uneven shard heap usage

Hi folks,

I'm trying to figure out why one shard of an evenly-distributed 3-shard
cluster would suddenly start running out of heap space, after 9+ months of
stable performance.  We're using the "!" delimiter in our ids to distribute
the documents, and indeed the disk sizes of our shards are very similar
(31-32GB on disk per replica).
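
(With the default compositeId router, the prefix before the "!" is hashed to
choose the shard, so all documents that share a prefix are co-located; the
shard comes from the hash of the prefix, not from its literal first character.
A minimal SolrJ sketch of indexing with such an id -- the zkHost, collection
and field names here are made up:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexWithRoutingPrefix {
        public static void main(String[] args) throws Exception {
            // Hypothetical ZooKeeper ensemble and collection name.
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("mycollection");

            SolrInputDocument doc = new SolrInputDocument();
            // <10-digit identifier>!<epoch timestamp>!<format>
            doc.addField("id", "5/12345678!130000025603!TEXT");
            doc.addField("body_txt_en", "... document text ...");  // hypothetical field

            solr.add(doc);      // the shard is picked from the hash of the id prefix
            solr.commit();
            solr.shutdown();
        }
    }
)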

Our setup is:
9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so
basically 2 physical CPUs), 24GB disk
3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever).  We
reserve 10g heap for each solr instance.
Also 3 zookeeper VMs, which are very stable

Since the troubles started, we've been monitoring all 9 with jvisualvm, and
shards 2 and 3 keep a steady amount of heap space reserved, always having
horizontal lines (with some minor gc).  They're using 4-5GB heap, and when
we force gc using jvisualvm, they drop to 1GB usage.  Shard 1, however,
quickly has a steep slope, and eventually has concurrent mode failures in
the gc logs, requiring us to restart the instances when they can no longer
do anything but gc.

We've tried ruling out physical host problems by moving all 3 Shard 1
replicas to different hosts that are underutilized, however we still get
the same problem.  We'll still be working on ruling out infrastructure
issues, but I wanted to ask the questions here in case it makes sense:

* Does it make sense that all the replicas on one shard of a cluster would
have heap problems, when the other shard replicas do not, assuming a fairly
even data distribution?
* One thing we changed recently was to make all of our fields stored,
instead of only half of them.  This was to support atomic updates.  Can
stored fields, even though lazily loaded, cause problems like this?

Thanks for any input,
Joe





-- 
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.    *-Philippians 4:12-13*

Re: Uneven shard heap usage

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
You'll get very different performance profiles from the various
highlighters (we saw up to a 15x speed difference in our queries on
average by changing highlighters). The default one re-analyzes the
entire stored document in memory and is the slowest, but provides the
most faithful match to the query.  It can be sped up by limiting its
scope to a truncated portion of your stored field; that limit defaults
to a reasonable value -- perhaps you overrode hl.maxAnalyzedChars?  We've
gotten results we like better (faster, less memory, decent snippets)
from FastVectorHighlighter and more recently from PostingsHighlighter,
but they each impose tradeoffs you should consider, including possibly
the need to reindex depending on your current setup.
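
To make that concrete, a minimal SolrJ sketch of the knobs involved (the field
name is made up, and hl.useFastVectorHighlighter additionally requires the
field to be indexed with term vectors, positions and offsets):

    import org.apache.solr.client.solrj.SolrQuery;

    public class HighlightTuning {
        // Build a query that caps how much of the stored field the default
        // highlighter re-analyzes, instead of the whole (possibly huge) value.
        static SolrQuery highlightQuery(String userQuery) {
            SolrQuery q = new SolrQuery(userQuery);
            q.setHighlight(true);
            q.addHighlightField("body_txt_en");      // hypothetical field
            q.setHighlightSnippets(3);
            q.set("hl.maxAnalyzedChars", "51200");   // cap the re-analysis
            // Or switch highlighters entirely (needs termVectors/positions/offsets):
            // q.set("hl.useFastVectorHighlighter", "true");
            return q;
        }
    }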

-Mike

On 6/2/2014 8:38 PM, Joe Gresock wrote:
> So, we were finally able to reproduce the heap overload behavior with a
> stress test of a query that highlighted the large fields we found.  We'll
> have to play around with the highlighting settings, but for now we've
> disabled the highlighting on this query (which is a canned query that
> doesn't even really need highlighting), and our cluster is back to stellar
> performance.
>
> What we observed while debugging this was quite interesting:
> * We removed all of the documents with field values > 2 MB in Shard 1
> (which was causing the problems)
> * When we enabled user query access again, Shard 2 fairly quickly ran out
> of heap space, but Shard 1 was stable!
> * We then removed all documents from Shard 2 with the same criteria.  When
> running a stress test, Shard 3 ran out of heap space and Shard 1 and 2 were
> stable
>
> At this point, our stability issues are gone, but we're left wondering how
> best to re-ingest these documents.  Currently we have this field truncated
> to 2 MB, which is not ideal.
>
> It seems like there's a balance between allowing more of this field to be
> searchable vs. providing the most highlighted results.
>
> I wonder if anyone can recommend some of the relevant highlighting
> parameters that might be able to allow us to turn highlighting back on for
> this field.  I'd say probably only 100-200 documents have field values as
> large as this.
>
> Joe


Re: Uneven shard heap usage

Posted by Joe Gresock <jg...@gmail.com>.
So, we were finally able to reproduce the heap overload behavior with a
stress test of a query that highlighted the large fields we found.  We'll
have to play around with the highlighting settings, but for now we've
disabled the highlighting on this query (which is a canned query that
doesn't even really need highlighting), and our cluster is back to stellar
performance.

What we observed while debugging this was quite interesting:
* We removed all of the documents with field values > 2 MB in Shard 1
(which was causing the problems)
* When we enabled user query access again, Shard 2 fairly quickly ran out
of heap space, but Shard 1 was stable!
* We then removed all documents from Shard 2 with the same criteria.  When
running a stress test, Shard 3 ran out of heap space and Shards 1 and 2 were
stable.

At this point, our stability issues are gone, but we're left wondering how
best to re-ingest these documents.  Currently we have this field truncated
to 2 MB, which is not ideal.

It seems like there's a balance between allowing more of this field to be
searchable vs. providing the most highlighted results.

I wonder if anyone can recommend some of the relevant highlighting
parameters that might be able to allow us to turn highlighting back on for
this field.  I'd say probably only 100-200 documents have field values as
large as this.

Joe


On Mon, Jun 2, 2014 at 10:44 AM, Erick Erickson <er...@gmail.com>
wrote:

> Joe:
>
> One thing to add, if you're returning that doc (or perhaps even some
> fields, this bit is still something of a mystery to me) then the whole 180M
> may be being decompressed. Since 4.1 the stored fields have been compressed
> to disk by default. That said, this is only true if the docs in question
> are returned as part of the result set. Adding &distrib=false to the URL
> and pinging only that shard should let you focus on only this shard....
>
> Best,
> Erick



-- 
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.    *-Philippians 4:12-13*

Re: Uneven shard heap usage

Posted by Erick Erickson <er...@gmail.com>.
Joe:

One thing to add, if you're returning that doc (or perhaps even some
fields, this bit is still something of a mystery to me) then the whole 180M
may be being decompressed. Since 4.1 the stored fields have been compressed
to disk by default. That said, this is only true if the docs in question
are returned as part of the result set. Adding &distrib=false to the URL
and pinging only that shard should let you focus on only this shard....
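
A minimal SolrJ sketch of that kind of single-shard probe -- the URL and query
are made up, and restricting fl also keeps the big stored field from being
decompressed just to be returned:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class ProbeOneShard {
        public static void main(String[] args) throws Exception {
            // Point directly at one replica of the suspect shard (hypothetical URL).
            HttpSolrServer shard1 = new HttpSolrServer("http://shard1-host:8983/solr/collection1");

            SolrQuery q = new SolrQuery("some suspect query");
            q.set("distrib", "false");    // don't fan out to the other shards
            q.setFields("id", "score");   // avoid returning the huge stored field
            System.out.println("hits: " + shard1.query(q).getResults().getNumFound());
            shard1.shutdown();
        }
    }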

Best,
Erick


On Mon, Jun 2, 2014 at 4:27 AM, Michael Sokolov <
msokolov@safaribooksonline.com> wrote:

> Joe - there shouldn't really be a problem *indexing* these fields:
> remember that all the terms are spread across the index, so there is really
> no storage difference between one 180MB document and 180 1 MB documents
> from an indexing perspective.
>
> Making the field "stored" is more likely to lead to a problem, although
> it's still a bit of a mystery exactly what's going on. Do they need to be
> stored? For example: do you highlight the entire field? Still 180MB
> shouldn't necessarily lead to heap space problems, but one thing you could
> play with is reducing the cache sizes on that node: if you had very large
> (in terms of numbers of documents) caches, and a lot of the documents were
> big, that could lead to heap problems.  But this is all just guessing.
>
> -Mike
>

Re: Uneven shard heap usage

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
Joe - there shouldn't really be a problem *indexing* these fields: 
remember that all the terms are spread across the index, so there is 
really no storage difference between one 180MB document and 180 1 MB 
documents from an indexing perspective.

Making the field "stored" is more likely to lead to a problem, although 
it's still a bit of a mystery exactly what's going on. Do they need to 
be stored? For example: do you highlight the entire field? Still 180MB 
shouldn't necessarily lead to heap space problems, but one thing you 
could play with is reducing the cache sizes on that node: if you had 
very large (in terms of numbers of documents) caches, and a lot of the 
documents were big, that could lead to heap problems.  But this is all 
just guessing.
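
A rough back-of-the-envelope check using the numbers mentioned elsewhere in
this thread: a documentCache sized at 128 entries could, in the worst case,
hold 128 fully materialized stored documents, and at 180MB apiece that would
be roughly 128 x 180MB = ~22GB -- more than twice the 10GB heap.  Even a
handful of such documents sitting in the cache at once accounts for a few GB.
That is the worst case, not necessarily what is happening, but it shows how a
few huge stored documents can dominate the heap.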

-Mike


On 6/2/2014 6:13 AM, Joe Gresock wrote:
> And the followup question would be.. if some of these documents are
> legitimately this large (they really do have that much text), is there a
> good way to still allow that to be searchable and not explode our index?
>   These would be "text_en" type fields.
>
>
> On Mon, Jun 2, 2014 at 6:09 AM, Joe Gresock <jg...@gmail.com> wrote:
>
>> So, we're definitely running into some very large documents (180MB, for
>> example).  I haven't run the analysis on the other 2 shards yet, but this
>> could definitely be our problem.
>>
>> Is there any conventional wisdom on a good "maximum size" for your indexed
>> fields?  Of course it will vary for each system, but assuming a heap of
>> 10g, does anyone have past experience in limiting their field sizes?
>>
>> Our caches are set to 128.
>>
>>
>> On Sun, Jun 1, 2014 at 8:32 AM, Joe Gresock <jg...@gmail.com> wrote:
>>
>>> These are some good ideas.  The "huge document" idea could add up, since
>>> I think the shard1 index is a little larger (32.5GB on disk instead of
>>> 31.9GB), so it is possible there's one or 2 really big ones that are
>>> getting loaded into memory there.
>>>
>>> Btw, I did find an article on the Solr document routing (
>>> http://searchhub.org/2013/06/13/solr-cloud-document-routing/), so I
>>> don't think that our ID structure is a problem in itself.  But I will
>>> follow up on the large document idea.
>>>
>>> I used this article (
>>> https://support.datastax.com/entries/38367716-Solr-Configuration-Best-Practices-and-Troubleshooting-Tips)
>>> to find the index heap and disk usage:
>>> http://localhost:8983/solr/admin/cores?action=STATUS&memory=true
>>>
>>> Though looking at the data index directory on disk basically said the
>>> same thing.
>>>
>>> I am pretty sure we're using the smart round-robining client, but I will
>>> double check on Monday.
>>>
>>> We have been using CollectD and graphite to monitor our VMs, as well as
>>> jvisualvm, though we haven't tried SPM.
>>>
>>> Thanks for all the ideas, guys.
>>>
>>>
>>> On Sat, May 31, 2014 at 11:54 PM, Otis Gospodnetic <
>>> otis.gospodnetic@gmail.com> wrote:
>>>
>>>> Hi Joe,
>>>>
>>>> Are you/how are you sure all 3 shards are roughly the same size?  Can you
>>>> share what you run/see that shows you that?
>>>>
>>>> Are you sure queries are evenly distributed?  Something like SPM
>>>> <http://sematext.com/spm/> should give you insight into that.
>>>>
>>>> How big are your caches?
>>>>
>>>> Otis
>>>> --
>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>>
>>>> On Sat, May 31, 2014 at 5:54 PM, Joe Gresock <jg...@gmail.com> wrote:
>>>>
>>>>> Interesting thought about the routing.  Our document ids are in 3 parts:
>>>>>
>>>>> <10-digit identifier>!<epoch timestamp>!<format>
>>>>>
>>>>> e.g., 5/12345678!130000025603!TEXT
>>>>>
>>>>> Each object has an identifier, and there may be multiple versions of the
>>>>> object, hence the timestamp.  We like to be able to pull back all of the
>>>>> versions of an object at once, hence the routing scheme.
>>>>>
>>>>> The nature of the identifier is that a great many of them begin with a
>>>>> certain number.  I'd be interested to know more about the hashing scheme
>>>>> used for the document routing.  Perhaps the first character gives it more
>>>>> weight as to which shard it lands in?
>>>>>
>>>>> It seems strange that certain of the most highly-searched documents would
>>>>> happen to fall on this shard, but you may be onto something.   We'll scrape
>>>>> through some non-distributed queries and see what we can find.
>>>>>
>>>>>
>>>>> On Sat, May 31, 2014 at 1:47 PM, Erick Erickson <erickerickson@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> This is very weird.
>>>>>>
>>>>>> Are you sure that all the Java versions are identical? And all the JVM
>>>>>> parameters are the same? Grasping at straws here.
>>>>>>
>>>>>> More grasping at straws: I'm a little suspicious that you are using
>>>>>> routing. You say that the indexes are about the same size, but is it
>>>>>> possible that your routing is somehow loading the problem shard
>>>>>> abnormally? By that I mean somehow the documents on that shard are
>>>>>> different, or have a drastically higher number of hits than the other
>>>>>> shards?
>>>>>>
>>>>>> You can fire queries at shards with &distrib=false and NOT have it go to
>>>>>> other shards, perhaps if you can isolate the problem queries that might
>>>>>> shed some light on the problem.
>>>>>>
>>>>>> Best
>>>>>> Erick@Baffled.com
>>>>>>
>>>>>>
>>>>>> On Sat, May 31, 2014 at 8:33 AM, Joe Gresock <jg...@gmail.com> wrote:
>>>>>>
>>>>>>> It has taken as little as 2 minutes to happen the last time we tried.
>>>>>>> It basically happens upon high query load (peak user hours during the
>>>>>>> day).  When we reduce functionality by disabling most searches, it
>>>>>>> stabilizes.  So it really is only on high query load.  Our ingest rate
>>>>>>> is fairly low.
>>>>>>>
>>>>>>> It happens no matter how many nodes in the shard are up.
>>>>>>>
>>>>>>> Joe
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <jack@basetechnology.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> When you restart, how long does it take to hit the problem? And how
>>>>>>>> much query or update activity is happening in that time? Is there any
>>>>>>>> other activity showing up in the log?
>>>>>>>>
>>>>>>>> If you bring up only a single node in that problematic shard, do you
>>>>>>>> still see the problem?
>>>>>>>>
>>>>>>>> -- Jack Krupansky


Re: Uneven shard heap usage

Posted by Joe Gresock <jg...@gmail.com>.
And the follow-up question would be: if some of these documents are
legitimately this large (they really do have that much text), is there a
good way to still allow that to be searchable and not explode our index?
 These would be "text_en" type fields.
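
One possible pattern, just a sketch and with made-up field names: keep the
full text in an indexed-but-not-stored field so it stays searchable, and store
only a bounded copy in a separate field for display and highlighting.  The
schema would need both fields defined that way, and note that an indexed-only
field would not survive an atomic update, since the document gets rebuilt from
its stored fields:

    import org.apache.solr.common.SolrInputDocument;

    public class LargeDocFields {
        // Hypothetical schema: "body_search" is indexed=true stored=false,
        // "body_display" is stored=true and is what gets returned/highlighted.
        static SolrInputDocument build(String id, String fullText) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("body_search", fullText);                  // searchable, not stored
            int cap = Math.min(fullText.length(), 2 * 1024 * 1024); // ~2MB kept for display
            doc.addField("body_display", fullText.substring(0, cap));
            return doc;
        }
    }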


On Mon, Jun 2, 2014 at 6:09 AM, Joe Gresock <jg...@gmail.com> wrote:

> So, we're definitely running into some very large documents (180MB, for
> example).  I haven't run the analysis on the other 2 shards yet, but this
> could definitely be our problem.
>
> Is there any conventional wisdom on a good "maximum size" for your indexed
> fields?  Of course it will vary for each system, but assuming a heap of
> 10g, does anyone have past experience in limiting their field sizes?
>
> Our caches are set to 128.
>
>
> On Sun, Jun 1, 2014 at 8:32 AM, Joe Gresock <jg...@gmail.com> wrote:
>
>> These are some good ideas.  The "huge document" idea could add up, since
>> I think the shard1 index is a little larger (32.5GB on disk instead of
>> 31.9GB), so it is possible there's one or 2 really big ones that are
>> getting loaded into memory there.
>>
>> Btw, I did find an article on the Solr document routing (
>> http://searchhub.org/2013/06/13/solr-cloud-document-routing/), so I
>> don't think that our ID structure is a problem in itself.  But I will
>> follow up on the large document idea.
>>
>> I used this article (
>> https://support.datastax.com/entries/38367716-Solr-Configuration-Best-Practices-and-Troubleshooting-Tips)
>> to find the index heap and disk usage:
>> http://localhost:8983/solr/admin/cores?action=STATUS&memory=true
>>
>> Though looking at the data index directory on disk basically said the
>> same thing.
>>
>> I am pretty sure we're using the smart round-robining client, but I will
>> double check on Monday.
>>
>> We have been using CollectD and graphite to monitor our VMs, as well as
>> jvisualvm, though we haven't tried SPM.
>>
>> Thanks for all the ideas, guys.
>>
>>
>> On Sat, May 31, 2014 at 11:54 PM, Otis Gospodnetic <
>> otis.gospodnetic@gmail.com> wrote:
>>
>>> Hi Joe,
>>>
>>> Are you/how are you sure all 3 shards are roughly the same size?  Can you
>>> share what you run/see that shows you that?
>>>
>>> Are you sure queries are evenly distributed?  Something like SPM
>>> <http://sematext.com/spm/> should give you insight into that.
>>>
>>> How big are your caches?
>>>
>>> Otis
>>> --
>>> Performance Monitoring * Log Analytics * Search Analytics
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>>
>>> On Sat, May 31, 2014 at 5:54 PM, Joe Gresock <jg...@gmail.com> wrote:
>>>
>>> > Interesting thought about the routing.  Our document ids are in 3
>>> parts:
>>> >
>>> > <10-digit identifier>!<epoch timestamp>!<format>
>>> >
>>> > e.g., 5/12345678!130000025603!TEXT
>>> >
>>> > Each object has an identifier, and there may be multiple versions of
>>> the
>>> > object, hence the timestamp.  We like to be able to pull back all of
>>> the
>>> > versions of an object at once, hence the routing scheme.
>>> >
>>> > The nature of the identifier is that a great many of them begin with a
>>> > certain number.  I'd be interested to know more about the hashing
>>> scheme
>>> > used for the document routing.  Perhaps the first character gives it
>>> more
>>> > weight as to which shard it lands in?
>>> >
>>> > It seems strange that certain of the most highly-searched documents
>>> would
>>> > happen to fall on this shard, but you may be onto something.   We'll
>>> scrape
>>> > through some non-distributed queries and see what we can find.
>>> >
>>> >
>>> > On Sat, May 31, 2014 at 1:47 PM, Erick Erickson <
>>> erickerickson@gmail.com>
>>> > wrote:
>>> >
>>> > > This is very weird.
>>> > >
>>> > > Are you sure that all the Java versions are identical? And all the
>>> JVM
>>> > > parameters are the same? Grasping at straws here.
>>> > >
>>> > > More grasping at straws: I'm a little suspicious that you are using
>>> > > routing. You say that the indexes are about the same size, but is it
>>> is
>>> > > possible that your routing is somehow loading the problem shard
>>> > abnormally?
>>> > > By that I mean somehow the documents on that shard are different, or
>>> > have a
>>> > > drastically higher number of hits than the other shards?
>>> > >
>>> > > You can fire queries at shards with &distrib=false and NOT have it
>>> go to
>>> > > other shards, perhaps if you can isolate the problem queries that
>>> might
>>> > > shed some light on the problem.
>>> > >
>>> > >
>>> > > Best
>>> > > Erick@Baffled.com
>>> > >
>>> > >
>>> > > On Sat, May 31, 2014 at 8:33 AM, Joe Gresock <jg...@gmail.com>
>>> wrote:
>>> > >
>>> > > > It has taken as little as 2 minutes to happen the last time we
>>> tried.
>>> >  It
>>> > > > basically happens upon high query load (peak user hours during the
>>> > day).
>>> > > >  When we reduce functionality by disabling most searches, it
>>> > stabilizes.
>>> > > >  So it really is only on high query load.  Our ingest rate is
>>> fairly
>>> > low.
>>> > > >
>>> > > > It happens no matter how many nodes in the shard are up.
>>> > > >
>>> > > >
>>> > > > Joe
>>> > > >
>>> > > >
>>> > > > On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <
>>> > > jack@basetechnology.com>
>>> > > > wrote:
>>> > > >
>>> > > > > When you restart, how long does it take it hit the problem? And
>>> how
>>> > > much
>>> > > > > query or update activity is happening in that time? Is there any
>>> > other
>>> > > > > activity showing up in the log?
>>> > > > >
>>> > > > > If you bring up only a single node in that problematic shard, do
>>> you
>>> > > > still
>>> > > > > see the problem?
>>> > > > >
>>> > > > > -- Jack Krupansky
>>> > > > >
>>> > > > > -----Original Message----- From: Joe Gresock
>>> > > > > Sent: Saturday, May 31, 2014 9:34 AM
>>> > > > > To: solr-user@lucene.apache.org
>>> > > > > Subject: Uneven shard heap usage
>>> > > > >
>>> > > > >
>>> > > > > Hi folks,
>>> > > > >
>>> > > > > I'm trying to figure out why one shard of an evenly-distributed
>>> > 3-shard
>>> > > > > cluster would suddenly start running out of heap space, after 9+
>>> > months
>>> > > > of
>>> > > > > stable performance.  We're using the "!" delimiter in our ids to
>>> > > > distribute
>>> > > > > the documents, and indeed the disk size of our shards are very
>>> > similar
>>> > > > > (31-32GB on disk per replica).
>>> > > > >
>>> > > > > Our setup is:
>>> > > > > 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio,
>>> so
>>> > > > > basically 2 physical CPUs), 24GB disk
>>> > > > > 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever).
>>>  We
>>> > > > > reserve 10g heap for each solr instance.
>>> > > > > Also 3 zookeeper VMs, which are very stable
>>> > > > >
>>> > > > > Since the troubles started, we've been monitoring all 9 with
>>> > jvisualvm,
>>> > > > and
>>> > > > > shards 2 and 3 keep a steady amount of heap space reserved,
>>> always
>>> > > having
>>> > > > > horizontal lines (with some minor gc).  They're using 4-5GB
>>> heap, and
>>> > > > when
>>> > > > > we force gc using jvisualvm, they drop to 1GB usage.  Shard 1,
>>> > however,
>>> > > > > quickly has a steep slope, and eventually has concurrent mode
>>> > failures
>>> > > in
>>> > > > > the gc logs, requiring us to restart the instances when they can
>>> no
>>> > > > longer
>>> > > > > do anything but gc.
>>> > > > >
>>> > > > > We've tried ruling out physical host problems by moving all 3
>>> Shard 1
>>> > > > > replicas to different hosts that are underutilized, however we
>>> still
>>> > > get
>>> > > > > the same problem.  We'll still be working on ruling out
>>> > infrastructure
>>> > > > > issues, but I wanted to ask the questions here in case it makes
>>> > sense:
>>> > > > >
>>> > > > > * Does it make sense that all the replicas on one shard of a
>>> cluster
>>> > > > would
>>> > > > > have heap problems, when the other shard replicas do not,
>>> assuming a
>>> > > > fairly
>>> > > > > even data distribution?
>>> > > > > * One thing we changed recently was to make all of our fields
>>> stored,
>>> > > > > instead of only half of them.  This was to support atomic
>>> updates.
>>> >  Can
>>> > > > > stored fields, even though lazily loaded, cause problems like
>>> this?
>>> > > > >
>>> > > > > Thanks for any input,
>>> > > > > Joe
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > --
>>> > > > > I know what it is to be in need, and I know what it is to have
>>> > plenty.
>>> > >  I
>>> > > > > have learned the secret of being content in any and every
>>> situation,
>>> > > > > whether well fed or hungry, whether living in plenty or in want.
>>>  I
>>> > can
>>> > > > do
>>> > > > > all this through him who gives me strength.    *-Philippians
>>> 4:12-13*
>>> > > > >
>>> > > >
>>> > > >
>>> > > >
>>> > > > --
>>> > > > I know what it is to be in need, and I know what it is to have
>>> plenty.
>>> >  I
>>> > > > have learned the secret of being content in any and every
>>> situation,
>>> > > > whether well fed or hungry, whether living in plenty or in want.
>>>  I can
>>> > > do
>>> > > > all this through him who gives me strength.    *-Philippians
>>> 4:12-13*
>>> > > >
>>> > >
>>> >
>>> >
>>> >
>>> > --
>>> > I know what it is to be in need, and I know what it is to have plenty.
>>>  I
>>> > have learned the secret of being content in any and every situation,
>>> > whether well fed or hungry, whether living in plenty or in want.  I
>>> can do
>>> > all this through him who gives me strength.    *-Philippians 4:12-13*
>>> >
>>>
>>
>>
>>
>> --
>> I know what it is to be in need, and I know what it is to have plenty.  I
>> have learned the secret of being content in any and every situation,
>> whether well fed or hungry, whether living in plenty or in want.  I can
>> do all this through him who gives me strength.    *-Philippians 4:12-13*
>>
>
>
>
> --
> I know what it is to be in need, and I know what it is to have plenty.  I
> have learned the secret of being content in any and every situation,
> whether well fed or hungry, whether living in plenty or in want.  I can
> do all this through him who gives me strength.    *-Philippians 4:12-13*
>



-- 
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.    *-Philippians 4:12-13*

Re: Uneven shard heap usage

Posted by Joe Gresock <jg...@gmail.com>.
So, we're definitely running into some very large documents (180MB, for
example).  I haven't run the analysis on the other 2 shards yet, but this
could well be our problem.

Is there any conventional wisdom on a good "maximum size" for your indexed
fields?  Of course it will vary for each system, but assuming a heap of
10g, does anyone have past experience in limiting their field sizes?

Our caches are set to 128.
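
For anyone curious, a scan along these lines should spot the outliers (the
"body" field name, core URL, and 10,000,000-character threshold below are
placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class LargeDocScan {
        public static void main(String[] args) throws Exception {
            // Point at one replica of the shard you want to inspect.
            HttpSolrServer core = new HttpSolrServer(
                    "http://localhost:8983/solr/collection1_shard1_replica1");

            SolrQuery q = new SolrQuery("*:*");
            q.set("distrib", "false");      // keep the query on this core only
            q.setFields("id", "body");      // "body" is a placeholder field name
            q.setRows(500);

            for (int start = 0; ; start += 500) {
                q.setStart(start);
                QueryResponse rsp = core.query(q);
                if (rsp.getResults().isEmpty()) {
                    break;
                }
                for (SolrDocument doc : rsp.getResults()) {
                    String body = (String) doc.getFieldValue("body");
                    int len = (body == null) ? 0 : body.length();
                    if (len > 10_000_000) {  // flag anything over ~10M chars
                        System.out.println(doc.getFieldValue("id") + " -> " + len + " chars");
                    }
                }
            }
        }
    }

Pulling big stored fields back like this is itself heavy, so it's a one-off
diagnostic rather than something to run during peak hours.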


On Sun, Jun 1, 2014 at 8:32 AM, Joe Gresock <jg...@gmail.com> wrote:

> These are some good ideas.  The "huge document" idea could add up, since I
> think the shard1 index is a little larger (32.5GB on disk instead of
> 31.9GB), so it is possible there's one or 2 really big ones that are
> getting loaded into memory there.
>
> Btw, I did find an article on the Solr document routing (
> http://searchhub.org/2013/06/13/solr-cloud-document-routing/), so I don't
> think that our ID structure is a problem in itself.  But I will follow up
> on the large document idea.
>
> I used this article (
> https://support.datastax.com/entries/38367716-Solr-Configuration-Best-Practices-and-Troubleshooting-Tips)
> to find the index heap and disk usage:
> http://localhost:8983/solr/admin/cores?action=STATUS&memory=true
>
> Though looking at the data index directory on disk basically said the same
> thing.
>
> I am pretty sure we're using the smart round-robining client, but I will
> double check on Monday.
>
> We have been using CollectD and graphite to monitor our VMs, as well as
> jvisualvm, though we haven't tried SPM.
>
> Thanks for all the ideas, guys.
>
>
> On Sat, May 31, 2014 at 11:54 PM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com> wrote:
>
>> Hi Joe,
>>
>> Are you/how are you sure all 3 shards are roughly the same size?  Can you
>> share what you run/see that shows you that?
>>
>> Are you sure queries are evenly distributed?  Something like SPM
>> <http://sematext.com/spm/> should give you insight into that.
>>
>> How big are your caches?
>>
>> Otis
>> --
>> Performance Monitoring * Log Analytics * Search Analytics
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>> On Sat, May 31, 2014 at 5:54 PM, Joe Gresock <jg...@gmail.com> wrote:
>>
>> > Interesting thought about the routing.  Our document ids are in 3 parts:
>> >
>> > <10-digit identifier>!<epoch timestamp>!<format>
>> >
>> > e.g., 5/12345678!130000025603!TEXT
>> >
>> > Each object has an identifier, and there may be multiple versions of the
>> > object, hence the timestamp.  We like to be able to pull back all of the
>> > versions of an object at once, hence the routing scheme.
>> >
>> > The nature of the identifier is that a great many of them begin with a
>> > certain number.  I'd be interested to know more about the hashing scheme
>> > used for the document routing.  Perhaps the first character gives it
>> more
>> > weight as to which shard it lands in?
>> >
>> > It seems strange that certain of the most highly-searched documents
>> would
>> > happen to fall on this shard, but you may be onto something.   We'll
>> scrape
>> > through some non-distributed queries and see what we can find.
>> >
>> >
>> > On Sat, May 31, 2014 at 1:47 PM, Erick Erickson <
>> erickerickson@gmail.com>
>> > wrote:
>> >
>> > > This is very weird.
>> > >
>> > > Are you sure that all the Java versions are identical? And all the JVM
>> > > parameters are the same? Grasping at straws here.
>> > >
>> > > More grasping at straws: I'm a little suspicious that you are using
>> > > routing. You say that the indexes are about the same size, but is it
>> is
>> > > possible that your routing is somehow loading the problem shard
>> > abnormally?
>> > > By that I mean somehow the documents on that shard are different, or
>> > have a
>> > > drastically higher number of hits than the other shards?
>> > >
>> > > You can fire queries at shards with &distrib=false and NOT have it go
>> to
>> > > other shards, perhaps if you can isolate the problem queries that
>> might
>> > > shed some light on the problem.
>> > >
>> > >
>> > > Best
>> > > Erick@Baffled.com
>> > >
>> > >
>> > > On Sat, May 31, 2014 at 8:33 AM, Joe Gresock <jg...@gmail.com>
>> wrote:
>> > >
>> > > > It has taken as little as 2 minutes to happen the last time we
>> tried.
>> >  It
>> > > > basically happens upon high query load (peak user hours during the
>> > day).
>> > > >  When we reduce functionality by disabling most searches, it
>> > stabilizes.
>> > > >  So it really is only on high query load.  Our ingest rate is fairly
>> > low.
>> > > >
>> > > > It happens no matter how many nodes in the shard are up.
>> > > >
>> > > >
>> > > > Joe
>> > > >
>> > > >
>> > > > On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <
>> > > jack@basetechnology.com>
>> > > > wrote:
>> > > >
>> > > > > When you restart, how long does it take it hit the problem? And
>> how
>> > > much
>> > > > > query or update activity is happening in that time? Is there any
>> > other
>> > > > > activity showing up in the log?
>> > > > >
>> > > > > If you bring up only a single node in that problematic shard, do
>> you
>> > > > still
>> > > > > see the problem?
>> > > > >
>> > > > > -- Jack Krupansky
>> > > > >
>> > > > > -----Original Message----- From: Joe Gresock
>> > > > > Sent: Saturday, May 31, 2014 9:34 AM
>> > > > > To: solr-user@lucene.apache.org
>> > > > > Subject: Uneven shard heap usage
>> > > > >
>> > > > >
>> > > > > Hi folks,
>> > > > >
>> > > > > I'm trying to figure out why one shard of an evenly-distributed
>> > 3-shard
>> > > > > cluster would suddenly start running out of heap space, after 9+
>> > months
>> > > > of
>> > > > > stable performance.  We're using the "!" delimiter in our ids to
>> > > > distribute
>> > > > > the documents, and indeed the disk size of our shards are very
>> > similar
>> > > > > (31-32GB on disk per replica).
>> > > > >
>> > > > > Our setup is:
>> > > > > 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio,
>> so
>> > > > > basically 2 physical CPUs), 24GB disk
>> > > > > 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever).
>>  We
>> > > > > reserve 10g heap for each solr instance.
>> > > > > Also 3 zookeeper VMs, which are very stable
>> > > > >
>> > > > > Since the troubles started, we've been monitoring all 9 with
>> > jvisualvm,
>> > > > and
>> > > > > shards 2 and 3 keep a steady amount of heap space reserved, always
>> > > having
>> > > > > horizontal lines (with some minor gc).  They're using 4-5GB heap,
>> and
>> > > > when
>> > > > > we force gc using jvisualvm, they drop to 1GB usage.  Shard 1,
>> > however,
>> > > > > quickly has a steep slope, and eventually has concurrent mode
>> > failures
>> > > in
>> > > > > the gc logs, requiring us to restart the instances when they can
>> no
>> > > > longer
>> > > > > do anything but gc.
>> > > > >
>> > > > > We've tried ruling out physical host problems by moving all 3
>> Shard 1
>> > > > > replicas to different hosts that are underutilized, however we
>> still
>> > > get
>> > > > > the same problem.  We'll still be working on ruling out
>> > infrastructure
>> > > > > issues, but I wanted to ask the questions here in case it makes
>> > sense:
>> > > > >
>> > > > > * Does it make sense that all the replicas on one shard of a
>> cluster
>> > > > would
>> > > > > have heap problems, when the other shard replicas do not,
>> assuming a
>> > > > fairly
>> > > > > even data distribution?
>> > > > > * One thing we changed recently was to make all of our fields
>> stored,
>> > > > > instead of only half of them.  This was to support atomic updates.
>> >  Can
>> > > > > stored fields, even though lazily loaded, cause problems like
>> this?
>> > > > >
>> > > > > Thanks for any input,
>> > > > > Joe
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > I know what it is to be in need, and I know what it is to have
>> > plenty.
>> > >  I
>> > > > > have learned the secret of being content in any and every
>> situation,
>> > > > > whether well fed or hungry, whether living in plenty or in want.
>>  I
>> > can
>> > > > do
>> > > > > all this through him who gives me strength.    *-Philippians
>> 4:12-13*
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > I know what it is to be in need, and I know what it is to have
>> plenty.
>> >  I
>> > > > have learned the secret of being content in any and every situation,
>> > > > whether well fed or hungry, whether living in plenty or in want.  I
>> can
>> > > do
>> > > > all this through him who gives me strength.    *-Philippians
>> 4:12-13*
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > I know what it is to be in need, and I know what it is to have plenty.
>>  I
>> > have learned the secret of being content in any and every situation,
>> > whether well fed or hungry, whether living in plenty or in want.  I can
>> do
>> > all this through him who gives me strength.    *-Philippians 4:12-13*
>> >
>>
>
>
>
> --
> I know what it is to be in need, and I know what it is to have plenty.  I
> have learned the secret of being content in any and every situation,
> whether well fed or hungry, whether living in plenty or in want.  I can
> do all this through him who gives me strength.    *-Philippians 4:12-13*
>



-- 
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.    *-Philippians 4:12-13*

Re: Uneven shard heap usage

Posted by Joe Gresock <jg...@gmail.com>.
These are some good ideas.  The "huge document" idea could add up, since I
think the shard1 index is a little larger (32.5GB on disk instead of
31.9GB), so it is possible there are one or two really big ones that are
getting loaded into memory there.

Btw, I did find an article on the Solr document routing (
http://searchhub.org/2013/06/13/solr-cloud-document-routing/), so I don't
think that our ID structure is a problem in itself.  But I will follow up
on the large document idea.

I used this article (
https://support.datastax.com/entries/38367716-Solr-Configuration-Best-Practices-and-Troubleshooting-Tips)
to find the index heap and disk usage:
http://localhost:8983/solr/admin/cores?action=STATUS&memory=true

Though looking at the data index directory on disk basically said the same
thing.

I am pretty sure we're using the smart round-robining client, but I will
double check on Monday.

We have been using CollectD and graphite to monitor our VMs, as well as
jvisualvm, though we haven't tried SPM.

Thanks for all the ideas, guys.


On Sat, May 31, 2014 at 11:54 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Hi Joe,
>
> Are you/how are you sure all 3 shards are roughly the same size?  Can you
> share what you run/see that shows you that?
>
> Are you sure queries are evenly distributed?  Something like SPM
> <http://sematext.com/spm/> should give you insight into that.
>
> How big are your caches?
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Sat, May 31, 2014 at 5:54 PM, Joe Gresock <jg...@gmail.com> wrote:
>
> > Interesting thought about the routing.  Our document ids are in 3 parts:
> >
> > <10-digit identifier>!<epoch timestamp>!<format>
> >
> > e.g., 5/12345678!130000025603!TEXT
> >
> > Each object has an identifier, and there may be multiple versions of the
> > object, hence the timestamp.  We like to be able to pull back all of the
> > versions of an object at once, hence the routing scheme.
> >
> > The nature of the identifier is that a great many of them begin with a
> > certain number.  I'd be interested to know more about the hashing scheme
> > used for the document routing.  Perhaps the first character gives it more
> > weight as to which shard it lands in?
> >
> > It seems strange that certain of the most highly-searched documents would
> > happen to fall on this shard, but you may be onto something.   We'll
> scrape
> > through some non-distributed queries and see what we can find.
> >
> >
> > On Sat, May 31, 2014 at 1:47 PM, Erick Erickson <erickerickson@gmail.com
> >
> > wrote:
> >
> > > This is very weird.
> > >
> > > Are you sure that all the Java versions are identical? And all the JVM
> > > parameters are the same? Grasping at straws here.
> > >
> > > More grasping at straws: I'm a little suspicious that you are using
> > > routing. You say that the indexes are about the same size, but is it is
> > > possible that your routing is somehow loading the problem shard
> > abnormally?
> > > By that I mean somehow the documents on that shard are different, or
> > have a
> > > drastically higher number of hits than the other shards?
> > >
> > > You can fire queries at shards with &distrib=false and NOT have it go
> to
> > > other shards, perhaps if you can isolate the problem queries that might
> > > shed some light on the problem.
> > >
> > >
> > > Best
> > > Erick@Baffled.com
> > >
> > >
> > > On Sat, May 31, 2014 at 8:33 AM, Joe Gresock <jg...@gmail.com>
> wrote:
> > >
> > > > It has taken as little as 2 minutes to happen the last time we tried.
> >  It
> > > > basically happens upon high query load (peak user hours during the
> > day).
> > > >  When we reduce functionality by disabling most searches, it
> > stabilizes.
> > > >  So it really is only on high query load.  Our ingest rate is fairly
> > low.
> > > >
> > > > It happens no matter how many nodes in the shard are up.
> > > >
> > > >
> > > > Joe
> > > >
> > > >
> > > > On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <
> > > jack@basetechnology.com>
> > > > wrote:
> > > >
> > > > > When you restart, how long does it take it hit the problem? And how
> > > much
> > > > > query or update activity is happening in that time? Is there any
> > other
> > > > > activity showing up in the log?
> > > > >
> > > > > If you bring up only a single node in that problematic shard, do
> you
> > > > still
> > > > > see the problem?
> > > > >
> > > > > -- Jack Krupansky
> > > > >
> > > > > -----Original Message----- From: Joe Gresock
> > > > > Sent: Saturday, May 31, 2014 9:34 AM
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Uneven shard heap usage
> > > > >
> > > > >
> > > > > Hi folks,
> > > > >
> > > > > I'm trying to figure out why one shard of an evenly-distributed
> > 3-shard
> > > > > cluster would suddenly start running out of heap space, after 9+
> > months
> > > > of
> > > > > stable performance.  We're using the "!" delimiter in our ids to
> > > > distribute
> > > > > the documents, and indeed the disk size of our shards are very
> > similar
> > > > > (31-32GB on disk per replica).
> > > > >
> > > > > Our setup is:
> > > > > 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so
> > > > > basically 2 physical CPUs), 24GB disk
> > > > > 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever).
>  We
> > > > > reserve 10g heap for each solr instance.
> > > > > Also 3 zookeeper VMs, which are very stable
> > > > >
> > > > > Since the troubles started, we've been monitoring all 9 with
> > jvisualvm,
> > > > and
> > > > > shards 2 and 3 keep a steady amount of heap space reserved, always
> > > having
> > > > > horizontal lines (with some minor gc).  They're using 4-5GB heap,
> and
> > > > when
> > > > > we force gc using jvisualvm, they drop to 1GB usage.  Shard 1,
> > however,
> > > > > quickly has a steep slope, and eventually has concurrent mode
> > failures
> > > in
> > > > > the gc logs, requiring us to restart the instances when they can no
> > > > longer
> > > > > do anything but gc.
> > > > >
> > > > > We've tried ruling out physical host problems by moving all 3
> Shard 1
> > > > > replicas to different hosts that are underutilized, however we
> still
> > > get
> > > > > the same problem.  We'll still be working on ruling out
> > infrastructure
> > > > > issues, but I wanted to ask the questions here in case it makes
> > sense:
> > > > >
> > > > > * Does it make sense that all the replicas on one shard of a
> cluster
> > > > would
> > > > > have heap problems, when the other shard replicas do not, assuming
> a
> > > > fairly
> > > > > even data distribution?
> > > > > * One thing we changed recently was to make all of our fields
> stored,
> > > > > instead of only half of them.  This was to support atomic updates.
> >  Can
> > > > > stored fields, even though lazily loaded, cause problems like this?
> > > > >
> > > > > Thanks for any input,
> > > > > Joe
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > I know what it is to be in need, and I know what it is to have
> > plenty.
> > >  I
> > > > > have learned the secret of being content in any and every
> situation,
> > > > > whether well fed or hungry, whether living in plenty or in want.  I
> > can
> > > > do
> > > > > all this through him who gives me strength.    *-Philippians
> 4:12-13*
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > I know what it is to be in need, and I know what it is to have
> plenty.
> >  I
> > > > have learned the secret of being content in any and every situation,
> > > > whether well fed or hungry, whether living in plenty or in want.  I
> can
> > > do
> > > > all this through him who gives me strength.    *-Philippians 4:12-13*
> > > >
> > >
> >
> >
> >
> > --
> > I know what it is to be in need, and I know what it is to have plenty.  I
> > have learned the secret of being content in any and every situation,
> > whether well fed or hungry, whether living in plenty or in want.  I can
> do
> > all this through him who gives me strength.    *-Philippians 4:12-13*
> >
>



-- 
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.    *-Philippians 4:12-13*

Re: Uneven shard heap usage

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi Joe,

Are you sure all 3 shards are roughly the same size (and how are you sure)?
Can you share what you run/see that shows you that?

Are you sure queries are evenly distributed?  Something like SPM
<http://sematext.com/spm/> should give you insight into that.

How big are your caches?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Sat, May 31, 2014 at 5:54 PM, Joe Gresock <jg...@gmail.com> wrote:

> Interesting thought about the routing.  Our document ids are in 3 parts:
>
> <10-digit identifier>!<epoch timestamp>!<format>
>
> e.g., 5/12345678!130000025603!TEXT
>
> Each object has an identifier, and there may be multiple versions of the
> object, hence the timestamp.  We like to be able to pull back all of the
> versions of an object at once, hence the routing scheme.
>
> The nature of the identifier is that a great many of them begin with a
> certain number.  I'd be interested to know more about the hashing scheme
> used for the document routing.  Perhaps the first character gives it more
> weight as to which shard it lands in?
>
> It seems strange that certain of the most highly-searched documents would
> happen to fall on this shard, but you may be onto something.   We'll scrape
> through some non-distributed queries and see what we can find.
>
>
> On Sat, May 31, 2014 at 1:47 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
> > This is very weird.
> >
> > Are you sure that all the Java versions are identical? And all the JVM
> > parameters are the same? Grasping at straws here.
> >
> > More grasping at straws: I'm a little suspicious that you are using
> > routing. You say that the indexes are about the same size, but is it is
> > possible that your routing is somehow loading the problem shard
> abnormally?
> > By that I mean somehow the documents on that shard are different, or
> have a
> > drastically higher number of hits than the other shards?
> >
> > You can fire queries at shards with &distrib=false and NOT have it go to
> > other shards, perhaps if you can isolate the problem queries that might
> > shed some light on the problem.
> >
> >
> > Best
> > Erick@Baffled.com
> >
> >
> > On Sat, May 31, 2014 at 8:33 AM, Joe Gresock <jg...@gmail.com> wrote:
> >
> > > It has taken as little as 2 minutes to happen the last time we tried.
>  It
> > > basically happens upon high query load (peak user hours during the
> day).
> > >  When we reduce functionality by disabling most searches, it
> stabilizes.
> > >  So it really is only on high query load.  Our ingest rate is fairly
> low.
> > >
> > > It happens no matter how many nodes in the shard are up.
> > >
> > >
> > > Joe
> > >
> > >
> > > On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <
> > jack@basetechnology.com>
> > > wrote:
> > >
> > > > When you restart, how long does it take it hit the problem? And how
> > much
> > > > query or update activity is happening in that time? Is there any
> other
> > > > activity showing up in the log?
> > > >
> > > > If you bring up only a single node in that problematic shard, do you
> > > still
> > > > see the problem?
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > -----Original Message----- From: Joe Gresock
> > > > Sent: Saturday, May 31, 2014 9:34 AM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Uneven shard heap usage
> > > >
> > > >
> > > > Hi folks,
> > > >
> > > > I'm trying to figure out why one shard of an evenly-distributed
> 3-shard
> > > > cluster would suddenly start running out of heap space, after 9+
> months
> > > of
> > > > stable performance.  We're using the "!" delimiter in our ids to
> > > distribute
> > > > the documents, and indeed the disk size of our shards are very
> similar
> > > > (31-32GB on disk per replica).
> > > >
> > > > Our setup is:
> > > > 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so
> > > > basically 2 physical CPUs), 24GB disk
> > > > 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever).  We
> > > > reserve 10g heap for each solr instance.
> > > > Also 3 zookeeper VMs, which are very stable
> > > >
> > > > Since the troubles started, we've been monitoring all 9 with
> jvisualvm,
> > > and
> > > > shards 2 and 3 keep a steady amount of heap space reserved, always
> > having
> > > > horizontal lines (with some minor gc).  They're using 4-5GB heap, and
> > > when
> > > > we force gc using jvisualvm, they drop to 1GB usage.  Shard 1,
> however,
> > > > quickly has a steep slope, and eventually has concurrent mode
> failures
> > in
> > > > the gc logs, requiring us to restart the instances when they can no
> > > longer
> > > > do anything but gc.
> > > >
> > > > We've tried ruling out physical host problems by moving all 3 Shard 1
> > > > replicas to different hosts that are underutilized, however we still
> > get
> > > > the same problem.  We'll still be working on ruling out
> infrastructure
> > > > issues, but I wanted to ask the questions here in case it makes
> sense:
> > > >
> > > > * Does it make sense that all the replicas on one shard of a cluster
> > > would
> > > > have heap problems, when the other shard replicas do not, assuming a
> > > fairly
> > > > even data distribution?
> > > > * One thing we changed recently was to make all of our fields stored,
> > > > instead of only half of them.  This was to support atomic updates.
>  Can
> > > > stored fields, even though lazily loaded, cause problems like this?
> > > >
> > > > Thanks for any input,
> > > > Joe
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > I know what it is to be in need, and I know what it is to have
> plenty.
> >  I
> > > > have learned the secret of being content in any and every situation,
> > > > whether well fed or hungry, whether living in plenty or in want.  I
> can
> > > do
> > > > all this through him who gives me strength.    *-Philippians 4:12-13*
> > > >
> > >
> > >
> > >
> > > --
> > > I know what it is to be in need, and I know what it is to have plenty.
>  I
> > > have learned the secret of being content in any and every situation,
> > > whether well fed or hungry, whether living in plenty or in want.  I can
> > do
> > > all this through him who gives me strength.    *-Philippians 4:12-13*
> > >
> >
>
>
>
> --
> I know what it is to be in need, and I know what it is to have plenty.  I
> have learned the secret of being content in any and every situation,
> whether well fed or hungry, whether living in plenty or in want.  I can do
> all this through him who gives me strength.    *-Philippians 4:12-13*
>

Re: Uneven shard heap usage

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
Is it possible that all your requests are routed to that single shard?  
I.e. you are not using the smart client that round-robins requests?  I 
think that could cause all of the merging of results to be done on a 
single node.
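
By "smart client" I mean SolrJ's ZooKeeper-aware client, something along
these lines (the zkHost string and collection name are placeholders):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class Clients {

        // ZooKeeper-aware client: watches cluster state and spreads requests
        // across live replicas instead of pinning everything to one node.
        static SolrServer smartClient() throws Exception {
            CloudSolrServer cloud = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            cloud.setDefaultCollection("collection1");
            return cloud;
        }

        // A fixed HTTP client sends every request to the one URL it was given,
        // and that node then coordinates all of the distributed merging.
        static SolrServer fixedClient() {
            return new HttpSolrServer("http://solr-node1:8983/solr/collection1");
        }
    }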

Also - is it possible you have a "bad" document in that shard? Like one 
that has a GB stored field or something?

-Mike

On 5/31/2014 5:54 PM, Joe Gresock wrote:
> Interesting thought about the routing.  Our document ids are in 3 parts:
>
> <10-digit identifier>!<epoch timestamp>!<format>
>
> e.g., 5/12345678!130000025603!TEXT
>
> Each object has an identifier, and there may be multiple versions of the
> object, hence the timestamp.  We like to be able to pull back all of the
> versions of an object at once, hence the routing scheme.
>
> The nature of the identifier is that a great many of them begin with a
> certain number.  I'd be interested to know more about the hashing scheme
> used for the document routing.  Perhaps the first character gives it more
> weight as to which shard it lands in?
>
> It seems strange that certain of the most highly-searched documents would
> happen to fall on this shard, but you may be onto something.   We'll scrape
> through some non-distributed queries and see what we can find.
>
>
> On Sat, May 31, 2014 at 1:47 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> This is very weird.
>>
>> Are you sure that all the Java versions are identical? And all the JVM
>> parameters are the same? Grasping at straws here.
>>
>> More grasping at straws: I'm a little suspicious that you are using
>> routing. You say that the indexes are about the same size, but is it is
>> possible that your routing is somehow loading the problem shard abnormally?
>> By that I mean somehow the documents on that shard are different, or have a
>> drastically higher number of hits than the other shards?
>>
>> You can fire queries at shards with &distrib=false and NOT have it go to
>> other shards, perhaps if you can isolate the problem queries that might
>> shed some light on the problem.
>>
>>
>> Best
>> Erick@Baffled.com
>>
>>
>> On Sat, May 31, 2014 at 8:33 AM, Joe Gresock <jg...@gmail.com> wrote:
>>
>>> It has taken as little as 2 minutes to happen the last time we tried.  It
>>> basically happens upon high query load (peak user hours during the day).
>>>   When we reduce functionality by disabling most searches, it stabilizes.
>>>   So it really is only on high query load.  Our ingest rate is fairly low.
>>>
>>> It happens no matter how many nodes in the shard are up.
>>>
>>>
>>> Joe
>>>
>>>
>>> On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <
>> jack@basetechnology.com>
>>> wrote:
>>>
>>>> When you restart, how long does it take it hit the problem? And how
>> much
>>>> query or update activity is happening in that time? Is there any other
>>>> activity showing up in the log?
>>>>
>>>> If you bring up only a single node in that problematic shard, do you
>>> still
>>>> see the problem?
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> -----Original Message----- From: Joe Gresock
>>>> Sent: Saturday, May 31, 2014 9:34 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Uneven shard heap usage
>>>>
>>>>
>>>> Hi folks,
>>>>
>>>> I'm trying to figure out why one shard of an evenly-distributed 3-shard
>>>> cluster would suddenly start running out of heap space, after 9+ months
>>> of
>>>> stable performance.  We're using the "!" delimiter in our ids to
>>> distribute
>>>> the documents, and indeed the disk size of our shards are very similar
>>>> (31-32GB on disk per replica).
>>>>
>>>> Our setup is:
>>>> 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so
>>>> basically 2 physical CPUs), 24GB disk
>>>> 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever).  We
>>>> reserve 10g heap for each solr instance.
>>>> Also 3 zookeeper VMs, which are very stable
>>>>
>>>> Since the troubles started, we've been monitoring all 9 with jvisualvm,
>>> and
>>>> shards 2 and 3 keep a steady amount of heap space reserved, always
>> having
>>>> horizontal lines (with some minor gc).  They're using 4-5GB heap, and
>>> when
>>>> we force gc using jvisualvm, they drop to 1GB usage.  Shard 1, however,
>>>> quickly has a steep slope, and eventually has concurrent mode failures
>> in
>>>> the gc logs, requiring us to restart the instances when they can no
>>> longer
>>>> do anything but gc.
>>>>
>>>> We've tried ruling out physical host problems by moving all 3 Shard 1
>>>> replicas to different hosts that are underutilized, however we still
>> get
>>>> the same problem.  We'll still be working on ruling out infrastructure
>>>> issues, but I wanted to ask the questions here in case it makes sense:
>>>>
>>>> * Does it make sense that all the replicas on one shard of a cluster
>>> would
>>>> have heap problems, when the other shard replicas do not, assuming a
>>> fairly
>>>> even data distribution?
>>>> * One thing we changed recently was to make all of our fields stored,
>>>> instead of only half of them.  This was to support atomic updates.  Can
>>>> stored fields, even though lazily loaded, cause problems like this?
>>>>
>>>> Thanks for any input,
>>>> Joe
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> I know what it is to be in need, and I know what it is to have plenty.
>>   I
>>>> have learned the secret of being content in any and every situation,
>>>> whether well fed or hungry, whether living in plenty or in want.  I can
>>> do
>>>> all this through him who gives me strength.    *-Philippians 4:12-13*
>>>>
>>>
>>>
>>> --
>>> I know what it is to be in need, and I know what it is to have plenty.  I
>>> have learned the secret of being content in any and every situation,
>>> whether well fed or hungry, whether living in plenty or in want.  I can
>> do
>>> all this through him who gives me strength.    *-Philippians 4:12-13*
>>>
>
>


Re: Uneven shard heap usage

Posted by Joe Gresock <jg...@gmail.com>.
Interesting thought about the routing.  Our document ids are in 3 parts:

<10-digit identifier>!<epoch timestamp>!<format>

e.g., 5/12345678!130000025603!TEXT

Each object has an identifier, and there may be multiple versions of the
object, hence the timestamp.  We like to be able to pull back all of the
versions of an object at once, hence the routing scheme.

The nature of the identifier is that a great many of them begin with a
certain number.  I'd be interested to know more about the hashing scheme
used for the document routing.  Perhaps the first character gives it more
weight as to which shard it lands in?
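
If the scheme is a straight hash of the two halves of the composite id (which
is my rough, unverified understanding), it would look conceptually like the
sketch below, with String.hashCode() purely as a stand-in for whatever hash
Solr actually uses:

    public class CompositeIdSketch {

        // Conceptual only: Solr uses its own MurmurHash3, not String.hashCode(),
        // so the real bucket boundaries will differ.
        static int compositeHash(String id) {
            int bang = id.indexOf('!');
            String shardKey = (bang < 0) ? id : id.substring(0, bang);
            String rest = (bang < 0) ? "" : id.substring(bang + 1);
            int upper = shardKey.hashCode() & 0xFFFF0000;  // prefix picks the hash-ring region
            int lower = rest.hashCode() & 0x0000FFFF;      // remainder spreads docs within it
            return upper | lower;
        }

        public static void main(String[] args) {
            // Same prefix => same upper 16 bits => same shard, whatever follows the "!".
            System.out.println(Integer.toHexString(compositeHash("1234567890!130000025603!TEXT")));
            System.out.println(Integer.toHexString(compositeHash("1234567890!130000025700!TEXT")));
        }
    }

If it does work that way, a shared leading character alone shouldn't skew
placement; only ids sharing the entire prefix before the first "!" would be
forced onto the same shard.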

It seems strange that certain of the most highly-searched documents would
happen to fall on this shard, but you may be onto something.   We'll scrape
through some non-distributed queries and see what we can find.


On Sat, May 31, 2014 at 1:47 PM, Erick Erickson <er...@gmail.com>
wrote:

> This is very weird.
>
> Are you sure that all the Java versions are identical? And all the JVM
> parameters are the same? Grasping at straws here.
>
> More grasping at straws: I'm a little suspicious that you are using
> routing. You say that the indexes are about the same size, but is it is
> possible that your routing is somehow loading the problem shard abnormally?
> By that I mean somehow the documents on that shard are different, or have a
> drastically higher number of hits than the other shards?
>
> You can fire queries at shards with &distrib=false and NOT have it go to
> other shards, perhaps if you can isolate the problem queries that might
> shed some light on the problem.
>
>
> Best
> Erick@Baffled.com
>
>
> On Sat, May 31, 2014 at 8:33 AM, Joe Gresock <jg...@gmail.com> wrote:
>
> > It has taken as little as 2 minutes to happen the last time we tried.  It
> > basically happens upon high query load (peak user hours during the day).
> >  When we reduce functionality by disabling most searches, it stabilizes.
> >  So it really is only on high query load.  Our ingest rate is fairly low.
> >
> > It happens no matter how many nodes in the shard are up.
> >
> >
> > Joe
> >
> >
> > On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <
> jack@basetechnology.com>
> > wrote:
> >
> > > When you restart, how long does it take it hit the problem? And how
> much
> > > query or update activity is happening in that time? Is there any other
> > > activity showing up in the log?
> > >
> > > If you bring up only a single node in that problematic shard, do you
> > still
> > > see the problem?
> > >
> > > -- Jack Krupansky
> > >
> > > -----Original Message----- From: Joe Gresock
> > > Sent: Saturday, May 31, 2014 9:34 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Uneven shard heap usage
> > >
> > >
> > > Hi folks,
> > >
> > > I'm trying to figure out why one shard of an evenly-distributed 3-shard
> > > cluster would suddenly start running out of heap space, after 9+ months
> > of
> > > stable performance.  We're using the "!" delimiter in our ids to
> > distribute
> > > the documents, and indeed the disk size of our shards are very similar
> > > (31-32GB on disk per replica).
> > >
> > > Our setup is:
> > > 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so
> > > basically 2 physical CPUs), 24GB disk
> > > 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever).  We
> > > reserve 10g heap for each solr instance.
> > > Also 3 zookeeper VMs, which are very stable
> > >
> > > Since the troubles started, we've been monitoring all 9 with jvisualvm,
> > and
> > > shards 2 and 3 keep a steady amount of heap space reserved, always
> having
> > > horizontal lines (with some minor gc).  They're using 4-5GB heap, and
> > when
> > > we force gc using jvisualvm, they drop to 1GB usage.  Shard 1, however,
> > > quickly has a steep slope, and eventually has concurrent mode failures
> in
> > > the gc logs, requiring us to restart the instances when they can no
> > longer
> > > do anything but gc.
> > >
> > > We've tried ruling out physical host problems by moving all 3 Shard 1
> > > replicas to different hosts that are underutilized, however we still
> get
> > > the same problem.  We'll still be working on ruling out infrastructure
> > > issues, but I wanted to ask the questions here in case it makes sense:
> > >
> > > * Does it make sense that all the replicas on one shard of a cluster
> > would
> > > have heap problems, when the other shard replicas do not, assuming a
> > fairly
> > > even data distribution?
> > > * One thing we changed recently was to make all of our fields stored,
> > > instead of only half of them.  This was to support atomic updates.  Can
> > > stored fields, even though lazily loaded, cause problems like this?
> > >
> > > Thanks for any input,
> > > Joe
> > >
> > >
> > >
> > >
> > >
> > > --
> > > I know what it is to be in need, and I know what it is to have plenty.
>  I
> > > have learned the secret of being content in any and every situation,
> > > whether well fed or hungry, whether living in plenty or in want.  I can
> > do
> > > all this through him who gives me strength.    *-Philippians 4:12-13*
> > >
> >
> >
> >
> > --
> > I know what it is to be in need, and I know what it is to have plenty.  I
> > have learned the secret of being content in any and every situation,
> > whether well fed or hungry, whether living in plenty or in want.  I can
> do
> > all this through him who gives me strength.    *-Philippians 4:12-13*
> >
>



-- 
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.    *-Philippians 4:12-13*

Re: Uneven shard heap usage

Posted by Erick Erickson <er...@gmail.com>.
This is very weird.

Are you sure that all the Java versions are identical? And all the JVM
parameters are the same? Grasping at straws here.

More grasping at straws: I'm a little suspicious that you are using
routing. You say that the indexes are about the same size, but is it
possible that your routing is somehow loading the problem shard abnormally?
By that I mean somehow the documents on that shard are different, or have a
drastically higher number of hits than the other shards?

You can fire queries at individual shards with &distrib=false so they do NOT
go to other shards; if you can isolate the problem queries that way, it
might shed some light on the problem.
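
From SolrJ that would be roughly the following (core URL and query are
placeholders; appending &distrib=false to a plain HTTP query in the browser
works just as well):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SingleShardQuery {
        public static void main(String[] args) throws Exception {
            // Hit one replica of the suspect shard directly.
            HttpSolrServer shard1 = new HttpSolrServer(
                    "http://solr-node1:8983/solr/collection1_shard1_replica1");

            SolrQuery q = new SolrQuery("*:*");   // substitute a suspect production query
            q.set("distrib", "false");            // don't fan out to the other shards

            QueryResponse rsp = shard1.query(q);
            System.out.println("hits=" + rsp.getResults().getNumFound()
                    + " qtime=" + rsp.getQTime() + "ms");
        }
    }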


Best
Erick@Baffled.com


On Sat, May 31, 2014 at 8:33 AM, Joe Gresock <jg...@gmail.com> wrote:

> It has taken as little as 2 minutes to happen the last time we tried.  It
> basically happens upon high query load (peak user hours during the day).
>  When we reduce functionality by disabling most searches, it stabilizes.
>  So it really is only on high query load.  Our ingest rate is fairly low.
>
> It happens no matter how many nodes in the shard are up.
>
>
> Joe
>
>
> On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <ja...@basetechnology.com>
> wrote:
>
> > When you restart, how long does it take it hit the problem? And how much
> > query or update activity is happening in that time? Is there any other
> > activity showing up in the log?
> >
> > If you bring up only a single node in that problematic shard, do you
> still
> > see the problem?
> >
> > -- Jack Krupansky
> >
> > -----Original Message----- From: Joe Gresock
> > Sent: Saturday, May 31, 2014 9:34 AM
> > To: solr-user@lucene.apache.org
> > Subject: Uneven shard heap usage
> >
> >
> > Hi folks,
> >
> > I'm trying to figure out why one shard of an evenly-distributed 3-shard
> > cluster would suddenly start running out of heap space, after 9+ months
> of
> > stable performance.  We're using the "!" delimiter in our ids to
> distribute
> > the documents, and indeed the disk size of our shards are very similar
> > (31-32GB on disk per replica).
> >
> > Our setup is:
> > 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so
> > basically 2 physical CPUs), 24GB disk
> > 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever).  We
> > reserve 10g heap for each solr instance.
> > Also 3 zookeeper VMs, which are very stable
> >
> > Since the troubles started, we've been monitoring all 9 with jvisualvm,
> and
> > shards 2 and 3 keep a steady amount of heap space reserved, always having
> > horizontal lines (with some minor gc).  They're using 4-5GB heap, and
> when
> > we force gc using jvisualvm, they drop to 1GB usage.  Shard 1, however,
> > quickly has a steep slope, and eventually has concurrent mode failures in
> > the gc logs, requiring us to restart the instances when they can no
> longer
> > do anything but gc.
> >
> > We've tried ruling out physical host problems by moving all 3 Shard 1
> > replicas to different hosts that are underutilized, however we still get
> > the same problem.  We'll still be working on ruling out infrastructure
> > issues, but I wanted to ask the questions here in case it makes sense:
> >
> > * Does it make sense that all the replicas on one shard of a cluster
> would
> > have heap problems, when the other shard replicas do not, assuming a
> fairly
> > even data distribution?
> > * One thing we changed recently was to make all of our fields stored,
> > instead of only half of them.  This was to support atomic updates.  Can
> > stored fields, even though lazily loaded, cause problems like this?
> >
> > Thanks for any input,
> > Joe
> >
> >
> >
> >
> >
> > --
> > I know what it is to be in need, and I know what it is to have plenty.  I
> > have learned the secret of being content in any and every situation,
> > whether well fed or hungry, whether living in plenty or in want.  I can
> do
> > all this through him who gives me strength.    *-Philippians 4:12-13*
> >
>
>
>
> --
> I know what it is to be in need, and I know what it is to have plenty.  I
> have learned the secret of being content in any and every situation,
> whether well fed or hungry, whether living in plenty or in want.  I can do
> all this through him who gives me strength.    *-Philippians 4:12-13*
>

Re: Uneven shard heap usage

Posted by Joe Gresock <jg...@gmail.com>.
The last time we tried, it took as little as 2 minutes to happen.  It
basically happens under high query load (peak user hours during the day).
When we reduce functionality by disabling most searches, it stabilizes.
So it really is only on high query load.  Our ingest rate is fairly low.

It happens no matter how many nodes in the shard are up.


Joe


On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <ja...@basetechnology.com>
wrote:

> When you restart, how long does it take it hit the problem? And how much
> query or update activity is happening in that time? Is there any other
> activity showing up in the log?
>
> If you bring up only a single node in that problematic shard, do you still
> see the problem?
>
> -- Jack Krupansky
>
> -----Original Message----- From: Joe Gresock
> Sent: Saturday, May 31, 2014 9:34 AM
> To: solr-user@lucene.apache.org
> Subject: Uneven shard heap usage
>
>
> Hi folks,
>
> I'm trying to figure out why one shard of an evenly-distributed 3-shard
> cluster would suddenly start running out of heap space, after 9+ months of
> stable performance.  We're using the "!" delimiter in our ids to distribute
> the documents, and indeed the disk size of our shards are very similar
> (31-32GB on disk per replica).
>
> Our setup is:
> 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so
> basically 2 physical CPUs), 24GB disk
> 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever).  We
> reserve 10g heap for each solr instance.
> Also 3 zookeeper VMs, which are very stable
>
> Since the troubles started, we've been monitoring all 9 with jvisualvm, and
> shards 2 and 3 keep a steady amount of heap space reserved, always having
> horizontal lines (with some minor gc).  They're using 4-5GB heap, and when
> we force gc using jvisualvm, they drop to 1GB usage.  Shard 1, however,
> quickly has a steep slope, and eventually has concurrent mode failures in
> the gc logs, requiring us to restart the instances when they can no longer
> do anything but gc.
>
> We've tried ruling out physical host problems by moving all 3 Shard 1
> replicas to different hosts that are underutilized, however we still get
> the same problem.  We'll still be working on ruling out infrastructure
> issues, but I wanted to ask the questions here in case it makes sense:
>
> * Does it make sense that all the replicas on one shard of a cluster would
> have heap problems, when the other shard replicas do not, assuming a fairly
> even data distribution?
> * One thing we changed recently was to make all of our fields stored,
> instead of only half of them.  This was to support atomic updates.  Can
> stored fields, even though lazily loaded, cause problems like this?
>
> Thanks for any input,
> Joe
>
>
>
>
>
> --
> I know what it is to be in need, and I know what it is to have plenty.  I
> have learned the secret of being content in any and every situation,
> whether well fed or hungry, whether living in plenty or in want.  I can do
> all this through him who gives me strength.    *-Philippians 4:12-13*
>



-- 
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.    *-Philippians 4:12-13*

Re: Uneven shard heap usage

Posted by Jack Krupansky <ja...@basetechnology.com>.
When you restart, how long does it take to hit the problem? And how much
query or update activity is happening in that time? Is there any other
activity showing up in the log?

If you bring up only a single node in that problematic shard, do you still 
see the problem?

-- Jack Krupansky

-----Original Message----- 
From: Joe Gresock
Sent: Saturday, May 31, 2014 9:34 AM
To: solr-user@lucene.apache.org
Subject: Uneven shard heap usage

Hi folks,

I'm trying to figure out why one shard of an evenly-distributed 3-shard
cluster would suddenly start running out of heap space, after 9+ months of
stable performance.  We're using the "!" delimiter in our ids to distribute
the documents, and indeed the disk size of our shards are very similar
(31-32GB on disk per replica).

Our setup is:
9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so
basically 2 physical CPUs), 24GB disk
3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever).  We
reserve 10g heap for each solr instance.
Also 3 zookeeper VMs, which are very stable

Since the troubles started, we've been monitoring all 9 with jvisualvm, and
shards 2 and 3 keep a steady amount of heap space reserved, always having
horizontal lines (with some minor gc).  They're using 4-5GB heap, and when
we force gc using jvisualvm, they drop to 1GB usage.  Shard 1, however,
quickly has a steep slope, and eventually has concurrent mode failures in
the gc logs, requiring us to restart the instances when they can no longer
do anything but gc.

We've tried ruling out physical host problems by moving all 3 Shard 1
replicas to different hosts that are underutilized, however we still get
the same problem.  We'll still be working on ruling out infrastructure
issues, but I wanted to ask the questions here in case it makes sense:

* Does it make sense that all the replicas on one shard of a cluster would
have heap problems, when the other shard replicas do not, assuming a fairly
even data distribution?
* One thing we changed recently was to make all of our fields stored,
instead of only half of them.  This was to support atomic updates.  Can
stored fields, even though lazily loaded, cause problems like this?

Thanks for any input,
Joe





-- 
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.    *-Philippians 4:12-13*