Posted to solr-user@lucene.apache.org by Daniel Angelov <da...@gmail.com> on 2017/05/27 13:14:30 UTC

Long string in fq value parameter, more than 2000000 chars

Hello,

I would like to ask what the memory/CPU impact could be if the fq
parameter in many of the queries is a long string (fq={!terms
f=...}...,.... ) of around 2000000 chars. Most of the queries look like:
"q={!frange l=Timestamp1 u=Timestamp2}... + some other criteria". This is
with SolrCloud 4.1, on 10 hosts, with 3 collections holding around
10000000 docs in total. The queries go over all 3 collections.
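
For illustration, the kind of request described might look roughly like
this (field names and values are hypothetical, and the terms list actually
continues up to ~2000000 characters):

  q={!frange l=1495843200000 u=1495929600000}timestamp_field
  fq={!terms f=ref_id}100234,100567,101892,...,998877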

I sometimes get OOM exceptions, and I can see that GC times are pretty long.
The heap size is 64 GB on each host. The cache settings are the defaults.

Could the long fq parameter in these requests be causing the OOM
exceptions?


Thank you

Daniel

Re: Long string in fq value parameter, more than 2000000 chars

Posted by Rick Leir <rl...@leirtech.com>.
Daniel,
Is it worth saying that you have honkin' long queries and there must be a simpler way? (I am a big fan of KISS . . . Keep It Simple, Stupid.) I am not calling you names, just saying that this acronym comes up in just about every project I work on. It is akin to the Peter Principle, where design complexity inevitably increases to the breaking point, and then I get cranky. And you can probably tell us a solid reason for having the long queries.  Cheers -- Rick

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Long string in fq value parameter, more than 2000000 chars

Posted by Susheel Kumar <su...@gmail.com>.
If you are able to load the GC logs into GCViewer from around the time the
OOM happens, it can give you an idea of whether it was a sudden OOM or the
heap filled up over a period of time. This may help to nail down whether a
particular query is causing the problem or something else...

Thanks,
Susheel
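
For reference, GC logging that GCViewer can read is typically enabled with
JVM flags along these lines on the Java 7 era JVMs used with Solr 4.x (the
paths are just placeholders):

  -verbose:gc -Xloggc:/var/solr/logs/solr_gc.log
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
  -XX:+PrintGCApplicationStoppedTime
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/solr/dumps

The heap dump flags are optional, but a dump taken at the moment of the
OOM can show which objects are actually filling the heap.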

Re: Long string in fq value parameter, more than 2000000 chars

Posted by Daniel Angelov <da...@gmail.com>.
Thanks for the support so far.
I am going to analyze the logs in order to check the frequency of such
queries. BTW, I forgot to mention that the soft and the hard commits are
each set to 60 sec.

BR
Daniel

Re: Long string in fq value parameter, more than 2000000 chars

Posted by Erik Hatcher <er...@gmail.com>.
Another technique to consider is {!join}.  Index the cross-ref id "sets" into another core and use a short and sweet join, if there are stable sets of ids.

   Erik
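
For illustration, the setup Erik describes could look roughly like this
(core and field names are made up). Each stable set of ids gets indexed
once into a small side core, one document per referenced id:

  idsets core:   set_id=set_42, ref_id=100234
                 set_id=set_42, ref_id=100567
                 ...

The huge terms filter is then replaced by a short join against that core:

  fq={!join fromIndex=idsets from=ref_id to=id}set_id:set_42

Note that cross-core joins with fromIndex have restrictions in SolrCloud
(the "from" side generally has to be available alongside every shard it is
joined against), so this would need checking against the Solr version in use.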

Re: Long string in fq value parameter, more than 2000000 chars

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
On top of Shawn's analysis, I am also wondering how often those FQ
queries are reused. They and the matching documents are getting cached,
so there might be quite a bit of space taken up by that too.

Regards,
    Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced
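
If these huge filters are mostly one-off, it may also be worth keeping
them out of the filterCache entirely. A minimal sketch, assuming the terms
parser honours the cache local param in this Solr version:

  fq={!terms f=ref_id cache=false}100234,100567,...

The filterCache itself is sized in solrconfig.xml, something like:

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>

Each cached filter entry also holds a DocSet, which for a bitset is
roughly maxDoc/8 bytes, so a cache full of rarely reused entries is
wasted heap.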


Re: Long string in fq value parameter, more than 2000000 chars

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/27/2017 9:05 AM, Shawn Heisey wrote:
> On 5/27/2017 7:14 AM, Daniel Angelov wrote:
>> I would like to ask, what could be the memory/cpu impact, if the fq
>> parameter in many of the queries is a long string (fq={!terms
>> f=...}...,.... ) around 2000000 chars. Most of the queries are like:
>> "q={!frange l=Timestamp1 u=Timestamp2}... + some others criteria".
>> This is with SolrCloud 4.1, on 10 hosts, 3 collections, summary in
>> all collections are around 10000000 docs. The queries are over all 3
>> collections. 

Followup after a little more thought:

If we assume that the terms in your filter query are a generous 15
characters each (plus a comma), that means there are in the ballpark of
125 thousand of them in a two-million-byte filter query.  If they're
smaller, then there will be more of them.  At roughly 56 bytes of overhead
per String, that is at least another 7 million bytes of memory for 125000
terms once the terms parser splits that filter into separate String
objects.  On top of that comes the character data in each of those small
strings, which will be just a little less than the original four million
bytes, because it excludes the commas.  A fair amount of garbage will
probably also be generated while parsing the filter ... and once the query
is done, the 15 megabytes (or more) of memory for those strings becomes
garbage as well.  This is going to repeat for every shard.
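
Spelled out, the rough arithmetic behind that estimate is:

  2,000,000 chars / ~16 chars per term        =~ 125,000 terms
  125,000 terms x 56 bytes String overhead    =~ 7 MB
  character data of the split-out terms       =~ 4 MB (UTF-16, minus commas)
  the original 2,000,000-char string          =~ 4 MB (UTF-16)
  ------------------------------------------------------------
  roughly 15 MB of short-lived allocations per shard, per query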

I haven't even discussed what happens for memory requirements on the
Lucene frange parser, because I don't have any idea what those are, and
you didn't describe the function you're using.  I also don't know how
much memory Lucene is going to require in order to execute a terms
filter with at least 125K terms.  I don't imagine it's going to be small.

Thanks,
Shawn


Re: Long string in fq value parameter, more than 2000000 chars

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/27/2017 7:14 AM, Daniel Angelov wrote:
> I would like to ask, what could be the memory/cpu impact, if the fq
> parameter in many of the queries is a long string (fq={!terms
> f=...}...,.... ) around 2000000 chars. Most of the queries are like:
> "q={!frange l=Timestamp1 u=Timestamp2}... + some others criteria". This is
> with SolrCloud 4.1, on 10 hosts, 3 collections, summary in all collections
> are around 10000000 docs. The queries are over all 3 collections.
>
> I have sometimes OOM exceptions. And I can see GC times are pretty long.
> The heap size is 64 GB on each host. The cache settings are the default.
>
> Is it possible the long fq parameter in the requests to cause OOM
> exceptions?

A two million character string in Java will take just over four million
bytes of memory.  This is because Java uses UTF-16 internally, and
overhead on a String object is approximately 56 bytes.  With multiple
shards, that string is going to get copied for each shard.  There might
be other places in the Solr and Lucene code where the string will also
get copied multiple times.  At four megabytes for each copy, that's
going to eat up memory quickly.  It will also take a non-trivial amount
of time to accomplish each copy.
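
In round numbers:

  2,000,000 chars x 2 bytes per char (UTF-16)  =~ 4,000,000 bytes
  plus ~56 bytes of String object and array header overhead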

OOM exceptions on a 64GB heap?  Even if we consider the info just
mentioned and there are several copies of the two million character
string floating around, it sounds like you are doing some massively
complex queries, or that your index size is beyond gargantuan.  I cannot
imagine needing a 64GB heap for 30 million documents unless the system
is handling some very unusual queries, and/or an enormous index, and/or
some *extremely* large Solr caches.

I suspect there are many details that we haven't heard yet.  I'm not
even sure exactly what to ask for, so I'll ask for the moon:

On a per-server basis, can we see the following info?

Total memory installed in the server.
How many Solr instances are running on the server.
The total amount of max heap memory allocated to Solr.
A list of other things running on the server besides Solr.
Total size of the solr home directory.
How many documents does that solr home size represent? If there are
multiple shards/replicas, all of them must be counted.
solrconfig.xml and the schema would be useful.

More general questions:

What does a typical query involve?
If there are facets, describe each field used in a facet -- term
cardinality, typical contents, analysis, etc.

If the system is running an OS with the "top" utility available, run top
(not htop or any other variety), press shift-M to sort by memory, grab a
screenshot, and put the information on the Internet somewhere we can
access it with a URL.  If it's on Windows, similar information can be
obtained with Resource Monitor, sort by "Working Set" on the Memory tab.

Thanks,
Shawn