Posted to solr-user@lucene.apache.org by slee <sl...@gmail.com> on 2016/09/21 20:03:38 UTC

Performance Issue when querying Multivalued fields [SOLR 6.1.0]

I've been doing a lot of reading on this forum about performance on
multivalued fields, and nothing has helped. When I query on single-valued
fields, the response time is fairly quick (typically < 1 sec). However, when
I query on multivalued fields, the response takes 2 to 3 minutes.

Here's my current environment:
CPU: Intel Xeon E5-2637 v3 @ 3.5GHz
RAM: 16GB
OS: Windows 7 64-bit
HD Controller: SCSI

SOLR Documents: 17 million.
Average # of terms in a multivalued field: 54~60
Schema: Multivalued field has indexed="true"

I've set both Xms and Xmx to 5g, using the -m 5g option. Another thing I
noticed is that every time I query on the multivalued field, memory
consumption goes over 90%. Could this also be the cause of the issue? I have
tried MMapDirectoryFactory; the results seem to be the same (vs. the default
NRTCachingDirectoryFactory).

Please help. Any advice would be appreciated.
Thanks.




--
View this message in context: http://lucene.472066.n3.nabble.com/Performance-Issue-when-querying-Multivalued-fields-SOLR-6-1-0-tp4297255.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

Posted by Erick Erickson <er...@gmail.com>.
I totally missed EdgeNGram. Good catch Alex!

Yeah, that's a killer. My shot in the dark here is that
your analysis chain isn't the best choice to support your use-case and you're
shooting yourself in the foot. So let's back up and talk
about your use-case and maybe re-define your analysis
chain for better performance.

Best,
Erick

On Thu, Sep 22, 2016 at 8:21 AM, Alexandre Rafalovitch
<ar...@gmail.com> wrote:
> Well,
>
> I am guessing this is the line that's causing the problem:
> <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
> maxGramSize="50"/>
>
> Run your real sample for that field against your indexing definition
> in the Admin UI and see how many tokens you end up with. You may have 50
> tokens, but if each of them generates up to 48 representations...
>
> Regards,
>     Alex.

Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Well,

I am guessing this is the line that's causing the problem:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50"/>

Run your real sample for that field against your indexing definition
in the Admin UI and see how many tokens you end up with. You may have 50
tokens, but if each of them generates up to 48 representations...
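
To put a rough number on that guess, here is a small Python sketch (an illustration only, not Solr's implementation) of what an edge n-gram filter with minGramSize=3 and maxGramSize=50 would emit for a single keyword token. The function name edge_ngrams is made up for this example, and the sample value is a lowercased form of one of the values posted elsewhere in this thread.

```python
def edge_ngrams(token, min_gram=3, max_gram=50):
    """Return the leading-edge n-grams of token, shortest first."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# One multivalued entry, already lowercased as the analysis chain would do.
value = "partyid#b2asef9ajp5p9oll1190"
grams = edge_ngrams(value)

print(len(grams))   # number of indexed terms produced by this one value
print(grams[:3])    # the shortest prefixes
print(grams[-1])    # the longest gram is the whole value (it is under 50 chars)
```

Only prefixes of the whole value become terms, so a fragment such as "sef" in the middle of a value is never indexed as a term of its own. A value of 50 or more characters yields one gram for each length from 3 to 50, and that is multiplied by every value in every document.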

Regards,
    Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 22 September 2016 at 22:08, slee <sl...@gmail.com> wrote:
> Here's what I have defined in my schema:
> <fieldType name="c_text" class="solr.TextField" positionIncrementGap="100">
>     <analyzer type="index">
>       <tokenizer class="solr.KeywordTokenizerFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.ASCIIFoldingFilterFactory"/>
>       <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
> maxGramSize="50"/>
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer class="solr.KeywordTokenizerFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.ASCIIFoldingFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> <field name="global_Value" type="c_text" multiValued="true" indexed="true"
> required="true" stored="true"/>
>
> This is what I send in the query (2 values):
> q=global_Value:*mas+AND+global_Value:*sef&df=text&rows=5&version=2.2&echoParams=explicit&fl=global_Value
>
> In addition, memory is taking way over 90%, given the heap space set at 5g.

Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Not fully clear still, but perhaps you need several fields, at least one of
which just contains your SEF and OFF values serving effectively as binary
switches (FQ matches). And then maybe you strip the leading IDs that you
are not matching on.

Remember your Solr data shape does not need to match your original data
shape. Especially with extra fields that you could get through copyField
commands or through UpdateRequestProcessor duplicates. And you don't need
to store those duplicates, just index them for most effective search.

And yes, a reversing filter and edge n-grams together mean you don't need
wildcard queries.

Regards,
    Alex

On 23 Sep 2016 1:49 AM, "slee" <sl...@gmail.com> wrote:

> Alex,
>
> You do have a point with EdgeNGramFilterFactory. As requested, I've
> attached a sample screenshot for your review.
> <http://lucene.472066.n3.nabble.com/file/n4297542/sample.png>
>
> Erick,
>
> Here's my use-case. Assume I have the following terms stored in
> global_Value:
> - executionvenuetype#*OFF*-FACILITY
> - partyid#B2A*SEF*9AJP5P9OLL1190
>
> Now, I want to retrieve any document with a term in global_Value that
> contains the keywords "off" and "sef". The leading wildcard is intentional,
> not a mail formatting issue. These fields typically contain GUIDs and some
> financial terms (e.g. bonds, swaps, etc.). If I don't use any wildcard, then
> it's an exact match, but my use-case dictates that a partial match should
> also be retrieved.
>
> So what's my best bet for an analyzer in such cases?

Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Yes, swap will switch which core each name points to, in a non-Cloud setup.

Just remember, when you are deleting the old one, that the directory name
does not get renamed. Only the core name in the core.properties file changes.
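
For reference, the same swap can be requested through the CoreAdmin API with action=SWAP. The Python sketch below only builds the request URL and does not send anything; the host, port and the core names "live" and "rebuild" are placeholders assumed for illustration, not taken from this thread.

```python
from urllib.parse import urlencode


def swap_url(base, core, other):
    """Build a CoreAdmin SWAP request URL (no request is sent)."""
    params = {"action": "SWAP", "core": core, "other": other}
    return f"{base}/admin/cores?{urlencode(params)}"


# Placeholder host and core names; substitute your own.
print(swap_url("http://localhost:8983/solr", "live", "rebuild"))
```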

Regards,
   Alex

On 24 Sep 2016 10:28 AM, "slee" <sl...@gmail.com> wrote:

Erick / Alex,

I want to thank you both. Your hints helped me understand SOLR a bit better. I
ended up with reversewildcard, and it speeds up performance a lot. That's
what I expect from SOLR... I also no longer experience the huge memory
hog.

The only downside I can think of is that you need to re-index when you change
the schema. But I can live with that, since I'll have 2 machines, one
for reading and the other for indexing. I'll swap when the indexing
is done. I presume that's what the swap from the Admin UI is for, right?




Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

Posted by slee <sl...@gmail.com>.
Erick / Alex,

I want to thank you both. Your hints helped me understand SOLR a bit better. I
ended up with reversewildcard, and it speeds up performance a lot. That's
what I expect from SOLR... I also no longer experience the huge memory
hog.

The only downside I can think of is that you need to re-index when you change
the schema. But I can live with that, since I'll have 2 machines, one
for reading and the other for indexing. I'll swap when the indexing
is done. I presume that's what the swap from the Admin UI is for, right?




Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
But if "SEF" and "OFF" are known to be searched for, and especially if
they are well delimited, they could just be pulled out into a separate
field and checked with an FQ.

In the end, there may be no need for either EdgeNGram or wildcards.
Just twisting the data during _indexing_ to represent the business
domain requirements.
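
As a rough sketch of that idea in Python: the marker list, the helper name extract_markers and the extra field global_Flags are all assumptions made up for illustration. The plain substring test is also naive, so it only makes sense if the markers really are well delimited in the data.

```python
KNOWN_MARKERS = ("sef", "off")  # assumed list of keywords worth flagging


def extract_markers(values, markers=KNOWN_MARKERS):
    """Return the sorted markers that occur in any of the raw values."""
    lowered = [v.lower() for v in values]
    return sorted(m for m in markers if any(m in v for v in lowered))


# Reshape the document before sending it to Solr for indexing.
doc = {
    "global_Value": [
        "executionvenuetype#OFF-FACILITY",
        "partyid#B2ASEF9AJP5P9OLL1190",
    ]
}
doc["global_Flags"] = extract_markers(doc["global_Value"])  # hypothetical field
print(doc["global_Flags"])
```

A query could then filter with something like fq=global_Flags:sef&fq=global_Flags:off, which is an exact term match and needs neither n-grams nor wildcards.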

Regards,
   Alex.

On 23 September 2016 at 10:39, Erick Erickson <er...@gmail.com> wrote:
> If you can break these up into tokens somehow, that's clearly best. But from the
> patterns you show it's not likely. WordDelimiterFilterFactory won't quite work
> since it wouldn't be able to separate ASEF into the token SEF.

Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

Posted by Erick Erickson <er...@gmail.com>.
If you can break these up into tokens somehow, that's clearly best. But from the
patterns you show it's not likely. WordDelimiterFilterFactory won't quite work
since it wouldn't be able to separate ASEF into the token SEF.

You'll have a _lot_ fewer terms if you don't use EdgeNGram. Try just
using bigrams (i.e. NGramFilterFactory) with both minGramSize and maxGramSize
set to 2.

Now you do phrase searches (also automatic) on pairs. So in your example
some of the pairs are:
#o
of
ff
f-

To find off, you search for the _phrase_ "of ff". There'll be some
fiddling here to make it all work.
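
A small Python sketch of the bigram idea, as an illustration rather than what Solr does internally: the helper names bigrams and phrase_match are made up here, and the "phrase" is modelled simply as the query's bigrams appearing consecutively among the indexed bigrams.

```python
def bigrams(text):
    """Return the overlapping 2-character grams of text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def phrase_match(indexed, query):
    """True if the query's bigrams occur consecutively in the indexed bigrams."""
    q = bigrams(query)
    n = len(q)
    return n > 0 and any(
        indexed[i:i + n] == q for i in range(len(indexed) - n + 1)
    )


indexed = bigrams("executionvenuetype#off-facility")
print(bigrams("off"))                 # the pairs searched for as a phrase
print(phrase_match(indexed, "off"))   # substring present in the value
print(phrase_match(indexed, "sef"))   # substring absent from the value
```

Each value contributes roughly one term per character, and the set of distinct bigrams in the index stays small, which is where the savings over long edge n-grams would come from.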

Best,
Erick

On Thu, Sep 22, 2016 at 11:49 AM, slee <sl...@gmail.com> wrote:
> Alex,
>
> You do have a point with EdgeNGramFilterFactory. As requested, I've attached
> a sample screenshot for your review.
> <http://lucene.472066.n3.nabble.com/file/n4297542/sample.png>
>
> Erick,
>
> Here's my use-case. Assume I have the following terms stored in
> global_Value:
> - executionvenuetype#*OFF*-FACILITY
> - partyid#B2A*SEF*9AJP5P9OLL1190
>
> Now, I want to retrieve any document with a term in global_Value that
> contains the keywords "off" and "sef". The leading wildcard is intentional,
> not a mail formatting issue. These fields typically contain GUIDs and some
> financial terms (e.g. bonds, swaps, etc.). If I don't use any wildcard, then
> it's an exact match, but my use-case dictates that a partial match should
> also be retrieved.
>
> So what's my best bet for an analyzer in such cases?

Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

Posted by slee <sl...@gmail.com>.
Alex,

You do have a point with EdgeNGramFilterFactory. As requested, I've attached
a sample screenshot for your review.
<http://lucene.472066.n3.nabble.com/file/n4297542/sample.png>

Erick,

Here's my use-case. Assume I have the following terms stored in
global_Value:
- executionvenuetype#*OFF*-FACILITY
- partyid#B2A*SEF*9AJP5P9OLL1190

Now, I want to retrieve any document with a term in global_Value that
contains the keywords "off" and "sef". The leading wildcard is intentional,
not a mail formatting issue. These fields typically contain GUIDs and some
financial terms (e.g. bonds, swaps, etc.). If I don't use any wildcard, then
it's an exact match, but my use-case dictates that a partial match should
also be retrieved.

So what's my best bet for an analyzer in such cases?




Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

Posted by Erick Erickson <er...@gmail.com>.
Wait: Are you really doing leading wildcard queries? If so, that's likely
the root of the problem. Unless you add ReversedWildcardFilterFactory to your
analysis chain, Lucene has to enumerate your entire set of terms to find
likely candidates, which takes a lot of resources. What happens if you use
similar trailing wildcards? And what happens when you use simple
non-wildcard queries?
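
To illustrate why a reversing filter helps, here is a simplified Python sketch; it is not how Lucene stores terms, and the term list and the helper prefix_range are made up for this example. With reversed terms kept in sorted order, a leading-wildcard query turns into a prefix lookup that a binary search can locate, instead of a scan over every term.

```python
from bisect import bisect_left


def prefix_range(sorted_terms, prefix):
    """Return the terms starting with prefix, located by binary search."""
    lo = bisect_left(sorted_terms, prefix)
    hi = lo
    while hi < len(sorted_terms) and sorted_terms[hi].startswith(prefix):
        hi += 1
    return sorted_terms[lo:hi]


terms = ["b2asef", "facility", "partyid", "venue-sef"]  # made-up index terms
reversed_terms = sorted(t[::-1] for t in terms)

# The leading-wildcard query *sef becomes the prefix query fes* on reversed terms.
hits = [t[::-1] for t in prefix_range(reversed_terms, "sef"[::-1])]
print(hits)
```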

Or is this just bolding that gets translated to asterisks by the mail
formatting?

Finally, what are typical values in this field? I'm really asking if your use of
KeywordTokenizer is the best choice here. It often is, but I've seen it
misused, so I thought we should check.

Best,
Erick



On Thu, Sep 22, 2016 at 8:08 AM, slee <sl...@gmail.com> wrote:
> Here's what I have defined in my schema:
> <fieldType name="c_text" class="solr.TextField" positionIncrementGap="100">
>     <analyzer type="index">
>       <tokenizer class="solr.KeywordTokenizerFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.ASCIIFoldingFilterFactory"/>
>       <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
> maxGramSize="50"/>
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer class="solr.KeywordTokenizerFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.ASCIIFoldingFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> <field name="global_Value" type="c_text" multiValued="true" indexed="true"
> required="true" stored="true"/>
>
> This is what I send in the query (2 values):
> q=global_Value:*mas+AND+global_Value:*sef&df=text&rows=5&version=2.2&echoParams=explicit&fl=global_Value
>
> In addition, memory is taking way over 90%, given the heap space set at 5g.

Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

Posted by slee <sl...@gmail.com>.
Here's what I have defined in my schema:
<fieldType name="c_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
  </fieldType>

<field name="global_Value" type="c_text" multiValued="true" indexed="true"
required="true" stored="true"/>

This is what I send in the query (2 values):
q=global_Value:*mas+AND+global_Value:*sef&df=text&rows=5&version=2.2&echoParams=explicit&fl=global_Value

In addition, memory is taking way over 90%, given the heap space set at 5g.





Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Could you share the field and type definition and the type of query you are
doing?

Under the covers, multi valued fields are mostly the same as multi term
strings, just with large gaps between term positions.

And if you return the same set of fields, rehydration of the document
should be the same as well.

Regards,
   Alex

On 22 Sep 2016 8:30 AM, "Stan Lee" <sl...@gmail.com> wrote:

> I ran 3 sets of queries as follows:
> - multi-value field only: slow
> - single-value field only: fast
> - multi-value and single-value fields combined: slow
>
> So yes, the difference is based on which field you search against. I'm
> experiencing the same issue described here:
>
> http://stackoverflow.com/questions/29745135/performance-issue-with-multivalued-field-in-lucene
>
> This individual ended up using Elasticsearch, which doesn't help me. I'm
> wondering if multivalued fields cannot exceed a certain number of terms? I
> only have 54 to 60 terms.

Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

Posted by Stan Lee <sl...@gmail.com>.
I ran 3 sets of queries as follows:
- multi-value field only: slow
- single-value field only: fast
- multi-value and single-value fields combined: slow

So yes, the difference is based on which field you search against. I'm experiencing the same issue described here:

http://stackoverflow.com/questions/29745135/performance-issue-with-multivalued-field-in-lucene

This individual ended up using Elasticsearch, which doesn't help me. I'm wondering if multivalued fields cannot exceed a certain number of terms? I only have 54 to 60 terms.


  Original Message  
From: arafalov@gmail.com
Sent: September 21, 2016 7:40 PM
To: solr-user@lucene.apache.org
Reply-to: solr-user@lucene.apache.org
Subject: Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

Do you _return_ the same set of fields in both queries? Is the difference
truly just which field you search against?

Regards,
    Alex


Re: Performance Issue when querying Multivalued fields [SOLR 6.1.0]

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Do you _return_ the same set of fields in both queries? Is the difference
truly just which field you search against?

Regards,
    Alex

On 22 Sep 2016 3:03 AM, "slee" <sl...@gmail.com> wrote:

> I've been doing a lot of reading on this forum about performance on
> multivalued fields, and nothing has helped. When I query on single-valued
> fields, the response time is fairly quick (typically < 1 sec). However, when
> I query on multivalued fields, the response takes 2 to 3 minutes.
>
> Here's my current environment:
> CPU: Intel Xeon E5-2637 v3 @ 3.5GHz
> RAM: 16GB
> OS: Windows 7 64-bit
> HD Controller: SCSI
>
> SOLR Documents: 17 million.
> Average # of terms in a multivalued field: 54~60
> Schema: Multivalued field has indexed="true"
>
> I've set both Xms and Xmx to 5g, using the -m 5g option. Another thing I
> noticed is that every time I query on the multivalued field, memory
> consumption goes over 90%. Could this also be the cause of the issue? I have
> tried MMapDirectoryFactory; the results seem to be the same (vs. the default
> NRTCachingDirectoryFactory).
>
> Please help. Any advice would be appreciated.
> Thanks.
>