Posted to solr-user@lucene.apache.org by Webster Homer <we...@milliporesigma.com> on 2019/01/02 19:55:04 UTC

RE: Query kills Solrcloud

We are still having serious problems with our SolrCloud cluster failing due to this problem.
The problem is clearly data related.
How can I determine which documents are being searched? Is it possible to get Solr/Lucene to output the doc IDs being searched?

I believe this is a Lucene bug, but I need to narrow the focus to a smaller number of records, and I'm not certain how to do that efficiently. Are there debug parameters that could help?
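As a starting point, Solr's debug parameters can show the parsed query and per-component timings. A minimal sketch of the kind of request that might help; the host, collection, field name, and query here are placeholders, and the snippet only builds the URL rather than sending it:

```python
from urllib.parse import urlencode

# Placeholder host and collection name -- substitute your own node.
base = "http://localhost:8983/solr/my_collection/select"

params = {
    "q": "my_ja_field:(test)",  # hypothetical field and term
    "debugQuery": "true",       # adds parsed query and per-doc score explanations
    "debug": "timing",          # adds per-component timing for each search stage
    "rows": 0,                  # skip fetching stored fields while diagnosing
}
url = base + "?" + urlencode(params)
print(url)
```

The timing section breaks QTime down by query component, which should at least show whether the time is spent in the main query, highlighting, or faceting.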

-----Original Message-----
From: Webster Homer <we...@milliporesigma.com> 
Sent: Thursday, December 20, 2018 3:45 PM
To: solr-user@lucene.apache.org
Subject: Query kills Solrcloud

We are experiencing almost nightly Solr crashes due to Japanese queries. I’ve been able to determine that one of our field types seems to be the culprit. When I run a much reduced version of the query against our DEV SolrCloud, I see memory usage jump from less than 1 GB to 5 GB using only a single field in the query. The collection is fairly small, ~411,000 documents, of which only ~25,000 have searchable Japanese fields. I have been able to simplify the query to run against a single Japanese field in the schema. The JVM memory jumps from less than 1 GB to close to 5 GB, and back down. The QTime is 36959 ms, which seems high for 2500 documents. Indeed, the single field that I’m using in my test case appears in only 2031 documents.

I extended the query to 5 fields and watched the memory usage in the Solr console. The memory usage goes to almost 6 GB with a QTime of 100909 ms. The console shows connection errors, and when I look at the Cloud graph, all the replicas on the node where I submitted the query are down. In DEV the replicas eventually recover. In production, with the full query, which has many more fields in the qf parameter, the SolrCloud cluster dies.
One example query term:
ジエチルアミノヒドロキシベンゾイル安息香酸ヘキシル
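For a sense of scale, here is a rough count of what CJKBigramFilterFactory with outputUnigrams="true" emits for that term. This is a simplification that ignores the ICU tokenizer and treats the term as one unbroken CJK run, which is what the charFilter above tends to produce:

```python
term = "ジエチルアミノヒドロキシベンゾイル安息香酸ヘキシル"

# With outputUnigrams="true" the bigram filter emits every single
# character plus every adjacent pair, so an n-character CJK run
# becomes n + (n - 1) = 2n - 1 query terms, per field searched.
unigrams = list(term)
bigrams = [term[i:i + 2] for i in range(len(term) - 1)]
total = len(unigrams) + len(bigrams)
print(f"{len(term)} characters -> {total} query terms per field")
```

Multiply that by every field in qf and the Boolean query gets large quickly, though that alone would not normally explain multi-gigabyte heap spikes.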

This is the field type that we have defined:
   <fieldtype name="text_deep_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
     <analyzer type="index">
       <!-- remove spaces between CJK characters -->
       <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])" replacement="$1"/>
       <tokenizer class="solr.ICUTokenizerFactory"/>
       <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
       <filter class="solr.CJKWidthFilterFactory"/>
       <!-- Transform Traditional Han to Simplified Han -->
       <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
       <!-- Transform Hiragana to Katakana just as was done for Endeca -->
       <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
       <!-- NFKC, case folding, diacritics removed -->
       <filter class="solr.ICUFoldingFilterFactory"/>
       <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true"/>
     </analyzer>

     <analyzer type="query">
       <!-- remove spaces between CJK characters -->
       <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])" replacement="$1"/>
       <tokenizer class="solr.ICUTokenizerFactory"/>
       <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.ICUTokenizerFactory"/>
       <filter class="solr.CJKWidthFilterFactory"/>
       <!-- Transform Traditional Han to Simplified Han -->
       <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
       <!-- Transform Hiragana to Katakana just as was done for Endeca -->
       <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
       <!-- NFKC, case folding, diacritics removed -->
       <filter class="solr.ICUFoldingFilterFactory"/>
       <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true"/>
     </analyzer>
   </fieldtype>
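One way to see exactly what this chain produces for a given input is Solr's field-analysis endpoint, which returns the token stream after every charFilter, tokenizer, and filter stage for both analyzers. The host and collection below are placeholders, and the snippet only constructs the request URL:

```python
from urllib.parse import urlencode

# Placeholder node and collection -- point this at one of your own.
base = "http://localhost:8983/solr/my_collection/analysis/field"

params = {
    "analysis.fieldtype": "text_deep_cjk",
    "analysis.fieldvalue": "ジエチルアミノヒドロキシベンゾイル安息香酸ヘキシル",
    "analysis.query": "ジエチルアミノヒドロキシベンゾイル安息香酸ヘキシル",
    "wt": "json",
}
# Fetch this URL (curl works) to see the index-time and query-time
# token streams stage by stage.
url = base + "?" + urlencode(params)
print(url)
```

This is the same data the admin UI's Analysis screen shows, but via curl it avoids the console's timeout and can be diffed across documents.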

Why is searching even one field of this type so expensive?
I suspect that this is data related, as other queries return in far less than a second. What are good strategies for determining which documents are causing the problem? I’m new to debugging Solr, so I could use some help. I’d like to reduce the number of records to the minimum needed to create a small dataset that reproduces the problem.
Right now our only option is to stop using this field type, but it does improve the relevancy of searches that don’t cause Solr to crash.

It would be a great help if the Solr console would not time out on these queries; is there a way to turn off the timeout?
We are running Solr 7.2.

Re: Query kills Solrcloud

Posted by Gus Heck <gu...@gmail.com>.
Are you able to re-index a subset into a new collection?

For control of timeouts I would suggest Postman or curl, or some other
non-browser client.
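That suggestion can also be scripted. A minimal sketch using only the Python standard library, with a generous client-side timeout; the URL in the example call is a placeholder and the call itself is left commented out so nothing is sent:

```python
from urllib.request import urlopen

def run_query(url, timeout=600):
    """Send a long-running Solr query with a generous client-side
    timeout (in seconds), instead of relying on the admin console,
    which gives up on slow requests."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

# Example (placeholder host/collection -- not executed here):
# print(run_query("http://localhost:8983/solr/my_collection/select?q=*:*"))
```

The same effect with curl is `curl -m 600 '<url>'`; either way the client waits long enough for QTimes in the 100-second range.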

On Wed, Jan 2, 2019 at 2:55 PM Webster Homer <
webster.homer@milliporesigma.com> wrote:



-- 
http://www.the111shift.com