Posted to solr-user@lucene.apache.org by Tomasz Wegrzanowski <to...@gmail.com> on 2011/11/22 05:40:40 UTC

Matching + and &

Hi,

I've been trying to match some phrases with + and & (like c++,
google+, r&d etc.),
but the tokenizer gets rid of them before I can do anything with synonym filters.

So I tried using CharFilters like this:

    <fieldType name="text" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="\+" replacement=" plus "/>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="&amp;" replacement=" and "/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms_case_sensitive.txt" ignoreCase="false"
                expand="true"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="query_synonyms.txt" ignoreCase="true" expand="false" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

This mostly works, but for a very small number of documents, mostly
those with a large number of pluses in them, the highlighter crashes.
It is definitely the highlighter: turning it off and reissuing the
query works fine, and if I replace the pluses with spaces and reindex,
the same query also runs fine. The exception looks like this:

Nov 21, 2011 11:35:11 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
	at java.lang.String.substring(String.java:1938)
	at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:237)
	at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:462)
	at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:378)
	at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:116)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:343)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
	at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
	at java.lang.Thread.run(Thread.java:619)

Is this a known issue?

Are CharFilters even the right way to approach it?

Or should I perhaps change or subclass StandardTokenizerFactory to
treat + and & as words?
I haven't looked at the StandardTokenizerFactory code yet, so I don't know
how feasible that would be.

Thanks,
Tomasz

Re: Matching + and &

Posted by Tomasz Wegrzanowski <to...@gmail.com>.
On 24 November 2011 15:18, Tomasz Wegrzanowski
<to...@gmail.com> wrote:
> On 22 November 2011 14:28, Jan Høydahl <ja...@cominvent.com> wrote:
>> Why do you need spaces in the replacement?
>>
>> Try pattern="\+" replacement="plus" - it will cause the transformed charstream to contain as many tokens as the original and avoid the highlighting crash.
>
> I tried that, it still crashes.
>
> Replacing it with single character, including single non-ASCII
> character, doesn't cause a crash.
>
> I'm sort of tempted to just reuse some CJK character, and synonym filter
> it to mean "plus".

In case anybody else runs into this problem, I found a solution.

The only thing that works and doesn't seem to crash Solr is replacing
each symbol with a single CJK character:

  <!-- they're not random, that's just what these characters mean -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="\+" replacement="加"/>
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="&amp;" replacement="和"/>

Followed by un-CJK-ing in the synonym filter:

# General rules
加 => plus
和 => and
# And any special synonyms you want:
r and d, r 和 d => r and d, research and development
s and p, s 和 p => s and p, standard and poor's
at and t, at 和 t => at and t, american telephone and telegraph

Users never see these CJK characters; they exist only briefly inside
the Solr analysis pipeline to keep the tokenizer happy.

I also tried private-use Unicode characters, but the tokenizer ignores them.
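
The flow above can be sketched in Python as a toy model. This is not
Solr code: the `analyze` function and its regex tokenizer are made up
for illustration, and the sketch assumes each placeholder character
comes out of the tokenizer as its own token, which is what the
"r 和 d" synonym rules above rely on.

```python
import re

# Placeholder characters from the charFilter config above.
PLACEHOLDERS = {"+": "加", "&": "和"}
# The two general synonym rules that turn placeholders back into words.
SYNONYMS = {"加": "plus", "和": "and"}


def analyze(text):
    """Toy analysis chain: char replacement, tokenizing, synonym mapping."""
    for symbol, placeholder in PLACEHOLDERS.items():
        text = text.replace(symbol, placeholder)
    # Assumption: every placeholder is emitted as a separate token,
    # while runs of other word characters stay together.
    tokens = re.findall(r"[加和]|[^\W加和]+", text.lower())
    return [SYNONYMS.get(token, token) for token in tokens]


print(analyze("R&D on C++"))
# ['r', 'and', 'd', 'on', 'c', 'plus', 'plus']
```

Because each symbol is swapped for exactly one character, the filtered
text has the same length as the original, so positions in one line up
with positions in the other.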

Re: Matching + and &

Posted by Tomasz Wegrzanowski <to...@gmail.com>.
On 22 November 2011 14:28, Jan Høydahl <ja...@cominvent.com> wrote:
> Why do you need spaces in the replacement?
>
> Try pattern="\+" replacement="plus" - it will cause the transformed charstream to contain as many tokens as the original and avoid the highlighting crash.

I tried that, it still crashes.

Replacing it with single character, including single non-ASCII
character, doesn't cause a crash.

I'm sort of tempted to just reuse some CJK character, and synonym filter
it to mean "plus".
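
One way to picture why the replacement length might matter: the
highlighter cuts fragments out of the original stored text using token
offsets, so offsets taken from a longer, filtered text and not mapped
back correctly can point past the end of the original. The Python
sketch below is only a toy model of that idea. The helper names
(`char_filter`, `token_offsets`, `uncorrected_offsets_fit`) are
hypothetical, and whether this is the actual cause of the crash is an
assumption, not something confirmed in this thread.

```python
import re


def char_filter(text, replacement):
    """Replace every '+' with `replacement` (toy stand-in for a char filter)."""
    return text.replace("+", replacement)


def token_offsets(filtered):
    """Return (start, end) offsets of word tokens in the filtered text."""
    return [match.span() for match in re.finditer(r"\w+", filtered)]


def uncorrected_offsets_fit(original, replacement):
    """True if every offset from the filtered text is valid in the original."""
    filtered = char_filter(original, replacement)
    return all(end <= len(original) for _, end in token_offsets(filtered))


original = "c++ g++ a+b"
print(uncorrected_offsets_fit(original, " plus "))  # False
print(uncorrected_offsets_fit(original, "plus"))    # False
print(uncorrected_offsets_fit(original, "加"))      # True
```

In this model both " plus " and "plus" lengthen the text and push
offsets out of range, while a one-character replacement keeps them
aligned, which matches the behaviour reported above.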

Re: Matching + and &

Posted by Jan Høydahl <ja...@cominvent.com>.
Why do you need spaces in the replacement?

Try pattern="\+" replacement="plus" - it will cause the transformed charstream to contain as many tokens as the original and avoid the highlighting crash.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
