Posted to solr-user@lucene.apache.org by Bojan Miletic <ex...@gmail.com> on 2011/12/17 15:31:38 UTC

Problem with synonyms containing whitespace

Hi everyone,

I'm having a bit of a problem with synonyms.

My synonyms.txt looks like this:

> class\ 3\ (gvw\ 10001\ -\ 14000), light
> class 4 (gvw 14001 - 16000), class 5 (gvw 16001 - 19500), class 6 (gvw 19501 - 26000), medium
>

When I test in the analyzer on the Solr admin page, "light" is correctly
recognised as one of the synonyms, but when I search for "class 3 (gvw
10001 - 14000)" the analyzer can't find any synonyms.
As you can see, I tried escaping the whitespace with \ but that didn't help.

The configuration of the field I'm using is:

> <!-- lowercases the entire field value, keeping it as a single token.  -->
>       <!-- used for working with synonyms -->
>     <fieldType name="lowercase_syn" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory" />
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>                 ignoreCase="true" expand="true"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory" />
>       </analyzer>
>     </fieldType>
>

Could you please help me?
Thanks

RE: Problem with synonyms containing whitespace

Posted by Soumitra Banerjee <so...@gmail.com>.
Hi all -

I have a similar problem, as follows:

Some of the synonyms for acetone are as follows:

1090,b-Ketopropane,Dimethyl formaldehyde,2-Propanone,dimethylketone,Ketone,
dimethyl-,methyl ketone,propan-2-one,propanone,β-Ketopropane,67-64-1

The analyzer during indexing is splitting b-Ketopropane to b and b-Ketopropane,
and Dimethyl formaldehyde to Dimethyl and formaldehyde.

How should I format my synonyms to avoid the splitting?

My Schema is as follows:

<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.SnowballPorterFilterFactory" language="English"
                protected="protwords.txt" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <!--<filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true" />-->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.SnowballPorterFilterFactory" language="English"
                protected="protwords.txt" />
      </analyzer>
    </fieldType>

As always thanks for the help.

Regards, Soumitra

 



RE: Problem with synonyms containing whitespace

Posted by Soumitra Banerjee <so...@gmail.com>.
Hi all -

I forgot to mention what I want to achieve in my search:

When a search is performed for documents that contain a product name, say A, I
want all the documents whose product name is exactly A to be displayed first
(exact match), followed by all documents that contain A in their product name
(documents where A appears in the middle, followed by documents whose product
name ends with A), followed by all documents whose product name is a synonym
of A. E.g.

A <- exact match
XXAXX <- contains
XXXXA <- ends with
All synonyms of A <- synonyms

Please let me know whether this search makes sense and whether Solr can
produce these results.

Regards, Soumitra

 



Re: Problem with synonyms containing whitespace

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
SynonymFilterFactory can take a tokenizerFactory attribute, which is used when
reading the synonyms file. If you don't specify it, WhitespaceTokenizerFactory
is used.

https://builds.apache.org/job/Solr-trunk/javadoc/org/apache/solr/analysis/SynonymFilterFactory.html

koji
-- 
Check out "Query Log Visualizer" for Apache Solr
http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html
http://www.rondhuit.com/en/


Re: Problem with synonyms containing whitespace

Posted by "srujan.kommoju" <sr...@gmail.com>.
Thanks for the solution, it's working fine for me.
I had the same configuration but was missing
tokenizerFactory="solr.KeywordTokenizerFactory" in the filter tag. That's
great.




Re: Problem with synonyms containing whitespace

Posted by Bojan Miletic <ex...@gmail.com>.
Thanks guys,

The problem was the default tokenizer that SynonymFilterFactory was using. I
changed the filter definition to:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"
         tokenizerFactory="solr.KeywordTokenizerFactory"/>

and now everything is working ^_^.
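
For reference, here is a sketch of how the complete field type might look with this change applied. It assumes the rest of the original lowercase_syn definition stays as posted; the comments are added for illustration and are not part of the original configuration.

```xml
<!-- Sketch only: the lowercase_syn field type from the original post,
     with tokenizerFactory added to the synonym filter. -->
<fieldType name="lowercase_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Keeps the entire field value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- tokenizerFactory controls how entries in synonyms.txt are tokenized
         when the file is read; if it is omitted, WhitespaceTokenizerFactory
         is used and multi-word entries are split on whitespace -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"
            tokenizerFactory="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The idea is that the synonyms file is then tokenized the same way as the field itself, so an entry such as "class 4 (gvw 14001 - 16000)" is read as one token and can match the single token produced by the keyword tokenizer at index time.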


Re: Problem with synonyms containing whitespace

Posted by Ahmet Arslan <io...@yahoo.com>.
> When testing in analyzer by using solr admin light gets
> correctly
> recognised as one of the synonims, but when searching
> for  class 3 (gvw
> 10001 - 14000) analyzer can't find any synonyms.

The QueryParser splits the query string on whitespace before it reaches the analysis phase. You can try querying with quotes, or using the term or raw query parser:
q="class 3 (gvw 10001 - 14000)"
q={!term f=fieldName}class 3 (gvw 10001 - 14000)