Posted to solr-user@lucene.apache.org by Bojan Miletic <ex...@gmail.com> on 2011/12/17 15:31:38 UTC
Problem with synonyms containing whitespace
Hi everyone,
I'm having a bit of a problem with synonyms.
My synonyms.txt looks like this:
> class\ 3\ (gvw\ 10001\ -\ 14000), light
> class 4 (gvw 14001 - 16000), class 5 (gvw 16001 - 19500), class 6 (gvw
> 19501 - 26000), medium
>
When testing in the analyzer via the Solr admin, "light" is correctly
recognised as one of the synonyms, but when searching for class 3 (gvw
10001 - 14000) the analyzer can't find any synonyms.
As you can see, I tried escaping the whitespace with \ but that didn't help.
The configuration of the field type is:
> <!-- lowercases the entire field value, keeping it as a single token. -->
> <!-- used for working with synonyms -->
> <fieldType name="lowercase_syn" class="solr.TextField"
> positionIncrementGap="100">
> <analyzer type="index">
> <tokenizer class="solr.KeywordTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory" />
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> </analyzer>
> <analyzer type="query">
> <tokenizer class="solr.KeywordTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory" />
> </analyzer>
> </fieldType>
>
Could you please help me?
Thanks
RE: Problem with synonyms containing whitespace
Posted by Soumitra Banerjee <so...@gmail.com>.
Hi all -
I have a similar problem, as follows:
Some of the synonyms for acetone are as follows:
1090,b-Ketopropane,Dimethyl formaldehyde,2-Propanone,dimethylketone,Ketone,
dimethyl-,methyl ketone,propan-2-one,propanone,β-Ketopropane,67-64-1
During indexing, the analyzer is splitting
b-Ketopropane into b and b-Ketopropane
and Dimethyl formaldehyde into
Dimethyl and formaldehyde
How should I format my synonyms to avoid the splitting?
My Schema is as follows:
<fieldType name="text_syn" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true" />
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords.txt" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<!--<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true" />-->
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords.txt" />
</analyzer>
</fieldType>
As always thanks for the help.
Regards, Soumitra
RE: Problem with synonyms containing whitespace
Posted by Soumitra Banerjee <so...@gmail.com>.
Hi all -
I forgot to mention what I want to achieve in my search:
When a search is performed for documents that contain a product name, say A, I
want all the documents whose product name is A to be displayed first (exact match),
followed by all documents that contain A in their product name (documents
where A appears in the middle, followed by documents whose product name ends
with A), followed by all documents whose product name is a synonym of A.
E.g.
A <- Exact match
XXAXX <- contains
XXXXA <- ends with
All Synonyms of A <- synonyms
Please let me know if this search makes sense or not and if Solr can produce
the results.
Regards, Soumitra
Re: Problem with synonyms containing whitespace
Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
SynonymFilterFactory can take a tokenizerFactory attribute, which is used when
reading the synonyms file. If you don't specify it, WhitespaceTokenizerFactory will be used.
https://builds.apache.org/job/Solr-trunk/javadoc/org/apache/solr/analysis/SynonymFilterFactory.html
koji
Re: Problem with synonyms containing whitespace
Posted by "srujan.kommoju" <sr...@gmail.com>.
Thanks for the solution, it's working fine for me.
I had the same configuration but was missing
tokenizerFactory="solr.KeywordTokenizerFactory" in the filter tag. That's
great.
Re: Problem with synonyms containing whitespace
Posted by Bojan Miletic <ex...@gmail.com>.
Thanks guys,
The problem was the default tokenizer that SynonymFilterFactory was using. I
changed the filter definition to
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"
tokenizerFactory="solr.KeywordTokenizerFactory"/>
and now everything is working ^_^.
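
For anyone landing on this thread later, here is a sketch of how the whole lowercase_syn field type might look with that change applied. Every attribute value is taken from the messages in this thread; only the indentation and the comments are added, and it has not been checked against any particular Solr release.

```xml
<!-- Sketch only: the lowercase_syn field type from the first message,
     with the tokenizerFactory attribute added to the synonym filter. -->
<fieldType name="lowercase_syn" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <!-- The whole field value is kept as one token, then lowercased. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- tokenizerFactory tells the filter how to tokenize the entries in
         synonyms.txt. With KeywordTokenizerFactory each comma-separated
         entry stays a single token, spaces included, so it can match the
         single token produced by the field's own tokenizer. -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"
            tokenizerFactory="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the synonym filter runs only in the index analyzer, documents indexed before the change would presumably need to be reindexed for the new synonyms to take effect.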
Re: Problem with synonyms containing whitespace
Posted by Ahmet Arslan <io...@yahoo.com>.
> When testing in analyzer by using solr admin light gets correctly
> recognised as one of the synonyms, but when searching for class 3 (gvw
> 10001 - 14000) analyzer can't find any synonyms.
The QueryParser splits the query string on whitespace before it reaches the analysis phase. You can try querying with quotes, or using the term or raw query parser.
q="class 3 (gvw 10001 - 14000)"
q={!term f=fieldName}class 3 (gvw 10001 - 14000)