Posted to solr-user@lucene.apache.org by "Nemani, Raj" <Ra...@turner.com> on 2011/04/05 18:06:18 UTC

question on solr.ASCIIFoldingFilterFactory

All,

I am using solr.ASCIIFoldingFilterFactory to perform accent-insensitive search.  One of the words that got indexed as part of my indexing process is "después".  Having used the ASCIIFoldingFilterFactory, I expected that a search for "despues" would return the document containing "después", but that was not the case.  I then used Analysis.jsp to analyze "después" and noticed that the ASCIIFoldingFilterFactory folded "después" to "despue".

If I repeat the above exercise for the word "Imágenes", Analysis.jsp tells me that the ASCIIFoldingFilterFactory folded "Imágenes" to "imagen".  Yet I can search for "Imagenes" and get the correct results.

I am not familiar with Spanish, and I find this behavior confusing.  Can anybody please explain it?

Thanks a million in advance

Raj

 


Re: question on solr.ASCIIFoldingFilterFactory

Posted by Ben Davies <be...@gmail.com>.
I can't remember where I read it, but I think MappingCharFilterFactory is
preferred. There is an example in the example schema:

<charFilter class="solr.MappingCharFilterFactory"
            mapping="mapping-ISOLatin1Accent.txt"/>

From this, I get:
org.apache.solr.analysis.MappingCharFilterFactory
{mapping=mapping-ISOLatin1Accent.txt}
|text|despues|
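
For context, here is a minimal sketch of where such a charFilter would sit inside a field type. The field type name "text_folded" and the choice of tokenizer and lowercase filter are assumptions for illustration; the mapping file is the one from the example schema mentioned above and would need to be present in the core's conf directory.

```xml
<!-- Hypothetical field type; the name "text_folded" is illustrative only. -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <!-- An analyzer without a type attribute is used for both indexing and querying. -->
  <analyzer>
    <!-- Char filters run on the raw character stream, before tokenization. -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the mapping is applied before the tokenizer, any stemmer added later in the chain would only ever see unaccented text.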




Re: question on solr.ASCIIFoldingFilterFactory

Posted by Markus Jelsma <ma...@openindex.io>.
It's not the ASCII folding filter but the stemmer that's removing the trailing
characters. You can easily spot this on the analysis page.


RE: question on solr.ASCIIFoldingFilterFactory

Posted by cquezel <cq...@gmail.com>.
lboutros wrote:
> 
> I used Spanish stemming, put the ASCIIFoldingFilterFactory before the
> stemming filter and added it in the query part too.
> 
> Ludovic.
> 

My experiments with the French stemmer do not yield good results with this
order. Applying the ASCIIFoldingFilterFactory before stemming confuses the
language-specific stemmer. For example, folding first leaves the two forms
with different stems:

 "étranglée" => ASCIIFoldingFilterFactory => "etranglee" => FrenchStemmer => "etranglee"
 "étranglé" => ASCIIFoldingFilterFactory => "etrangle" => FrenchStemmer => "etrangl"

Stemming first reduces both forms to the same term:

 "étranglée" => FrenchStemmer => "étrangl" => ASCIIFoldingFilterFactory => "etrangl"
 "étranglé" => FrenchStemmer => "étrangl" => ASCIIFoldingFilterFactory => "etrangl"

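The stemmer-first order could be configured roughly as follows. This is only a sketch: the field type name "text_fr" is hypothetical, and it assumes the Snowball French stemmer, which may not be the stemmer used in the experiments above and may produce different stems.

```xml
<!-- Hypothetical field type; the name "text_fr" is illustrative only. -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <!-- One analyzer without a type attribute serves both indexing and querying. -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The stemmer sees the accented token first ... -->
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    <!-- ... and accents are folded only afterwards. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

One trade-off of this order: a query typed without accents reaches the stemmer unaccented, so it may still be stemmed differently from the accented form that was indexed.
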


--
View this message in context: http://lucene.472066.n3.nabble.com/question-on-solr-ASCIIFoldingFilterFactory-tp2780463p3223314.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: question on solr.ASCIIFoldingFilterFactory

Posted by "Nemani, Raj" <Ra...@turner.com>.
Thank you so much.  I will give this a try.  Thanks again everybody for
your help
Raj



RE: question on solr.ASCIIFoldingFilterFactory

Posted by lboutros <bo...@gmail.com>.
This analyzer seems to work:

[analyzer XML not preserved in the archive]

I used Spanish stemming, put the ASCIIFoldingFilterFactory before the
stemming filter and added it in the query part too.
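
The analyzer XML itself did not survive in the archive. A field type matching the description (Spanish stemming, ASCIIFoldingFilterFactory ahead of the stemmer, and the same folding step at query time) might look like the sketch below. The field type name "text_es" and the surrounding tokenizer and lowercase filter are assumptions, not the configuration from the original message.

```xml
<!-- Hypothetical reconstruction; not the configuration from the original message. -->
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Fold accents before stemming, as described in the message. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Same folding step at query time so both sides produce the same terms. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  </analyzer>
</fieldType>
```

After a change like this, existing documents would need to be reindexed before the new terms show up in search results.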

Ludovic.


-----
Jouve
France.
--
View this message in context: http://lucene.472066.n3.nabble.com/question-on-solr-ASCIIFoldingFilterFactory-tp2780463p2780973.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: question on solr.ASCIIFoldingFilterFactory

Posted by lboutros <bo...@gmail.com>.
Your analyzer contains these two filters:

[filter XML not preserved in the archive]

before:

[filter XML not preserved in the archive]

So two things:

The words you are testing are not English words (are they?), so the English
stemmer will behave strangely on them.
If you really want to remove accents, try putting the
ASCIIFoldingFilterFactory before the two others.

Ludovic.

-----
Jouve
France.
--
View this message in context: http://lucene.472066.n3.nabble.com/question-on-solr-ASCIIFoldingFilterFactory-tp2780463p2780790.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: question on solr.ASCIIFoldingFilterFactory

Posted by "Nemani, Raj" <Ra...@turner.com>.
Here is the field type definition for the ‘text’ field, which is what I am using for the indexed fields.  Can you spot any obvious filter that could be causing the issue?

---------------------------------------------------------------------------

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>

 


 

 


RE: question on solr.ASCIIFoldingFilterFactory

Posted by Steven A Rowe <sa...@syr.edu>.
I added this test method locally to TestASCIIFoldingFilter.java in the Lucene/Solr 3.1.0 source tree, and it passed, so the filter is not the problem (and the Solr factory certainly isn't either; it's just a wrapper). I second Ludovic's question: you must have other filters configured.

  public void testPluralNotTrimmed() throws Exception {
    // Tokenize on whitespace only, then apply ASCII folding and nothing else.
    TokenStream stream = new WhitespaceTokenizer(TEST_VERSION_CURRENT,
        new StringReader("después Imágenes"));
    ASCIIFoldingFilter filter = new ASCIIFoldingFilter(stream);
    CharTermAttribute termAtt = filter.getAttribute(CharTermAttribute.class);

    // Accents are folded; the trailing "s" and "es" are left intact.
    assertTermEquals("despues", filter, termAtt);
    assertTermEquals("Imagenes", filter, termAtt);
  }

Steve


Re: question on solr.ASCIIFoldingFilterFactory

Posted by lboutros <bo...@gmail.com>.
Is there any stemming configured for this field in your schema
configuration file?

Ludovic.



-----
Jouve
France.
--
View this message in context: http://lucene.472066.n3.nabble.com/question-on-solr-ASCIIFoldingFilterFactory-tp2780463p2780509.html
Sent from the Solr - User mailing list archive at Nabble.com.