Posted to solr-user@lucene.apache.org by Yann PICHOT <yp...@gmail.com> on 2010/02/05 15:41:18 UTC

Use of solr.ASCIIFoldingFilterFactory

Hi,

I have defined this type in my schema.xml file:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Field definitions:

  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="idProd" type="string" indexed="false" stored="false" required="false" />
    <field name="description" type="text" indexed="true" stored="true" required="false" />
    <field name="artiste" type="text" indexed="true" stored="true" required="false" />
    <field name="collection" type="text" indexed="true" stored="true" required="false" />
    <field name="titre" type="text" indexed="true" stored="true" required="false" />
    <field name="all" type="text" indexed="true" stored="true" required="false" />
  </fields>

  <copyField source="description" dest="all"/>
  <copyField source="collection" dest="all"/>
  <copyField source="artiste" dest="all"/>
  <copyField source="titre" dest="all"/>

I imported my documents with DataImportHandler (my original documents are
in an RDBMS).

I ran this query string in the Solr web application: all:chateau
Results (content of the field "all"):
  CHATEAU D'AMBOISE
  [CHATEAU EN FRANCE, BABELON]
  ope dvd rene chateau
  CHATEAU DE LA LOIRE
  DE CHATEAU EN CHATEAU ENTRE LA LOIRE ET LE CHER
  [LE CHATEAU AMBULANT, HAYAO MIYAZAKI]
  [Chambres d'hôtes au château, Moreau]
  [ARCHIMEDE, LA VIE DE CHATEAU, KRAHENBUHL]
  [NEUF, NAISSANCE D UN CHATEAU FORT, MACAULAY]
  [ARCHIMEDE, LA VIE DE CHATEAU, KRAHENBUHL]

Now I try this query string: all:château
No results :(

I don't understand. I expected the second query to return the same results as
the first, but it does not.

I use Solr 1.4 (Solr Implementation Version: 1.4.0 833479 - grantingersoll -
2009-11-06 12:33:40).
Java (32-bit): Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
OS: Windows 7 (64-bit)

Regards,
-- 
Yann

Re: Use of solr.ASCIIFoldingFilterFactory

Posted by Yann PICHOT <yp...@gmail.com>.
Hello,

Thanks, your response solved my problem.

Thanks for everything,

On Sun, Feb 7, 2010 at 4:00 PM, Sven Maurmann <sv...@kippdata.de> wrote:

> Hi,
>
> you might have run into an encoding problem. If you use Tomcat as
> the container for Solr you should probably consult the following
>
>
>  http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
>
> Cheers,
>    Sven
>
>
>


-- 
Yann

Re: Use of solr.ASCIIFoldingFilterFactory

Posted by Sven Maurmann <sv...@kippdata.de>.
Hi,

you might have run into an encoding problem. If you use Tomcat as
the container for Solr you should probably consult the following

   http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config

Cheers,
     Sven



RE: Use of solr.ASCIIFoldingFilterFactory

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Yann,

I'm pretty sure that this is a character encoding problem.

You wrote:
> I ran some other tests. I typed the URL directly into my browser address bar:
> http://localhost:8080/solr/select/?q=all:château<http://localhost:8080/solr/select/?q=all:ch%C3%A2teau>
> and I get results! The URL then becomes:
> http://localhost:8080/solr/select/?q=all:ch%E2teau
> The character â is replaced by %E2. I use Firefox 3.6.

Do you get hits with the first form, where â is replaced by %C3%A2?

SolrJ will encode â in UTF-8, which is hex encoded as %C3%A2, not %E2, which is the Latin-1 encoding (aka ISO-8859-1; windows-1252 is almost the same).

Are you using Tomcat?  You need to change the configuration in order for Tomcat to properly handle UTF-8-hex-encoded queries - see http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config

Steve
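
The difference between the two escaped forms can be reproduced with the JDK alone. The following is a minimal sketch, not taken from the thread: the class name QueryEncodingSketch and its helper methods are hypothetical, and it only assumes that the client escapes the term as UTF-8 while the server decodes the request URI as ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only (not part of Solr or SolrJ).
class QueryEncodingSketch {
    // Percent-encode a query term using the named charset.
    static String encode(String term, String charset) {
        try {
            return URLEncoder.encode(term, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Decode percent-escapes using the named charset.
    static String decode(String escaped, String charset) {
        try {
            return URLDecoder.decode(escaped, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // "château", written with a Unicode escape so the source file encoding does not matter.
        String term = "ch\u00e2teau";

        String utf8 = encode(term, "UTF-8");
        String latin1 = encode(term, "ISO-8859-1");
        System.out.println(utf8);    // ch%C3%A2teau
        System.out.println(latin1);  // ch%E2teau

        // Assumption: the server decodes the UTF-8 escapes as ISO-8859-1.
        // The two bytes C3 A2 then become two separate characters.
        String misread = decode(utf8, "ISO-8859-1");
        System.out.println(misread.equals(term));  // false
        System.out.println(misread.length());      // 8, one more than the 7 characters of the original
    }
}
```

With a mismatch like this, the term that reaches the query analyzer is no longer château, so folding it cannot produce chateau.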


Re: Use of solr.ASCIIFoldingFilterFactory

Posted by Yann PICHOT <yp...@gmail.com>.
On Fri, Feb 5, 2010 at 4:53 PM, Ahmet Arslan <io...@yahoo.com> wrote:

> > Just for your information: since you are using
> > whitespacetokenizer château won't retrieve documents
> > containing (comma) château,
>
> That's the problem. I just noticed that the matched (multivalued) field all
> contains chateau, which is why all:chateau matches. It also has château,
> (with a comma), so all:château does not match. q=all:château, will probably
> return that document. StandardTokenizerFactory is a better choice in your
> case.
>
>
>
>
I switched to StandardTokenizerFactory. No change. I don't think the problem
is the comma. The comma is added by Solr. In the XML result no comma appears:
<doc>
<arr name="all">
<str>Tout voir, tout savoir</str>
<str>Le château féodal</str>
<str>Harris</str>
<str>Bruce</str>
<str>Dennis</str>
</arr>
<arr name="artiste">
<str>Harris</str>
<str>Bruce</str>
<str>Dennis</str>
</arr>
<arr name="collection">
<str>Tout voir, tout savoir</str>
</arr>
<arr name="description">
<str>Le château féodal</str>
</arr>
<arr name="id">
<str>84907</str>
</arr>
</doc>

I ran some other tests. I typed the URL directly into my browser address bar:
http://localhost:8080/solr/select/?q=all:château<http://localhost:8080/solr/select/?q=all:ch%C3%A2teau>
and I get results! The URL then becomes:
http://localhost:8080/solr/select/?q=all:ch%E2teau
The character â is replaced by %E2. I use Firefox 3.6.

I tried it with IE: same result.

Then I tested with SolrJ. Java code:

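    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    SolrServer server = getServer();  // my own helper (not shown)
    SolrQuery query = new SolrQuery();
    query.setQuery( "all:château" );
    QueryResponse rsp = server.query( query );
    SolrDocumentList docs = rsp.getResults();
    // Print the "all" field of every matching document.
    for (SolrDocument solrDocument : docs)
    {
      System.out.println("  " + solrDocument.getFieldValue("all"));
    }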

No result ...

-- 
Yann

Re: Use of solr.ASCIIFoldingFilterFactory

Posted by Ahmet Arslan <io...@yahoo.com>.
> Just for your information: since you are using
> whitespacetokenizer château won't retrieve documents
> containing (comma) château, 

That's the problem. I just noticed that the matched (multivalued) field all contains chateau, which is why all:chateau matches. It also has château, (with a comma), so all:château does not match. q=all:château, will probably return that document. StandardTokenizerFactory is a better choice in your case.
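
The tokenizer point can be sketched with plain string splitting. This is only an illustration of the hypothesis in this message, not Lucene's actual tokenizer code: the class TokenizerSketch and both methods are hypothetical stand-ins, and the sample value is copied from the results listed earlier in the thread. Whether the indexed values really contain a comma is a separate question.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; these are not the Lucene tokenizers.
class TokenizerSketch {
    // Rough equivalent of splitting on whitespace only: punctuation stays attached.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Rough equivalent of a punctuation-aware tokenizer: commas also separate tokens.
    static List<String> commaAwareTokens(String text) {
        return Arrays.asList(text.trim().split("[\\s,]+"));
    }

    public static void main(String[] args) {
        // "Chambres d'hôtes au château, Moreau", with escapes to avoid source encoding issues.
        String value = "Chambres d'h\u00f4tes au ch\u00e2teau, Moreau";
        String term = "ch\u00e2teau";

        System.out.println(whitespaceTokens(value).contains(term));        // false
        System.out.println(whitespaceTokens(value).contains(term + ","));  // true
        System.out.println(commaAwareTokens(value).contains(term));        // true
    }
}
```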


      

Re: Use of solr.ASCIIFoldingFilterFactory

Posted by Ahmet Arslan <io...@yahoo.com>.
> château is reduced to chateau. I tested it on
> /admin/analysis.jsp, result:
> Index Analyzer  château  chateau  chateau
> Query Analyzer  château  chateau  chateau

Strange. If château is reduced to chateau, then both words should return the same set of documents. Maybe you added that filter to your index analyzer after indexing. Can you delete your index, restart Tomcat, and run a full-import?

Also, what is the output of q=all:château&debugQuery=on ?

Just for your information: since you are using WhitespaceTokenizerFactory, château won't retrieve documents containing château, (with a trailing comma).


      

Re: Use of solr.ASCIIFoldingFilterFactory

Posted by Yann PICHOT <yp...@gmail.com>.
On Fri, Feb 5, 2010 at 4:00 PM, Ahmet Arslan <io...@yahoo.com> wrote:

> > I test query this query string  on SOLR web
> > application : all:chateau.
> > Results (content of the field "all")  :
> >   CHATEAU D'AMBOISE
> >   [CHATEAU EN FRANCE, BABELON]
> >   ope dvd rene chateau
> >   CHATEAU DE LA LOIRE
> >   DE CHATEAU EN CHATEAU ENTRE LA LOIRE ET LE CHER
> >   [LE CHATEAU AMBULANT, HAYAO MIYAZAKI]
> >   [Chambres d'hôtes au château, Moreau]
> >   [ARCHIMEDE, LA VIE DE CHATEAU, KRAHENBUHL]
> >   [NEUF, NAISSANCE D UN CHATEAU FORT, MACAULAY]
> >   [ARCHIMEDE, LA VIE DE CHATEAU, KRAHENBUHL]
> >
> > Now i try this query string : all:château.
> > No result :(
> >
> > I don't understand. I think the second query respond the
> > same result of the
> > first query but it is not the case.
>
> Probably château isn't reduced to chateau. You can confirm this on
> /admin/analysis.jsp.
>
> If that's the case, you can use:
>
> <charFilter class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>
>
> If mapping-ISOLatin1Accent.txt does not contain â, you can easily add this
> entry to it. â => a
>
>
château is reduced to chateau. I tested it on /admin/analysis.jsp, result:
Index Analyzer  château  chateau  chateau
Query Analyzer  château  chateau  chateau


-- 
Yann

Re: Use of solr.ASCIIFoldingFilterFactory

Posted by Ahmet Arslan <io...@yahoo.com>.
> I test query this query string  on SOLR web
> application : all:chateau.
> Results (content of the field "all")  :
>   CHATEAU D'AMBOISE
>   [CHATEAU EN FRANCE, BABELON]
>   ope dvd rene chateau
>   CHATEAU DE LA LOIRE
>   DE CHATEAU EN CHATEAU ENTRE LA LOIRE ET LE CHER
>   [LE CHATEAU AMBULANT, HAYAO MIYAZAKI]
>   [Chambres d'hôtes au château, Moreau]
>   [ARCHIMEDE, LA VIE DE CHATEAU, KRAHENBUHL]
>   [NEUF, NAISSANCE D UN CHATEAU FORT, MACAULAY]
>   [ARCHIMEDE, LA VIE DE CHATEAU, KRAHENBUHL]
> 
> Now i try this query string : all:château.
> No result :(
> 
> I don't understand. I think the second query respond the
> same result of the
> first query but it is not the case.

Probably château isn't reduced to chateau. You can confirm this on /admin/analysis.jsp.

If that's the case, you can use:

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

If mapping-ISOLatin1Accent.txt does not contain â, you can easily add an entry for it: â => a
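
To see what such a mapping is expected to achieve, here is a minimal sketch using only the JDK. It approximates accent folding with java.text.Normalizer and is not how Solr's MappingCharFilterFactory or ASCIIFoldingFilterFactory is implemented; the class name AccentFoldSketch and its fold method are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical approximation of accent folding; not the Solr/Lucene implementation.
class AccentFoldSketch {
    // Decompose accented letters (NFD), then drop the combining marks.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Indexed text and query term, both folded and lowercased as in the "text" field type.
        String indexed = fold("CHATEAU").toLowerCase(Locale.ROOT);
        String queried = fold("ch\u00e2teau").toLowerCase(Locale.ROOT);

        System.out.println(indexed);                  // chateau
        System.out.println(queried);                  // chateau
        System.out.println(indexed.equals(queried));  // true
    }
}
```

If both sides fold to the same term and the query still returns nothing, the term was probably altered before it reached the analyzer.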