Posted to java-user@lucene.apache.org by Dora <ju...@gmail.com> on 2008/11/25 16:21:30 UTC

Re: Indexing accented characters, then searching by any form

Karl Wettin wrote:
> 
> Try this (dry coded) snippet instead:
> 
> StandardAnalyzer objAnalyzer = new StandardAnalyzer() {
>    public TokenStream tokenStream(String fieldName, Reader reader) {
>      return new ISOLatin1AccentFilter(super.tokenStream(fieldName, reader));
>    }
> };
> 

I tried this, but it does not work as expected.

I am using a utility class with a static method that gives me an analyzer:

public static Analyzer getAnalyzer()
{
	// Wrap StandardAnalyzer's token stream in the accent-stripping filter.
	StandardAnalyzer objAnalyzer = new StandardAnalyzer() {
		public TokenStream tokenStream(String fieldName, Reader reader) {
			return new ISOLatin1AccentFilter(super.tokenStream(fieldName, reader));
		}
	};
	return objAnalyzer;
}

So when I need the analyzer (for indexing or searching) I call
UtilityClass.getAnalyzer().

It works for my query parser: the accents are correctly removed when
performing the search.
If my index contains "cafe", searching for "café" will find the documents
containing "cafe".

But when I explore my index with Luke, I can see that the indexer does not use
the ISOLatin1AccentFilter (I tested with a breakpoint in the overridden
tokenStream method): if the document contains "café", the index will
contain "café".

As a consequence, searching for an accented word is not possible: the index
contains the accent, while the search process removes it.

So my index contains "café", but when I search for "café" the filter changes
the query to "cafe" and I get no hits...
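
The mismatch described above can be reproduced without Lucene. The following is a minimal sketch in plain Java, assuming a toy in-memory index: the class AccentFoldingDemo and its methods are hypothetical names, and java.text.Normalizer stands in for ISOLatin1AccentFilter (it is not the same implementation). It only illustrates that a term stored with its accent becomes unreachable once the query side strips accents, and that folding on both sides restores the hit.

```java
import java.text.Normalizer;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy inverted index for illustration only; this is not Lucene code.
class AccentFoldingDemo {
    // Term -> ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final boolean foldAtIndexTime;

    AccentFoldingDemo(boolean foldAtIndexTime) {
        this.foldAtIndexTime = foldAtIndexTime;
    }

    // Stand-in for the accent filter: decompose (NFD), then drop combining marks.
    static String fold(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Index side: folds only when the flag is set.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            String term = foldAtIndexTime ? fold(token) : token;
            postings.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
        }
    }

    // Query side: always folds, as the query parser does in the situation described.
    Set<Integer> search(String word) {
        String term = fold(word.toLowerCase(Locale.ROOT));
        return postings.getOrDefault(term, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        String text = "un caf\u00e9 noir"; // \u00e9 is the accented e

        AccentFoldingDemo broken = new AccentFoldingDemo(false);
        broken.add(1, text);
        // The index holds the accented term, the query is folded: no hit.
        System.out.println("index not folded: " + broken.search("caf\u00e9"));

        AccentFoldingDemo fixed = new AccentFoldingDemo(true);
        fixed.add(1, text);
        // Both sides fold: the accented and unaccented queries both match.
        System.out.println("index folded:     " + fixed.search("caf\u00e9"));
        System.out.println("index folded:     " + fixed.search("cafe"));
    }
}
```

The point of the sketch is that whatever normalization is applied, it has to be applied identically when indexing and when searching.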

Any clue why my filter is not used at indexing time?

-- 
View this message in context: http://www.nabble.com/Indexing-accented-characters%2C-then-searching-by-any-form-tp15412778p20682548.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Indexing accented characters, then searching by any form

Posted by Dora <ju...@gmail.com>.

It seems that indexing and searching do not go through the analyzer in the
same way:

The "tokenStream" method is called at search time, while at indexing time
"reusableTokenStream" is called instead.

Overriding reusableTokenStream (like I did for tokenStream) fixed the
problem.
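
Why overriding only tokenStream is not enough can be modelled in a few lines of plain Java. This is a sketch under stated assumptions, not Lucene's implementation: BaseAnalyzer, SearchOnlyFoldingAnalyzer, FoldingAnalyzer and TwoEntryPointsDemo are hypothetical classes, token streams are reduced to lists of strings, and java.text.Normalizer stands in for ISOLatin1AccentFilter. The only behaviour taken from this thread is that searching goes through tokenStream while indexing goes through reusableTokenStream.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TwoEntryPointsDemo {
    public static void main(String[] args) {
        String text = "Caf\u00e9"; // \u00e9 is the accented e

        BaseAnalyzer first = new SearchOnlyFoldingAnalyzer();
        System.out.println("search path:   " + first.tokenStream(text));         // accent removed
        System.out.println("indexing path: " + first.reusableTokenStream(text)); // accent kept

        BaseAnalyzer fixed = new FoldingAnalyzer();
        System.out.println("indexing path: " + fixed.reusableTokenStream(text)); // accent removed
    }
}

// Simplified stand-in for an analyzer with two independent entry points (not Lucene's API).
class BaseAnalyzer {
    // Shared tokenizer: lowercase, then split on whitespace.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for the accent filter: decompose (NFD), then drop combining marks.
    static List<String> fold(List<String> tokens) {
        List<String> folded = new ArrayList<>();
        for (String t : tokens) {
            folded.add(Normalizer.normalize(t, Normalizer.Form.NFD).replaceAll("\\p{M}", ""));
        }
        return folded;
    }

    // Entry point used on the search side in this model.
    List<String> tokenStream(String text) {
        return split(text);
    }

    // Entry point used on the indexing side; it does not call tokenStream.
    List<String> reusableTokenStream(String text) {
        return split(text);
    }
}

// Mirrors the first attempt: only tokenStream is wrapped with the filter.
class SearchOnlyFoldingAnalyzer extends BaseAnalyzer {
    @Override
    List<String> tokenStream(String text) {
        return fold(super.tokenStream(text));
    }
}

// Mirrors the fix: the indexing entry point is wrapped as well.
class FoldingAnalyzer extends SearchOnlyFoldingAnalyzer {
    @Override
    List<String> reusableTokenStream(String text) {
        return fold(super.reusableTokenStream(text));
    }
}
```

Because the two entry points are independent in this model, a filter added to one of them is silently skipped by any caller that uses the other.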

RE: Indexing accented characters, then searching by any form

Posted by Dora <ju...@gmail.com>.

Diego Cassinera wrote:
> 
> Are you sure you are creating the fields with Field.Index.ANALYZED ?
> 
> 

Yes, my fields are all ANALYZED. (One was ANALYZED_NO_NORMS, but changing it
to ANALYZED did not solve the problem.)

I checked with the debugger, and the analyzer I use to update my index
does contain my ISOLatin1AccentFilter.

It looks like the IndexWriter does not go through the tokenStream method.
Maybe this is because I perform an updateDocument() instead of an
addDocument()?

Here is how I index a document:
m_analyzer is the Analyzer returned by my getAnalyzer() method;
field and fieldValue form the "key" of my document (a unique ID).

IndexWriter luceneIndexWriter = new IndexWriter(m_indexDir, m_analyzer,
		IndexWriter.MaxFieldLength.UNLIMITED);
luceneIndexWriter.updateDocument(new Term(field, fieldValue), luceneDocument);


RE: Indexing accented characters, then searching by any form

Posted by Diego Cassinera <di...@mercadolibre.com>.

Are you sure you are creating the fields with Field.Index.ANALYZED ?

-----Original Message-----
From: Dora [mailto:julien.barret@gmail.com]
Sent: Tuesday, November 25, 2008 12:22 PM
To: java-user@lucene.apache.org
Subject: Re: Indexing accented characters, then searching by any form
