Posted to java-user@lucene.apache.org by Clemens Wyss <cl...@mysign.ch> on 2011/04/21 17:02:45 UTC

"Umlaute" getting lost

I keep my search terms in a dedicated RAMDirectory (the termIndex).
In there I place all the terms of my real index. When putting the terms into the
termIndex I can still see [using the debugger] the Umlaute (äöü). Unfortunately, when searching the
termIndex, the documents no longer contain these Umlaute.

Populating the termIndex:
termIndex = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig( Version.LUCENE_31, new TermAnalyzer( locale ) );
termIndexWriter = new IndexWriter( termIndex, config );
TermEnum tEnum = realIndexReader.terms();
while ( tEnum.next() )
{
	Term t = tEnum.term();
	String termText = t.text();
	Document termDocument = new Document();
	Field field = new Field( FIELDNAME_TERM, termText, Field.Store.YES, Field.Index.ANALYZED );
	termDocument.add( field );
	// and add term into the index
	termIndexWriter.addDocument( termDocument );
}
termIndexWriter.commit();
termIndexWriter.optimize();
termIndexWriter.close();

termIndexReader = IndexReader.open( termIndex, true );
---------- searching terms
Query q = fuzzy ? new FuzzyQuery( new Term( FIELDNAME_TERM, termFilter.toLowerCase() ) ) :
					new WildcardQuery( new Term( FIELDNAME_TERM, "*" + termFilter.toLowerCase() + "*" ) );
TopDocs topDocs = new IndexSearcher( getTermIndexReader() ).search( q, 100 );				
for ( ScoreDoc hit : topDocs.scoreDocs )
{
	Document doc = getTermIndexReader().document( hit.doc );
	String indexTerm = doc.get( FIELDNAME_TERM );
	if ( !returnValue.contains( indexTerm  ) )
	{
		returnValue.add( indexTerm );
	}
}
----------
The TermAnalyzer is the same analyzer as the main index analyzer, except that a LowerCaseFilter is applied.
I have unit tests for my Umlaute which work as expected. 
Unfortunately this is not the case when I debug my real app...
What could possibly cause the "loss"?
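
One way to narrow this down is to print the code points of a term right before it is added to the termIndex and again after it is read back from a hit. The sketch below is plain Java and makes no Lucene calls. The class and method names are made up for illustration, and the lossy ASCII round trip only simulates one possible cause (a charset mismatch somewhere in the real app), which nothing in this post confirms.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the application described above.
class UmlautProbe {

    // Renders every code point as U+XXXX so that a precomposed umlaut,
    // a base letter plus combining mark, and a '?' substitute can be told apart.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Simulates a lossy step: US-ASCII cannot encode umlauts, so
    // String.getBytes substitutes the charset's replacement byte ('?').
    static String asciiRoundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String term = "\u00e4\u00f6\u00fc"; // the three umlauts from the post
        System.out.println(codePoints(term));
        System.out.println(codePoints(asciiRoundTrip(term)));
    }
}
```

Calling codePoints on termText inside the indexing loop and on indexTerm inside the search loop would show at which step the characters change. If U+003F shows up where U+00E4 was expected, that would point toward an encode/decode step rather than the token filters.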

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: "Umlaute" getting lost

Posted by Clemens Wyss <cl...@mysign.ch>.

TermAnalyzer#tokenStream( final String fieldName, final Reader reader )
------------------------------------------------------------------------------------------
TokenStream t = new WhitespaceAnalyzer( Version.LUCENE_31 ).tokenStream( fieldName, cf);
t = new StopFilter( Version.LUCENE_31, t, stopWordSet, true );
t = new ShingleAnalyzerWrapper( t, 4 ).tokenStream( fieldName, reader );
t = new LowerCaseFilter( Version.LUCENE_31, t );
return t;
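
The chain above ends in a LowerCaseFilter, and the search code calls termFilter.toLowerCase(). Lowercasing on its own should not strip umlauts, and a quick plain-Java check, independent of Lucene, can confirm that for the JVM the real app runs on. The class name and sample words below are made up for illustration.

```java
import java.util.Locale;

// Hypothetical stand-alone check; it does not use any Lucene classes.
class LowerCaseCheck {

    // Locale-independent lowercasing of a whole string.
    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Lowercases one code point at a time via Character.toLowerCase(int).
    static String lowerPerCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "\u00c4RGER \u00d6L \u00dcBUNG"; // upper-case umlauts
        // Both results should still contain the lower-case umlauts.
        System.out.println(lower(sample));
        System.out.println(lowerPerCodePoint(sample));
    }
}
```

If both variants keep the umlauts, the lowercasing steps are unlikely to be the cause, and the remaining parts of the chain (including whatever cf refers to) are the next thing to inspect.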

Thx
Clemens

> -----Original Message-----
> From: Simon Willnauer [mailto:simon.willnauer@googlemail.com]
> Sent: Monday, 25 April 2011 12:13
> To: java-user@lucene.apache.org
> Subject: Re: "Umlaute" getting lost
> 
> On Sun, Apr 24, 2011 at 8:30 AM, Grant Ingersoll <gs...@apache.org>
> wrote:
> >
> > On Apr 21, 2011, at 5:02 PM, Clemens Wyss wrote:
> >
> >> The TermAnalyzer is the same analyzer as the main index analyzer,
> >> except that a LowerCaseFilter is applied.
> >
> > What is the Analyzer for the main index? Which tokenizer and token filters are used?
> 
> In other words, can you show what TermAnalyzer is composed of?
> 
> 
> simon
> >
> > Out of curiosity, what is the problem you are trying to solve?
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/

Re: "Umlaute" getting lost

Posted by Simon Willnauer <si...@googlemail.com>.

On Sun, Apr 24, 2011 at 8:30 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Apr 21, 2011, at 5:02 PM, Clemens Wyss wrote:
>
>> The TermAnalyzer is the same analyzer as the main index analyzer, except that a LowerCaseFilter is applied.
>
> What is the Analyzer for the main index? Which tokenizer and token filters are used?

In other words, can you show what TermAnalyzer is composed of?


simon
>
> Out of curiosity, what is the problem you are trying to solve?
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/

Re: "Umlaute" getting lost

Posted by Clemens Wyss <cl...@mysign.ch>.

> Out of curiosity, what is the problem you are trying to solve?
I am trying to provide suggestions for search terms/words, as Google does. When the user starts typing a search term, I look up my termIndex to provide possible search terms that match the characters typed so far.
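
For comparison, the lookup described here can be sketched without Lucene at all. The following is a minimal sketch, assuming that prefix matching on lower-cased terms is enough; it is not the termIndex implementation from this thread, it does not cover the infix (leading "*") or fuzzy matching used there, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical in-memory suggester backed by a sorted set of terms.
class TermSuggester {

    private final NavigableSet<String> terms = new TreeSet<>();

    // Stores the term lower-cased so that lookups are case-insensitive.
    void add(String term) {
        terms.add(term.toLowerCase(Locale.ROOT));
    }

    // Returns up to max terms that start with the given prefix, in sorted order.
    List<String> suggest(String prefix, int max) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        // All terms with prefix p are contiguous in the sorted set, starting at p.
        for (String t : terms.tailSet(p, true)) {
            if (result.size() >= max || !t.startsWith(p)) {
                break;
            }
            result.add(t);
        }
        return result;
    }

    public static void main(String[] args) {
        TermSuggester s = new TermSuggester();
        s.add("\u00dcberblick");
        s.add("\u00fcber");
        s.add("\u00fcbung");
        s.add("ufer");
        // Prints the terms starting with "üb"; "ufer" is not among them.
        System.out.println(s.suggest("\u00fcb", 10));
    }
}
```

Because Java strings are Unicode throughout, the umlauts survive unchanged in this sketch, which can make it a useful baseline when comparing against what comes back from the termIndex.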

Thx
Clemens

> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Sunday, 24 April 2011 08:30
> To: java-user@lucene.apache.org
> Subject: Re: "Umlaute" getting lost
> 
> 
> On Apr 21, 2011, at 5:02 PM, Clemens Wyss wrote:
> 
> > The TermAnalyzer is the same analyzer as the main index analyzer,
> > except that a LowerCaseFilter is applied.
> 
> What is the Analyzer for the main index? Which tokenizer and token filters are used?
> 
> Out of curiosity, what is the problem you are trying to solve?
> 
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/

Re: "Umlaute" getting lost

Posted by Grant Ingersoll <gs...@apache.org>.

On Apr 21, 2011, at 5:02 PM, Clemens Wyss wrote:

> The TermAnalyzer is the same analyzer as the main index analyzer, except that a LowerCaseFilter is applied.

What is the Analyzer for the main index? Which tokenizer and token filters are used?

Out of curiosity, what is the problem you are trying to solve?

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

