Posted to java-user@lucene.apache.org by "amigo@max3d.com" <am...@max3d.com> on 2005/01/20 18:58:28 UTC

English and French documents together / analysis, indexing, searching

Greetings everyone

I wonder if there is a solution for analyzing both English and French 
documents using the same analyzer.
The reason is that we have predominantly English documents but there 
are some French ones, yet it all has to go into the same index
and be searchable from the same location during any particular search. 
Is there a way to analyze both types of documents with
the same analyzer (and which one)?

I've looked around and I see there's a Snowball analyzer, but you have to 
specify the language of analysis, and I do not know that
ahead of time during indexing, nor do I know it most of the time during 
searching (users would like to search in both document types).

There's also the issue of letter accents in French words and searching 
for the same (how are they indexed in the first place, even)?
Has anyone dealt with this before and how did you solve the problem?

thanks

-pedja



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: English and French documents together / analysis, indexing, searching

Posted by Bernhard Messer <be...@intrafind.de>.
>> you could try to create a more complex query and expand it into both 
>> languages using different analyzers. Would this solve your problem?
>>
> Would that mean I would have to actually conduct two searches (one in 
> English and one in French), then merge the results and display them to 
> the user?
> It sounds to me like a long way around, so actually writing an 
> analyzer that has a language guesser might be a better solution in 
> the long run?

It's no problem to guess the language based on the document corpus. But 
how do you want to guess the language of a simple term query? What if 
your users are searching for names like "George Bush"? You can't guess 
the language of such a query, so you have to expand it into both 
languages. I don't see an easier way to solve that problem.

>
>>
>> This behaviour is implemented in the StandardTokenizer used by 
>> StandardAnalyzer. Look at the documentation of StandardTokenizer:
>>
>> Many applications have specific tokenizer needs.  If this tokenizer 
>> does not suit your application, please consider copying this source code
>> directory to your project and maintaining your own grammar-based 
>> tokenizer.
>
>
> Hmm, I feel writing my own tokenizer is beyond my abilities at the 
> moment, without more in-depth knowledge of everything else.
> Perhaps I'll try taking the StandardTokenizer and expanding or changing 
> it based on other tokenizers available in Lucene, such as 
> WhitespaceTokenizer.

What about using the WhitespaceAnalyzer directly? Maybe this fits 
your requirements better, and you could use it for both languages.
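As a side note, plain whitespace splitting would also keep identifiers 
like the ABC-1234 ones mentioned elsewhere in this thread intact. A 
minimal plain-Java sketch of the behaviour (no Lucene classes, just the 
analogous split):

```java
import java.util.Arrays;

public class WhitespaceSplit {
    // Splitting on whitespace alone keeps identifiers like ABC-1234
    // intact, unlike grammar-based tokenizers that break on punctuation.
    static String[] tokenize(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("document ABC-1234 revised")));
        // -> [document, ABC-1234, revised]
    }
}
```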

Bernhard




Re: English and French documents together / analysis, indexing, searching

Posted by Otis Gospodnetic <ot...@yahoo.com>.
That would be a partial solution.  Accents will not be a problem any
more, but if you use an Analyzer that stems tokens, they will not really
be stemmed properly.  Searches will probably work, but if you look at
the index you will see that some terms were not analyzed properly.  But
it may be sufficient for your needs, so try just with accent removal.

Otis


--- "amigo@max3d.com" <am...@max3d.com> wrote:

> Morus Walter said the following on 1/21/2005 2:14 AM:
> 
> > No. You could do a ( ( french-query ) or ( english-query ) ) construct 
> > using one query. So query construction would be a bit more complex, but 
> > querying itself wouldn't change.
> >
> > The first thing I'd do in your case would be to look at the differences 
> > in the output of the English and French Snowball stemmers.
> > I don't speak any French, but you might even be able to use both 
> > stemmers on all texts.
> >
> > Morus
> >
> 
> I've done some thinking afterwards, and instead of messing with 
> complex queries, would it make sense to replace all "special" 
> characters such as "é", "è" with "e" during indexing (I suppose 
> write a custom analyzer), and then during searching parse the query 
> and replace all occurrences of special characters (if any) with 
> their normal Latin equivalents?
> 
> This should produce the required results, no? Since the index would 
> not contain any French characters, searching for French words would 
> return them since they were indexed as normal words.
> 
> -pedja
> 
> 
> 




Re: English and French documents together / analysis, indexing, searching

Posted by "amigo@max3d.com" <am...@max3d.com>.
Morus Walter said the following on 1/21/2005 2:14 AM:

> No. You could do a ( ( french-query ) or ( english-query ) ) construct 
> using one query. So query construction would be a bit more complex, but 
> querying itself wouldn't change.
>
> The first thing I'd do in your case would be to look at the differences 
> in the output of the English and French Snowball stemmers.
> I don't speak any French, but you might even be able to use both 
> stemmers on all texts.
>
>Morus
>

I've done some thinking afterwards, and instead of messing with complex 
queries, would it make sense to replace all "special" characters such as 
"é", "è" with "e" during indexing (I suppose write a custom analyzer), 
and then during searching parse the query and replace all occurrences of 
special characters (if any) with their normal Latin equivalents?

This should produce the required results, no? Since the index would not 
contain any French characters, searching for French words would return 
them since they were indexed as normal words.
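That replacement can be sketched in plain Java using Unicode NFD 
decomposition (java.text.Normalizer; a hypothetical custom analyzer 
would wrap the same logic in a token filter and apply it at both index 
and query time):

```java
import java.text.Normalizer;

public class AccentFolder {
    // Decompose characters (é -> e + combining acute accent), then strip
    // the combining diacritical marks (Unicode block U+0300..U+036F).
    public static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("élève"));   // prints "eleve"
        System.out.println(fold("château")); // prints "chateau"
    }
}
```

Applying the same folding to both documents and queries keeps the index 
and the search terms consistent, which is the key requirement here.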

-pedja



Re: English and French documents together / analysis, indexing, searching

Posted by Morus Walter <mo...@tanto.de>.
amigo@max3d.com writes:
> 
> > you could try to create a more complex query and expand it into both 
> > languages using different analyzers. Would this solve your problem?
> >
> Would that mean I would have to actually conduct two searches (one in 
> English and one in French), then merge the results and display them to 
> the user?
No. You could do a ( ( french-query ) or ( english-query ) ) construct using
one query. So query construction would be a bit more complex, but querying
itself wouldn't change.
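A rough sketch of that expansion (the analyzeFrench/analyzeEnglish stubs 
below are hypothetical stand-ins for the French and English Snowball 
analysis chains, hardcoded here only to show the shape of the combined 
query):

```java
import java.util.List;

public class BilingualQuery {
    // Hypothetical stand-ins: a real implementation would run the query
    // text through Lucene's French and English Snowball analyzers.
    static List<String> analyzeFrench(String text)  { return List.of("cherch"); }
    static List<String> analyzeEnglish(String text) { return List.of("search"); }

    // Expand one user query into ( french-terms ) OR ( english-terms ),
    // so a single search covers both analysis variants.
    public static String expand(String userQuery) {
        String fr = String.join(" ", analyzeFrench(userQuery));
        String en = String.join(" ", analyzeEnglish(userQuery));
        return "( " + fr + " ) OR ( " + en + " )";
    }

    public static void main(String[] args) {
        System.out.println(expand("searching"));
        // -> ( cherch ) OR ( search )
    }
}
```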

The first thing I'd do in your case would be to look at the differences
in the output of the English and French Snowball stemmers.
I don't speak any French, but you might even be able to use both stemmers
on all texts.

Morus



Re: English and French documents together / analysis, indexing, searching

Posted by "amigo@max3d.com" <am...@max3d.com>.
> you could try to create a more complex query and expand it into both 
> languages using different analyzers. Would this solve your problem?
>
Would that mean I would have to actually conduct two searches (one in 
English and one in French), then merge the results and display them to 
the user?
It sounds to me like a long way around, so actually writing an 
analyzer that has a language guesser might be a better solution in the 
long run?

>
> This behaviour is implemented in the StandardTokenizer used by 
> StandardAnalyzer. Look at the documentation of StandardTokenizer:
>
> Many applications have specific tokenizer needs.  If this tokenizer 
> does not suit your application, please consider copying this source code
> directory to your project and maintaining your own grammar-based 
> tokenizer.

Hmm, I feel writing my own tokenizer is beyond my abilities at the 
moment, without more in-depth knowledge of everything else.
Perhaps I'll try taking the StandardTokenizer and expanding or changing it 
based on other tokenizers available in Lucene, such as WhitespaceTokenizer.


thanks

-pedja




Re: English and French documents together / analysis, indexing, searching

Posted by Bernhard Messer <be...@intrafind.de>.
> Right now I am using StandardAnalyzer but the results are not what I'd 
> hoped for. Also, since my understanding is that we should use the same 
> analyzer for searching that was used for indexing, even if I can manage 
> to guess the language during indexing and apply the Snowball analyzer, 
> I wouldn't be able to use Snowball for searching, because users want to 
> search through both English and French, and I suppose I would not get 
> the same results if I searched with StandardAnalyzer?

you could try to create a more complex query and expand it into both 
languages using different analyzers. Would this solve your problem?

>
>
> Another problem with StandardAnalyzer is that it breaks up some words 
> that should not be broken (in our case document identifiers such as 
> ABC-1234, etc.), but that's a secondary issue...

This behaviour is implemented in the StandardTokenizer used by 
StandardAnalyzer. Look at the documentation of StandardTokenizer:

Many applications have specific tokenizer needs.  If this tokenizer does
not suit your application, please consider copying this source code
directory to your project and maintaining your own grammar-based tokenizer.

Bernhard



Re: English and French documents together / analysis, indexing, searching

Posted by "amigo@max3d.com" <am...@max3d.com>.
Right now I am using StandardAnalyzer but the results are not what I'd 
hoped for. Also, since my understanding is that we should use the same 
analyzer for searching that was used for indexing, even if I can manage 
to guess the language during indexing and apply the Snowball analyzer, 
I wouldn't be able to use Snowball for searching, because users want to 
search through both English and French, and I suppose I would not get 
the same results if I searched with StandardAnalyzer?

Another problem with StandardAnalyzer is that it breaks up some words 
that should not be broken (in our case document identifiers such as 
ABC-1234, etc.), but that's a secondary issue...


thanks

-pedja




Bernhard Messer said the following on 1/20/2005 1:05 PM:

> I think the easiest way is to use Lucene's StandardAnalyzer. If you 
> want to use the Snowball stemmers, you have to add a language guesser 
> to get the language of each particular document before creating the 
> analyzer.
>
> regards
> Bernhard
>
> amigo@max3d.com wrote:
>
>> Greetings everyone
>>
>> I wonder if there is a solution for analyzing both English and French 
>> documents using the same analyzer.
>> The reason is that we have predominantly English documents but 
>> there are some French ones, yet it all has to go into the same index
>> and be searchable from the same location during any particular 
>> search. Is there a way to analyze both types of documents with
>> the same analyzer (and which one)?
>>
>> I've looked around and I see there's a Snowball analyzer, but you have 
>> to specify the language of analysis, and I do not know that
>> ahead of time during indexing, nor do I know it most of the time 
>> during searching (users would like to search in both document types).
>>
>> There's also the issue of letter accents in French words and 
>> searching for the same (how are they indexed in the first place, even)?
>> Has anyone dealt with this before and how did you solve the problem?
>>
>> thanks
>>
>> -pedja
>>
>>
>>



Re: English and French documents together / analysis, indexing, searching

Posted by Bernhard Messer <be...@intrafind.de>.
I think the easiest way is to use Lucene's StandardAnalyzer. If you 
want to use the Snowball stemmers, you have to add a language guesser to 
get the language of each particular document before creating the analyzer.
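A toy illustration of such a guesser, using tiny stopword lists (the 
word lists here are an assumption for illustration; a real guesser 
would use character n-gram statistics over the document corpus):

```java
import java.util.Arrays;
import java.util.List;

public class LanguageGuesser {
    // Toy stopword lists, purely illustrative; a production guesser
    // would train n-gram frequency profiles per language instead.
    private static final List<String> FR =
            Arrays.asList("le", "la", "les", "et", "est", "une", "dans");
    private static final List<String> EN =
            Arrays.asList("the", "and", "is", "a", "in", "of", "to");

    // Count stopword hits per language and pick the higher score,
    // defaulting to English (the predominant language in this index).
    public static String guess(String text) {
        int fr = 0, en = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (FR.contains(token)) fr++;
            if (EN.contains(token)) en++;
        }
        return fr > en ? "French" : "English";
    }

    public static void main(String[] args) {
        System.out.println(guess("le chat est dans la maison")); // prints "French"
        System.out.println(guess("the cat is in the house"));    // prints "English"
    }
}
```

The guessed language would then select which Snowball analyzer to 
construct for that document before indexing it.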

regards
Bernhard

amigo@max3d.com wrote:

> Greetings everyone
>
> I wonder if there is a solution for analyzing both English and French 
> documents using the same analyzer.
> The reason is that we have predominantly English documents but there 
> are some French ones, yet it all has to go into the same index
> and be searchable from the same location during any particular search. 
> Is there a way to analyze both types of documents with
> the same analyzer (and which one)?
>
> I've looked around and I see there's a Snowball analyzer, but you have 
> to specify the language of analysis, and I do not know that
> ahead of time during indexing, nor do I know it most of the time during 
> searching (users would like to search in both document types).
>
> There's also the issue of letter accents in French words and searching 
> for the same (how are they indexed in the first place, even)?
> Has anyone dealt with this before and how did you solve the problem?
>
> thanks
>
> -pedja
>
>
>

