Posted to java-user@lucene.apache.org by "amigo@max3d.com" <am...@max3d.com> on 2005/01/20 18:58:28 UTC
English and French documents together / analysis, indexing, searching
Greetings everyone
I wonder whether there is a solution for analyzing both English and French
documents using the same analyzer.
The reason is that we have predominantly English documents, but there
are some French ones, yet they all have to go into the same index
and be searchable from the same location during any particular search.
Is there a way to analyze both types of documents with
the same analyzer (and which one)?
I've looked around and I see there's a Snowball analyzer, but you have to
specify the language of analysis, and I do not know that
ahead of time during indexing, nor do I know it most of the time during
searching (users would like to search across both document types).
There's also the issue of letter accents in French words and searching
for them (how are they indexed in the first place?).
Has anyone dealt with this before, and how did you solve the problem?
thanks
-pedja
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: English and French documents together / analysis, indexing, searching
Posted by Bernhard Messer <be...@intrafind.de>.
>> you could try to create a more complex query and expand it into both
>> languages using different analyzers. Would this solve your problem ?
>>
> Would that mean I would have to actually conduct two searches (one in
> English and one in French), then merge the results and display them to
> the user?
> It sounds to me like a long way around, so would actually writing an
> analyzer that has a language guesser be a better solution in the
> long run?
It's no problem to guess the language based on the document corpus. But
how do you want to guess the language of a simple term query? What if
your users are searching for names like "George Bush"? You can't guess
the language of such a query, so you have to expand it into both
languages. I don't see an easier way to solve that problem.
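The expansion described here can be illustrated outside Lucene as plain
query-string rewriting. In a real application you would analyze the user's
term once with each language's analyzer (e.g. the Snowball stemmers) and OR
the results together in one BooleanQuery; the two analyze methods below are
crude stand-ins for real analyzers, not Lucene code:

```java
public class QueryExpansion {

    // Stand-in for analyzing a term with an English analyzer.
    // A real implementation would run the term through a Snowball stemmer.
    static String analyzeEnglish(String term) {
        // crude suffix stripping as a placeholder for a real stemmer
        return term.toLowerCase().replaceAll("(ing|ed|s)$", "");
    }

    // Stand-in for analyzing a term with a French analyzer.
    static String analyzeFrench(String term) {
        // crude accent removal as a placeholder for a real French analyzer
        return java.text.Normalizer
                .normalize(term.toLowerCase(), java.text.Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
    }

    // Expand a single-term query into ( english-form OR french-form ).
    static String expand(String term) {
        String en = analyzeEnglish(term);
        String fr = analyzeFrench(term);
        if (en.equals(fr)) {
            return en; // both analyzers agree; no expansion needed
        }
        return "(" + en + " OR " + fr + ")";
    }

    public static void main(String[] args) {
        System.out.println(expand("marches")); // (marche OR marches)
        System.out.println(expand("chat"));    // chat
    }
}
```

Only the query construction grows; the search itself is still a single call
against the one shared index.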
>
>>
>> This behaviour is implemented in the StandardTokenizer used by
>> StandardAnalyzer. Look at the documentation of StandardTokenizer:
>>
>> Many applications have specific tokenizer needs. If this tokenizer
>> does not suit your application, please consider copying this source code
>> directory to your project and maintaining your own grammar-based
>> tokenizer.
>
>
> Hmm, I feel writing my own tokenizer is beyond my abilities at the
> moment, without more in-depth knowledge of everything else.
> Perhaps I'll try taking the StandardTokenizer and expanding or changing
> it based on other tokenizers available in Lucene, such as
> WhitespaceTokenizer.
What about using the WhitespaceAnalyzer directly? Maybe that fits your
requirements better, and you could use it for both languages.
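To see why whitespace tokenization can help with identifiers such as the
ABC-1234 case mentioned elsewhere in the thread, compare it with a tokenizer
that also breaks on punctuation. These are rough stand-ins for
WhitespaceTokenizer and StandardTokenizer, not the real Lucene classes:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerComparison {

    // Roughly what WhitespaceTokenizer does: split on whitespace only,
    // so a hyphenated identifier survives as one token.
    static List<String> whitespaceTokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Rough stand-in for a tokenizer that also breaks on punctuation,
    // which is closer to how StandardTokenizer treats ABC-1234.
    static List<String> punctuationTokenize(String text) {
        return Arrays.asList(text.trim().split("[\\s\\p{Punct}]+"));
    }

    public static void main(String[] args) {
        String text = "document ABC-1234 attached";
        System.out.println(whitespaceTokenize(text));  // [document, ABC-1234, attached]
        System.out.println(punctuationTokenize(text)); // [document, ABC, 1234, attached]
    }
}
```

The trade-off: whitespace-only tokenization also keeps trailing punctuation
attached to words ("attached." stays one token), so it is not a free win.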
Bernhard
Re: English and French documents together / analysis, indexing, searching
Posted by Otis Gospodnetic <ot...@yahoo.com>.
That would be a partial solution. Accents will not be a problem any
more, but if you use an Analyzer that stems tokens, they will not really
be stemmed properly. Searches will probably work, but if you look at
the index you will see that some terms were not analyzed properly. Still,
it may be sufficient for your needs, so try accent removal alone first.
Otis
--- "amigo@max3d.com" <am...@max3d.com> wrote:
> Morus Walter said the following on 1/21/2005 2:14 AM:
>
> > No. You could do a ( ( french-query ) or ( english-query ) ) construct
> > using one query. So query construction would be a bit more complex, but
> > querying itself wouldn't change.
> >
> > The first thing I'd do in your case would be to look at the differences
> > in the output of the English and French Snowball stemmers.
> > I don't speak any French, but you might even be able to use both
> > stemmers on all texts.
> >
> > Morus
>
> I've done some thinking afterwards, and instead of messing with complex
> queries, would it make sense to replace all "special" characters such as
> "é" and "è" with "e" during indexing (I suppose by writing a custom
> analyzer), and then during searching parse the query and replace all
> occurrences of special characters (if any) with their plain Latin
> equivalents?
>
> This should produce the required results, no? Since the index would not
> contain any French characters, searching for French words would return
> them, since they were indexed as plain words.
>
> -pedja
Re: English and French documents together / analysis, indexing, searching
Posted by "amigo@max3d.com" <am...@max3d.com>.
Morus Walter said the following on 1/21/2005 2:14 AM:
> No. You could do a ( ( french-query ) or ( english-query ) ) construct
> using one query. So query construction would be a bit more complex, but
> querying itself wouldn't change.
>
> The first thing I'd do in your case would be to look at the differences
> in the output of the English and French Snowball stemmers.
> I don't speak any French, but you might even be able to use both
> stemmers on all texts.
>
> Morus
I've done some thinking afterwards, and instead of messing with complex
queries, would it make sense to replace all "special" characters such as
"é" and "è" with "e" during indexing (I suppose by writing a custom
analyzer), and then during searching parse the query and replace all
occurrences of special characters (if any) with their plain Latin
equivalents?
This should produce the required results, no? Since the index would not
contain any French characters, searching for French words would return
them, since they were indexed as plain words.
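The replacement described above can be done generically with Unicode
normalization instead of listing each accented character by hand; the same
folding must be applied at both index time and query time to keep the terms
consistent. A minimal plain-Java sketch (assuming a modern JDK with
java.text.Normalizer; this is not a Lucene TokenFilter, though the same
logic could be wrapped in one):

```java
import java.text.Normalizer;

public class AccentFolder {

    // Decompose accented characters (é -> e + combining acute accent),
    // then strip the combining marks, leaving the plain Latin letter.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("été"));      // ete
        System.out.println(fold("élève"));    // eleve
        System.out.println(fold("analyzer")); // analyzer (unchanged)
    }
}
```

Because English text passes through unchanged, the same fold can run on
every document regardless of language.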
-pedja
Re: English and French documents together / analysis, indexing, searching
Posted by Morus Walter <mo...@tanto.de>.
amigo@max3d.com writes:
>
> > you could try to create a more complex query and expand it into both
> > languages using different analyzers. Would this solve your problem ?
> >
> Would that mean I would have to actually conduct two searches (one in
> English and one in French) then merge the results and display them to
> the user?
No. You could do a ( ( french-query ) or ( english-query ) ) construct using
one query. So query construction would be a bit more complex but querying
itself wouldn't change.
The first thing I'd do in your case would be to look at the differences
in the output of the English and French Snowball stemmers.
I don't speak any French, but you might even be able to use both stemmers
on all texts.
Morus
Re: English and French documents together / analysis, indexing, searching
Posted by "amigo@max3d.com" <am...@max3d.com>.
> you could try to create a more complex query and expand it into both
> languages using different analyzers. Would this solve your problem ?
>
Would that mean I would have to actually conduct two searches (one in
English and one in French), then merge the results and display them to
the user?
It sounds to me like a long way around, so would actually writing an
analyzer that has a language guesser be a better solution in the long run?
>
> This behaviour is implemented in the StandardTokenizer used by
> StandardAnalyzer. Look at the documentation of StandardTokenizer:
>
> Many applications have specific tokenizer needs. If this tokenizer
> does not suit your application, please consider copying this source code
> directory to your project and maintaining your own grammar-based
> tokenizer.
Hmm, I feel writing my own tokenizer is beyond my abilities at the moment,
without more in-depth knowledge of everything else.
Perhaps I'll try taking the StandardTokenizer and expanding or changing it
based on other tokenizers available in Lucene, such as WhitespaceTokenizer.
thanks
-pedja
Re: English and French documents together / analysis, indexing, searching
Posted by Bernhard Messer <be...@intrafind.de>.
> Right now I am using StandardAnalyzer, but the results are not what I'd
> hoped for. Also, since my understanding is that we should use the same
> analyzer for searching that was used for indexing, even if I can manage
> to guess the language during indexing and apply the Snowball analyzer, I
> wouldn't be able to use Snowball for searching, because users want to
> search through both English and French, and I suppose I would not get
> the same results if I used StandardAnalyzer?
you could try to create a more complex query and expand it into both
languages using different analyzers. Would this solve your problem ?
>
>
> Another problem with StandardAnalyzer is that it breaks up some words
> that should not be broken (in our case document identifiers such as
> ABC-1234 etc) but that's a secondary issue...
This behaviour is implemented in the StandardTokenizer used by
StandardAnalyzer. Look at the documentation of StandardTokenizer:
Many applications have specific tokenizer needs. If this tokenizer does
not suit your application, please consider copying this source code
directory to your project and maintaining your own grammar-based tokenizer.
Bernhard
Re: English and French documents together / analysis, indexing, searching
Posted by "amigo@max3d.com" <am...@max3d.com>.
Right now I am using StandardAnalyzer, but the results are not what I'd
hoped for. Also, since my understanding is that we should use the same
analyzer for searching that was used for indexing, even if I can manage
to guess the language during indexing and apply the Snowball analyzer, I
wouldn't be able to use Snowball for searching, because users want to
search through both English and French, and I suppose I would not get
the same results if I used StandardAnalyzer?
Another problem with StandardAnalyzer is that it breaks up some tokens
that should not be broken (in our case, document identifiers such as
ABC-1234), but that's a secondary issue...
thanks
-pedja
Bernhard Messer said the following on 1/20/2005 1:05 PM:
> I think the easiest way is to use Lucene's StandardAnalyzer. If you
> want to use the snowball stemmers, you have to add a language guesser
> to get the language for the particular document before creating the
> analyzer.
>
> regards
> Bernhard
>
> amigo@max3d.com schrieb:
>
>> Greetings everyone
>>
>> I wonder is there a solution for analyzing both English and French
>> documents using the same analyzer.
>> Reason being is that we have predominantly English documents but
>> there are some French, yet it all has to go into the same index
>> and be searchable from the same location during any particular
>> search. Is there a way to analyze both types of documents with
>> a same analyzer (and which one)?
>>
>> I've looked around and I see there's a SnowBall analyzer but you have
>> to specify the language of analysis, and I do not know that
>> ahead of time during indexing nor do I know it most of the time
>> during searching (users would like to search in both document types).
>>
>> There's also the issue of letter accents in French words and
>> searching for them (how are they indexed in the first place?).
>> Has anyone dealt with this before and how did you solve the problem?
>>
>> thanks
>>
>> -pedja
>>
>>
>>
Re: English and French documents together / analysis, indexing, searching
Posted by Bernhard Messer <be...@intrafind.de>.
I think the easiest way is to use Lucene's StandardAnalyzer. If you
want to use the Snowball stemmers, you have to add a language guesser to
determine the language of each particular document before creating the
analyzer.
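For a two-language corpus, a language guesser need not be elaborate:
counting a handful of common function words per language is often enough
for whole documents (short queries are another matter). A toy sketch, with
tiny illustrative word lists that are assumptions rather than a vetted
resource, and written against a modern JDK for brevity:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LanguageGuesser {

    // Tiny illustrative stopword lists; a real guesser would use far
    // larger ones, or character n-gram statistics.
    static final Set<String> ENGLISH = new HashSet<>(Arrays.asList(
            "the", "and", "of", "to", "is", "in", "that", "with"));
    static final Set<String> FRENCH = new HashSet<>(Arrays.asList(
            "le", "la", "les", "et", "de", "est", "dans", "que", "avec"));

    // Guess the language of a document by counting common function words.
    static String guess(String text) {
        int en = 0, fr = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            if (ENGLISH.contains(word)) en++;
            if (FRENCH.contains(word)) fr++;
        }
        return fr > en ? "fr" : "en";
    }

    public static void main(String[] args) {
        System.out.println(guess("the cat is in the garden"));   // en
        System.out.println(guess("le chat est dans le jardin")); // fr
    }
}
```

The guessed code would then select which analyzer to construct for that
document before handing it to the IndexWriter.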
regards
Bernhard
amigo@max3d.com schrieb:
> Greetings everyone
>
> I wonder is there a solution for analyzing both English and French
> documents using the same analyzer.
> Reason being is that we have predominantly English documents but there
> are some French, yet it all has to go into the same index
> and be searchable from the same location during any particular search.
> Is there a way to analyze both types of documents with
> a same analyzer (and which one)?
>
> I've looked around and I see there's a SnowBall analyzer but you have
> to specify the language of analysis, and I do not know that
> ahead of time during indexing nor do I know it most of the time during
> searching (users would like to search in both document types).
>
> There's also the issue of letter accents in French words and searching
> for them (how are they indexed in the first place?).
> Has anyone dealt with this before and how did you solve the problem?
>
> thanks
>
> -pedja
>
>
>