Posted to java-user@lucene.apache.org by Bill Janssen <ja...@parc.com> on 2010/09/25 03:58:07 UTC

finding the analyzer for a language...

I thought that since I'm updating UpLib's Lucene code, I should tackle
the issue of document languages, as well.  Right now I'm using an
off-the-shelf language identifier, textcat, to figure out which language
a Web page or PDF is (mainly) written in.  I'd then like to analyze that
document with the correct Lucene analyzer for that language, falling
back to StandardAnalyzer if the installed Lucene library doesn't have an
analyzer for that language.

It would be *very* handy if Analyzer had a static method

  static Analyzer getAnalyzerForLanguage(String rfc_4646_lang_tag);

Right now I'm consulting a hand-compiled mapping of
langtag-to-Lucene-classname to figure out which Analyzer to use.
Wearisome, and it will be out-of-date for future releases of Lucene,
which will presumably support more languages.
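
For illustration only, here is a minimal sketch of the kind of
hand-compiled mapping described above.  Nothing in it is Lucene API:
LanguageAnalyzerRegistry, register, lookup and lookupOrDefault are
hypothetical names, and the type parameter A stands in for Lucene's
Analyzer so that the sketch compiles without Lucene on the classpath.
It assumes that dropping trailing subtags ("en-US" to "en") is an
acceptable way to find a more general analyzer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

/**
 * Hypothetical registry mapping language tags to analyzer factories.
 * A is a stand-in for org.apache.lucene.analysis.Analyzer.
 */
final class LanguageAnalyzerRegistry<A> {
    private final Map<String, Supplier<A>> byTag = new HashMap<>();

    /** Registers a factory under a tag such as "de" or "zh-Hans". */
    void register(String tag, Supplier<A> factory) {
        byTag.put(normalize(tag), factory);
    }

    /**
     * Returns an instance for the most specific registered tag, dropping
     * trailing subtags one at a time ("en-US" -> "en").  Returns null when
     * nothing matches, so the caller decides how to deal with that.
     */
    A lookup(String tag) {
        String candidate = normalize(tag);
        while (!candidate.isEmpty()) {
            Supplier<A> factory = byTag.get(candidate);
            if (factory != null) {
                return factory.get();
            }
            int cut = candidate.lastIndexOf('-');
            candidate = cut < 0 ? "" : candidate.substring(0, cut);
        }
        return null;
    }

    /** Same as lookup, with an explicit fallback (e.g. StandardAnalyzer). */
    A lookupOrDefault(String tag, Supplier<A> fallback) {
        A found = lookup(tag);
        return found != null ? found : fallback.get();
    }

    /** Lower-cases the tag and accepts "_" as a subtag separator. */
    private static String normalize(String tag) {
        return tag.trim().replace('_', '-').toLowerCase(Locale.ROOT);
    }
}
```

With real Lucene classes one would register something like
() -> new GermanAnalyzer(...) under "de"; the constructor arguments
depend on the Lucene version in use, so they are left out here.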

Secondly, if I've got an instance of a SnowballAnalyzer, there's no way
to look "inside" it, and see what language it's for.  That's a problem
on the search side.  My QueryParser is a subclass of
MultiFieldQueryParser, and it looks for a "special" FieldQuery on the
field "_query_language", e.g., "_query_language:de" to tell the query
parser to use a German analyzer on this query.  What I'd like to be able
to do is interrogate the current analyzer attached to the query parser
instance, and throw an exception if it's not for the specified language.
I can do this for non-Snowball analyzers, because of the brittle
hand-compiled mapping mentioned above.  But if it's a SnowballAnalyzer,
there's no way to tell what the language inside it is.  So it would be
nice if SnowballAnalyzer grew a method

  String getLanguageName();

Even better would be

  String getLanguageTag();

And, it would be nice if QueryParser grew a method

  void setAnalyzer(Analyzer a);

which would allow me to simply replace the current analyzer for the
parsing of the rest of the query, instead of going through the rigmarole
of throwing an exception, catching it, recreating the QueryParser with a
different analyzer, and trying again.  What would break if you changed
the analyzer in midstream?  Wouldn't it simply be used for analyzing
remaining terms in the query?
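
One way to sidestep the throw-catch-recreate cycle, sketched here under
assumptions, is to pull the language directive out of the raw query
string before any QueryParser exists, and then construct the parser
once with the right analyzer.  QueryLanguageDirective is a hypothetical
helper, not part of Lucene, and this naive scan does not understand
query-syntax quoting, so a directive inside a quoted phrase would be
stripped too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical pre-pass that extracts a "_query_language:xx" clause
 * from the raw query text before a QueryParser is constructed.
 */
final class QueryLanguageDirective {
    // A whitespace-delimited token "_query_language:<tag>", where <tag>
    // looks like a language tag such as "de" or "zh-Hans".
    private static final Pattern DIRECTIVE = Pattern.compile(
        "(?<!\\S)_query_language:([A-Za-z]{2,8}(?:-[A-Za-z0-9]{1,8})*)(?!\\S)");

    final String languageTag;     // null when the query has no directive
    final String remainingQuery;  // query text with the directive removed

    private QueryLanguageDirective(String languageTag, String remainingQuery) {
        this.languageTag = languageTag;
        this.remainingQuery = remainingQuery;
    }

    /** Splits the first directive (if any) from the rest of the query. */
    static QueryLanguageDirective parse(String rawQuery) {
        Matcher m = DIRECTIVE.matcher(rawQuery);
        if (!m.find()) {
            return new QueryLanguageDirective(null, rawQuery.trim());
        }
        String before = rawQuery.substring(0, m.start()).trim();
        String after = rawQuery.substring(m.end()).trim();
        String rest = (before + " " + after).trim();
        return new QueryLanguageDirective(m.group(1), rest);
    }
}
```

The caller would look up an analyzer for languageTag (when it is not
null), build the query parser with it, and parse remainingQuery.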

I see that Robert Muir has been doing a lot of good work on the Snowball
code.  I'd really like to see the stopword work finished, so that a
SnowballAnalyzer for a particular language has a decent set of
stopwords.

Bill


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: finding the analyzer for a language...

Posted by Itamar Syn-Hershko <it...@code972.com>.

Shai, I was referring to your #2, which you already indicated in your 
reply wasn't part of the discussion.

Itamar.

On 26/9/2010 10:10 AM, Shai Erera wrote:
> The mapping is simply about returning the right Analyzer for the given
> Locale. You decide up front (as the Factory developer) what Analyzer /
> Tokenizer + TokenFilters combination you want to return for each language,
> and then when that language is input, you return it. That's it.
>
> Can you define mixed content? There are two possibilities:
>
> 1) Indexing documents of different languages. In that case, you need to know
> what's the document language, and then you use IndexWriter.addDocument(doc,
> analyzer) method, instead of relying on the default analyzer you pass to
> IndexWriterConfig.
>
> 2) Indexing documents that include text in multiple languages -- this is a
> complicated case and you need auto-language identification at the Tokenizer
> level. This is not the case where a Factory would be useful.
>
> Shai


Re: finding the analyzer for a language...

Posted by Shai Erera <se...@gmail.com>.

The mapping is simply about returning the right Analyzer for the given
Locale. You decide up front (as the Factory developer) what Analyzer /
Tokenizer + TokenFilters combination you want to return for each language,
and then when that language is input, you return it. That's it.
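
A minimal sketch of such a Locale-keyed factory, under assumptions:
LocaleAnalyzerFactory, with and forLocale are hypothetical names, and
the type parameter A stands in for Lucene's Analyzer (or for whatever
Tokenizer + TokenFilters chain gets built), so the sketch does not
depend on any particular Lucene version.  It keys only on the language
part of the Locale and ignores country and variant.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

/**
 * Hypothetical Locale-based factory.  The factory developer decides up
 * front which chain to build per language; callers only pass a Locale.
 */
final class LocaleAnalyzerFactory<A> {
    private final Map<String, Supplier<A>> byLanguage = new HashMap<>();

    /** Declares the chain to build for the language of the given Locale. */
    LocaleAnalyzerFactory<A> with(Locale locale, Supplier<A> chain) {
        byLanguage.put(locale.getLanguage(), chain);
        return this;
    }

    /** Returns a new chain for the Locale's language, or empty if none. */
    Optional<A> forLocale(Locale locale) {
        Supplier<A> chain = byLanguage.get(locale.getLanguage());
        return chain == null ? Optional.empty() : Optional.of(chain.get());
    }
}
```

Returning an empty Optional rather than a default leaves the fallback
decision to the caller.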

Can you define mixed content? There are two possibilities:

1) Indexing documents of different languages. In that case, you need to know
what's the document language, and then you use IndexWriter.addDocument(doc,
analyzer) method, instead of relying on the default analyzer you pass to
IndexWriterConfig.

2) Indexing documents that include text in multiple languages -- this is a
complicated case and you need auto-language identification at the Tokenizer
level. This is not the case where a Factory would be useful.

Shai

On Sun, Sep 26, 2010 at 12:19 AM, Itamar Syn-Hershko <it...@code972.com> wrote:

> I may be missing the point here, but how do you define an analyzer <->
> language match? What do you do in cases of mixed content, for example?
>
> Itamar.
>
>
> On 25/9/2010 10:27 PM, Shai Erera wrote:
>
>>> Shai Erera brought a similar idea up before, to use Locale, but my concerns
>>> are it would be limited by Java's Locale mechanism... but we can figure this
>>> out.
>>>
>> It really depends how sophisticated you want such an AnalyzerFactory
>> (that's what I call it in my code) to be. We can define it to be a factory
>> for predefined languages (Locale-based) for the most common use cases. If
>> you want to have tighter control over the Analyzer you create, you can still
>> instantiate your own, or create a new one with a custom TokenFilters chain.
>>
>> As long as things are well documented, I don't see a reason why we cannot
>> start simple, and only if we find out that most users don't use 'simple' and
>> prefer to be allowed to specify more parameters (such as 'word' or 'ngram')
>> do we bring complication into the game.
>>
>> I'm offering Locale 'cause in most web applications that I know of, the
>> Locale is defined on the request and is often used to parse the user's
>> query, translate strings, etc.
>>
>> Anyway, it'd be great to have any such Factory, be it Locale-based or not,
>> because we have so many Analyzers already, and the way things stand today,
>> any user, even the simplest one, who wishes to support multi-lingual search
>> has to sift through all of them and decide what combination to use for each
>> language. And if the user ends up picking default values, then a Factory
>> would simplify matters for him.
>>
>> Shai
>>
>> On Sat, Sep 25, 2010 at 9:29 PM, Bill Janssen<ja...@parc.com>  wrote:
>>
>>
>>
>>> Robert Muir<rc...@gmail.com>  wrote:
>>>
>>>
>>>
>>>> On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen<ja...@parc.com>  wrote:
>>>>
>>>>
>>>>
>>>>> I thought that since I'm updating UpLib's Lucene code, I should tackle
>>>>> the issue of document languages, as well.  Right now I'm using an
>>>>> off-the-shelf language identifier, textcat, to figure out which language
>>>>> a Web page or PDF is (mainly) written in.  I then want to analyze that
>>>>> document with an appropriate analyzer.  I'd then like to map to the
>>>>> correct Lucene analyzer for that language, falling back to
>>>>> StandardAnalyzer if the installed Lucene library doesn't have an
>>>>> analyzer for that language.
>>>>>
>>>>> It would be *very* handy if Analyzer had a static method
>>>>>
>>>>>  static Analyzer getAnalyzerForLanguage(String rfc_4646_lang_tag);
>>>>>
>>>>>
>>>>>
>>>> I agree (not sure if it should be in Analyzer itself, maybe we could make
>>>> an Analyzer for this)...
>>>>
>>> Not sure I followed that...  I wanted to be able to retrieve an instance
>>> of an instantiated Analyzer class, the class that's "designed" to work
>>> with that language, if one exists, otherwise null.  And to have you guys
>>> keep that list up-to-date, instead of having to do it myself :-).
>>> Seemed to me that's the standard kind of thing you make a static method
>>> on the top-level class.
>>>
>>>
>>>
>>>> i mean it sounds like what you want, is for it to work in a similar way to
>>>> ResourceBundle's fallback mechanism?
>>>>
>>> I'm not sure that's appropriate.  I just want to retrieve an Analyzer
>>> for that language, if such a thing exists.  If by "fallback", you mean
>>> that "en-US" should just return EnglishAnalyzer if there's no analyzer
>>> specifically for US usage -- yes, that's fine.  On the other hand, I
>>> don't think there should be a fallback for languages which have no
>>> macrolanguage Analyzer -- it should just return null or throw an
>>> exception.  The programmer can then explicitly decide how to deal with
>>> that response.
>>>
>>>
>>>
>>>> And I agree with your idea of rfc3066/4646, e.g. you might want to specify
>>>> subtags like "word" (SmartChineseAnalyzer) or "ngram" (CJKAnalyzer) for
>>>> chinese somehow?
>>>>
>>>>
>>> Yes, good idea.  Might be interesting to see if those kind of subtags
>>> can be registered with IANA, too.
>>>
>>> Although, if one is smart enough about Lucene and one's application to
>>> make these kinds of judgement calls, I think one is probably smart
>>> enough to know which class to use without consulting a generic
>>> mechanism.
>>>
>>>
>>>
>>>> Shai Erera brought a similar idea up before, to use Locale, but my concerns
>>>> are it would be limited by Java's Locale mechanism... but we can figure this
>>>> out.
>>>>
>>>> Maybe you want to create a JIRA issue to pursue this idea further? See
>>>> http://wiki.apache.org/lucene-java/HowToContribute
>>>>
>>>>
>>>>
>>>>
>>>>> Right now I'm consulting a hand-compiled mapping of
>>>>> langtag-to-Lucene-classname to figure out which Analyzer to use.
>>>>> Wearisome, and it will be out-of-date for future releases of Lucene,
>>>>> which will presumably support more languages.
>>>>>
>>>>>
>>>>>
>>>> yes, but it also brings up interesting backwards compatibility challenges.
>>>> Because if we add more analyzers, say EsperantoAnalyzer, if you upgrade
>>>> lucene then suddenly your Esperanto queries are analyzed differently
>>>> (whereas they were dealt with by StandardAnalyzer before).
>>>>
>>>>
>>> Yes, presumably the Version would need to be used with this, too.
>>>
>>>
>>>
>>>> But this becomes less of a problem as we work on modularizing lucene, so we
>>>> can remove Version from analyzers,
>>>>
>>>>
>>> Oh goody, another API change to cope with in my code.
>>>
>>>
>>>
>>>> and so you can just use an old analyzers
>>>> jar file (such as 4.1) but upgrade your lucene core jar to say version 4.3.
>>>>
>>>>> Secondly, if I've got an instance of a SnowballAnalyzer, there's no way
>>>>> to look "inside" it, and see what language it's for.  That's a problem
>>>>> on the search side.  My QueryParser is a subclass of
>>>>> MultiFieldQueryParser, and it looks for a "special" FieldQuery on the
>>>>> field "_query_language", i.e., "_query_language:de" to tell the query
>>>>> parser to use a German analyzer on this query.  What I'd like to be able
>>>>> to do is interrogate the current analyzer attached to the query parser
>>>>> instance, and throw an exception if it's not for the specified language.
>>>>> I can do this for non-Snowball analyzers, because of the brittle
>>>>> hand-compiled mapping mentioned above.  But if it's a SnowballAnalyzer,
>>>>> there's no way to tell what the language inside it is.  So it would be
>>>>> nice if SnowballAnalyzer grew a method
>>>>>
>>>> SnowballAnalyzer had more problems. It's actually deprecated in
>>>> trunk/branch_3x and instead there is an Analyzer for each language (English,
>>>> Italian, etc), which now has stopwords lists, and sometimes special behavior
>>>> (e.g. Turkish lowercases differently).
>>>>
>>>> Put more simply, it's an implementation detail for ItalianAnalyzer that we
>>>> implement the stemming with SnowballFilter. One day we might change it to
>>>> use a less aggressive stemming algorithm (e.g. ItalianLightStemFilter) by
>>>> default.
>>>>
>>>>
>>> Ah, good.  That will suit my purposes nicely.
>>>
>>>
>>>
>>>>> I'd really like to see the stopword work finished, so that a
>>>>> SnowballAnalyzer for a particular language has a decent set of
>>>>> stopwords.
>>>>>
>>>>>
>>>>>
>>>> See above, I think this is finished? The remaining work is actually Solr
>>>> integration.
>>>>
>>>>
>>> Excellent.  I looked at the JIRA, but some discussions just seem to
>>> peter out, and I'm having a hard time telling what the resolution is.
>>>
>>>
>>>
>>>> In trunk and branch_3x, all the analyzers have their own package, here's
>>>> Italian:
>>>>
>>>> Source package: contains Analyzer that uses SnowballFilter(Italian) and
>>>> loads Italian snowball stopwords by default. It also includes an
>>>> alternative, less aggressive stemmer.
>>>>
>>>> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/it/
>>>>
>>>> The snowball stopwords were all added to the resources directory. This is
>>>> where ItalianAnalyzer loads its set of stopwords from:
>>>>
>>>> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/
>>>>
>>> I see there's also an explicit EnglishAnalyzer -- never thought it made
>>> sense to call that StandardAnalyzer.  Great work!
>>>
>>> Bill

Re: finding the analyzer for a language...

Posted by Itamar Syn-Hershko <it...@code972.com>.

I may be missing the point here, but how do you define an analyzer <-> 
language match? What do you do in cases of mixed content, for example?

Itamar.

On 25/9/2010 10:27 PM, Shai Erera wrote:
>> Shai Erera brought a similar idea up before, to use Locale, but my concerns
>> are it would be limited by Java's Locale mechanism... but we can figure this
>> out.
>>
>>      
> It really depends how sophisticated you want such an AnalyzerFactory
> (that's what I call it in my code) to be. We can define it to be a factory
> for predefined languages (Locale-based) for the most common use cases. If
> you want to have tighter control over the Analyzer you create, you can still
> instantiate your own, or create a new one with a custom TokenFilters chain.
>
> As long as things are well documented, I don't see a reason why we cannot
> start simple, and only if we find out that most users don't use 'simple' and
> prefer to be allowed to specify more parameters (such as 'word' or 'ngram')
> do we bring complication into the game.
>
> I'm offering Locale 'cause in most web applications that I know of, the
> Locale is defined on the request and is often used to parse the user's
> query, translate strings, etc.
>
> Anyway, it'd be great to have any such Factory, be it Locale-based or not,
> because we have so many Analyzers already, and the way things stand today,
> any user, even the simplest one, who wishes to support multi-lingual search
> has to sift through all of them and decide what combination to use for each
> language. And if the user ends up picking default values, then a Factory
> would simplify matters for him.
>
> Shai
>
> On Sat, Sep 25, 2010 at 9:29 PM, Bill Janssen<ja...@parc.com>  wrote:
>
>    
>> Robert Muir<rc...@gmail.com>  wrote:
>>
>>      
>>> On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen<ja...@parc.com>  wrote:
>>>
>>>        
>>>> I thought that since I'm updating UpLib's Lucene code, I should tackle
>>>> the issue of document languages, as well.  Right now I'm using an
>>>> off-the-shelf language identifier, textcat, to figure out which language
>>>> a Web page or PDF is (mainly) written in.  I then want to analyze that
>>>> document with an appropriate analyzer.  I'd then like to map to the
>>>> correct Lucene analyzer for that language, falling back to
>>>> StandardAnalyzer if the installed Lucene library doesn't have an
>>>> analyzer for that language.
>>>>
>>>> It would be *very* handy if Analyzer had a static method
>>>>
>>>>   static Analyzer getAnalyzerForLanguage(String rfc_4646_lang_tag);
>>>>
>>>>          
>>> I agree (not sure if it should be in Analyzer itself, maybe we could make
>>> an Analyzer for this)...
>>>        
>> Not sure I followed that...  I wanted to be able to retrieve an instance
>> of an instantiated Analyzer class, the class that's "designed" to work
>> with that language, if one exists, otherwise null.  And to have you guys
>> keep that list up-to-date, instead of having to do it myself :-).
>> Seemed to me that's the standard kind of thing you make a static method
>> on the top-level class.
>>
>>      
>>> i mean it sounds like what you want, is for it to work in a similar way to
>>> ResourceBundle's fallback mechanism?
>>>        
>> I'm not sure that's appropriate.  I just want to retrieve an Analyzer
>> for that language, if such a thing exists.  If by "fallback", you mean
>> that "en-US" should just return EnglishAnalyzer if there's no analyzer
>> specifically for US usage -- yes, that's fine.  On the other hand, I
>> don't think there should be a fallback for languages which have no
>> macrolanguage Analyzer -- it should just return null or throw an
>> exception.  The programmer can then explicitly decide how to deal with
>> that response.
>>
>>      
>>> And I agree with your idea of rfc3066/4646, e.g. you might want to specify
>>> subtags like "word" (SmartChineseAnalyzer) or "ngram" (CJKAnalyzer) for
>>> chinese somehow?
>>>        
>> Yes, good idea.  Might be interesting to see if those kind of subtags
>> can be registered with IANA, too.
>>
>> Although, if one is smart enough about Lucene and one's application to
>> make these kinds of judgement calls, I think one is probably smart
>> enough to know which class to use without consulting a generic
>> mechanism.
>>
>>      
>>> Shai Erera brought a similar idea up before, to use Locale, but my concerns
>>> are it would be limited by Java's Locale mechanism... but we can figure this
>>> out.
>>>
>>> Maybe you want to create a JIRA issue to pursue this idea further? See
>>> http://wiki.apache.org/lucene-java/HowToContribute
>>>
>>>
>>>        
>>>> Right now I'm consulting a hand-compiled mapping of
>>>> langtag-to-Lucene-classname to figure out which Analyzer to use.
>>>> Wearisome, and it will be out-of-date for future releases of Lucene,
>>>> which will presumably support more languages.
>>>>
>>>>          
>>> yes, but it also brings up interesting backwards compatibility challenges.
>>> Because if we add more analyzers, say EsperantoAnalyzer, if you upgrade
>>> lucene then suddenly your Esperanto queries are analyzed differently
>>> (whereas they were dealt with by StandardAnalyzer before).
>>>        
>> Yes, presumably the Version would need to be used with this, too.
>>
>>      
>>> But this becomes less of a problem as we work on modularizing lucene, so we
>>> can remove Version from analyzers,
>>>        
>> Oh goody, another API change to cope with in my code.
>>
>>      
>>> and so you can just use an old analyzers
>>> jar file (such as 4.1) but upgrade your lucene core jar to say version 4.3.
>>      
>>>
>>>        
>>>> Secondly, if I've got an instance of a SnowballAnalyzer, there's no way
>>>> to look "inside" it, and see what language it's for.  That's a problem
>>>> on the search side.  My QueryParser is a subclass of
>>>> MultiFieldQueryParser, and it looks for a "special" FieldQuery on the
>>>> field "_query_language", i.e., "_query_language:de" to tell the query
>>>> parser to use a German analyzer on this query.  What I'd like to be able
>>>> to do is interrogate the current analyzer attached to the query parser
>>>> instance, and throw an exception if it's not for the specified language.
>>>> I can do this for non-Snowball analyzers, because of the brittle
>>>> hand-compiled mapping mentioned above.  But if it's a SnowballAnalyzer,
>>>> there's no way to tell what the language inside it is.  So it would be
>>>> nice if SnowballAnalyzer grew a method
>>>>
>>> SnowballAnalyzer had more problems. its actually deprecated in
>>> trunk/branch_3x and instead there is an Analyzer for each language (English,
>>> Italian, etc), which now has stopwords lists, and sometimes special behavior
>>> (e.g. Turkish lowercases differently).
>>>
>>> Put more simply, its an implementation detail for ItalianAnalyzer that we
>>> implement the stemming with SnowballFilter. One day we might change it to
>>> use a less aggressive stemming algorithm (e.g. ItalianLightStemFilter) by
>>> default.
>>
>> Ah, good.  That will suit my purposes nicely.
>>
>>>> I'd really like to see the stopword work finished, so that a
>>>> SnowballAnalyzer for a particular language has a decent set of
>>>> stopwords.
>>>>
>>> See above, I think this is finished? The remaining work is actually Solr
>>> integration.
>>
>> Excellent.  I looked at the JIRA, but some discussions just seem to
>> peter out, and I'm having a hard time telling what the resolution is.
>>
>>> In trunk and branch_3x, all the analyzers have their own package, here's
>>> Italian:
>>>
>>> Source package: contains Analyzer that uses SnowballFilter(Italian) and
>>> loads Italian snowball stopwords by default. It also includes an
>>> alternative, less aggressive stemmer.
>>> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/it/
>>>
>>> The snowball stopwords were all added to the resources directory. This is
>>> where ItalianAnalyzer loads its set of stopwords from:
>>> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/
>>
>> I see there's also an explicit EnglishAnalyzer -- never thought it made
>> sense to call that StandardAnalyzer.  Great work!
>>
>> Bill
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: finding the analyzer for a language...

Posted by Shai Erera <se...@gmail.com>.

>
> Shai Erera brought a similar idea up before, to use Locale, but my concerns
> are it would be limited by javas Locale mechanism... but we can figure this
> out.
>

It really depends on how sophisticated you want such an AnalyzerFactory
(that's what I call it in my code) to be. We can define it to be a factory
for predefined languages (Locale-based) for the most common use cases. If
you want tighter control over the Analyzer you create, you can still
instantiate your own, or create a new one with a custom TokenFilter chain.
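
A minimal sketch of the Locale-based lookup described above, under stated
assumptions: LocaleAnalyzerLookup and analyzerClassFor are hypothetical names,
not part of Lucene, and the table maps language codes to analyzer class names
(as strings) so that the example runs without Lucene on the classpath. A real
factory would presumably construct Analyzer instances instead.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch only: this is not an existing Lucene class.
final class LocaleAnalyzerLookup {
    // Language code -> analyzer class name. The entries are illustrative
    // and would have to be kept in sync with the installed analyzers.
    private static final Map<String, String> BY_LANGUAGE = new HashMap<>();
    static {
        BY_LANGUAGE.put("en", "org.apache.lucene.analysis.en.EnglishAnalyzer");
        BY_LANGUAGE.put("it", "org.apache.lucene.analysis.it.ItalianAnalyzer");
        BY_LANGUAGE.put("de", "org.apache.lucene.analysis.de.GermanAnalyzer");
    }

    // Only the language part of the Locale is consulted; country and
    // variant are ignored in this simple version.
    static Optional<String> analyzerClassFor(Locale locale) {
        return Optional.ofNullable(BY_LANGUAGE.get(locale.getLanguage()));
    }

    public static void main(String[] args) {
        // A request Locale such as it_IT resolves through its language code.
        System.out.println(analyzerClassFor(Locale.ITALY).orElse("(none)"));
        // A language with no entry yields an empty result, leaving the
        // caller to decide what to do.
        System.out.println(analyzerClassFor(Locale.forLanguageTag("eo")).orElse("(none)"));
    }
}
```

Anyone who needs more control than the table offers can bypass it and
instantiate an analyzer directly.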

As long as things are well documented, I don't see a reason why we cannot
start simple, and bring in more complexity only if we find out that most
users don't use 'simple' and prefer to be able to specify more parameters
(such as 'word' or 'ngram').

I'm suggesting Locale because in most web applications that I know of, the
Locale is defined on the request and is often used to parse the user's
query, translate strings, etc.

Anyway, it'd be great to have any such Factory, be it Locale-based or not,
because we already have so many Analyzers, and the way things stand today,
any user, even the simplest one, who wishes to support multilingual search
has to sift through all of them and decide what combination to use for each
language. And if the user ends up picking default values, then a Factory
would simplify matters for him.

Shai

On Sat, Sep 25, 2010 at 9:29 PM, Bill Janssen <ja...@parc.com> wrote:

> Robert Muir <rc...@gmail.com> wrote:
>
> > On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen <ja...@parc.com> wrote:
> >
> > > I thought that since I'm updating UpLib's Lucene code, I should tackle
> > > the issue of document languages, as well.  Right now I'm using an
> > > off-the-shelf language identifier, textcat, to figure out which language
> > > a Web page or PDF is (mainly) written in.  I then want to analyze that
> > > document with an appropriate analyzer.  I'd then like to map to the
> > > correct Lucene analyzer for that language, falling back to
> > > StandardAnalyzer if the installed Lucene library doesn't have an
> > > analyzer for that language.
> > >
> > > It would be *very* handy if Analyzer had a static method
> > >
> > >  static Analyzer getAnalyzerForLanguage(String rfc_4646_lang_tag);
> > >
> >
> > I agree (not sure if it should be in Analyzer itself, maybe we could make an
> > Analyzer for this)...
>
> Not sure I followed that...  I wanted to be able to retrieve an instance
> of an instantiated Analyzer class, the class that's "designed" to work
> with that language, if one exists, otherwise null.  And to have you guys
> keep that list up-to-date, instead of having to do it myself :-).
> Seemed to me that's the standard kind of thing you make a static method
> on the top-level class.
>
> > i mean it sounds like what you want, is for it to work in a similar way to
> > ResourceBundle's fallback mechanism?
>
> I'm not sure that's appropriate.  I just want to retrieve an Analyzer
> for that language, if such a thing exists.  If by "fallback", you mean
> that "en-US" should just return EnglishAnalyzer if there's no analyzer
> specifically for US usage -- yes, that's fine.  On the other hand, I
> don't think there should be a fallback for languages which have no
> macrolanguage Analyzer -- it should just return null or throw an
> exception.  The programmer can then explicitly decide how do deal with
> that response.
>
> > And I agree with your idea of rfc3066/4646, e.g. you might want to specify
> > subtags like "word" (SmartChineseAnalyzer) or "ngram" (CJKAnalyzer) for
> > chinese somehow?
>
> Yes, good idea.  Might be interesting to see if those kind of subtags
> can be registered with IANA, too.
>
> Although, if one is smart enough about Lucene and one's application to
> make these kinds of judgement calls, I think one is probably smart
> enough to know which class to use without consulting a generic
> mechanism.
>
> > Shai Erera brought a similar idea up before, to use Locale, but my concerns
> > are it would be limited by javas Locale mechanism... but we can figure this
> > out.
> >
> > Maybe you want to create a JIRA issue to pursue this idea further? See
> > http://wiki.apache.org/lucene-java/HowToContribute
> >
> >
> > > Right now I'm consulting a hand-compiled mapping of
> > > langtag-to-Lucene-classname to figure out which Analyzer to use.
> > > Wearisome, and it will be out-of-date for future releases of Lucenen
> > > which will presumably support more languages.
> > >
> >
> > yes, but it also brings up interesting backwards compatibility challenges.
> > Because if we add more analyzers, say EsperantoAnalyzer, if you upgrade
> > lucene then suddenly your Esperanto queries are analyzed differently
> > (whereas they were dealt with by StandardAnalyzer before).
>
> Yes, presumably the Version would need to be used with this, too.
>
> > But this becomes less of a problem as we work on modularizing lucene, so we
> > can remove Version from analyzers,
>
> Oh goody, another API change to cope with in my code.
>
> > and so you can just use an old analyzers
> > jar file (such as 4.1) but upgrade your lucene core jar to say version 4.3.
> >
> >
> > >
> > > Secondly, if I've got an instance of a SnowballAnalyzer, there's no way
> > > to look "inside" it, and see what language it's for.  That's a problem
> > > on the search side.  My QueryParser is a subclass of
> > > MultiFieldQueryParser, and it looks for a "special" FieldQuery on the
> > > field "_query_language", i.e., "_query_language:de" to tell the query
> > > parser to use a German analyzer on this query.  What I'd like to be able
> > > to do is interrogate the current analyzer attached to the query parser
> > > instance, and throw an exception if it's not for the specified language.
> > > I can do this for non-Snowball analyzers, because of the brittle
> > > hand-compiled mapping mentioned above.  But if it's a SnowballAnalyzer,
> > > there's no way to tell what the language inside it is.  So it would be
> > > nice if SnowballAnalyzer grew a method
> > >
> >
> > SnowballAnalyzer had more problems. its actually deprecated in
> > trunk/branch_3x and instead there is an Analyzer for each language (English,
> > Italian, etc), which now has stopwords lists, and sometimes special behavior
> > (e.g. Turkish lowercases differently).
> >
> > Put more simply, its an implementation detail for ItalianAnalyzer that we
> > implement the stemming with SnowballFilter. One day we might change it to
> > use a less aggressive stemming algorithm (e.g. ItalianLightStemFilter) by
> > default.
>
> Ah, good.  That will suit my purposes nicely.
>
> > > I'd really like to see the stopword work finished, so that a
> > > SnowballAnalyzer for a particular language has a decent set of
> > > stopwords.
> > >
> >
> > See above, I think this is finished? The remaining work is actually Solr
> > integration.
>
> Excellent.  I looked at the JIRA, but some discussions just seem to
> peter out, and I'm having a hard time telling what the resolution is.
>
> > In trunk and branch_3x, all the analyzers have their own package, here's
> > Italian:
> >
> > Source package: contains Analyzer that uses SnowballFilter(Italian) and
> > loads Italian snowball stopwords by default. It also includes an
> > alternative, less aggressive stemmer.
> >
> > http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/it/
> >
> > The snowball stopwords were all added to the resources directory. This is
> > where ItalianAnalyzer loads its set of stopwords from:
> > http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/
>
> I see there's also an explicit EnglishAnalyzer -- never thought it made
> sense to call that StandardAnalyzer.  Great work!
>
> Bill
>

Re: finding the analyzer for a language...

Posted by Bill Janssen <ja...@parc.com>.

Robert Muir <rc...@gmail.com> wrote:

> On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen <ja...@parc.com> wrote:
> 
> > I thought that since I'm updating UpLib's Lucene code, I should tackle
> > the issue of document languages, as well.  Right now I'm using an
> > off-the-shelf language identifier, textcat, to figure out which language
> > a Web page or PDF is (mainly) written in.  I then want to analyze that
> > document with an appropriate analyzer.  I'd then like to map to the
> > correct Lucene analyzer for that language, falling back to
> > StandardAnalyzer if the installed Lucene library doesn't have an
> > analyzer for that language.
> >
> > It would be *very* handy if Analyzer had a static method
> >
> >  static Analyzer getAnalyzerForLanguage(String rfc_4646_lang_tag);
> >
> 
> I agree (not sure if it should be in Analyzer itself, maybe we could make an
> Analyzer for this)...

Not sure I followed that...  I wanted to be able to retrieve an instance
of an instantiated Analyzer class, the class that's "designed" to work
with that language, if one exists, otherwise null.  And to have you guys
keep that list up-to-date, instead of having to do it myself :-).
Seemed to me that's the standard kind of thing you make a static method
on the top-level class.

> i mean it sounds like what you want, is for it to work in a similar way to
> ResourceBundle's fallback mechanism?

I'm not sure that's appropriate.  I just want to retrieve an Analyzer
for that language, if such a thing exists.  If by "fallback", you mean
that "en-US" should just return EnglishAnalyzer if there's no analyzer
specifically for US usage -- yes, that's fine.  On the other hand, I
don't think there should be a fallback for languages which have no
macrolanguage Analyzer -- it should just return null or throw an
exception.  The programmer can then explicitly decide how to deal with
that response.
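
The fallback described above can be sketched as follows. This is an
illustration under stated assumptions, not a proposed API: LangTagFallback,
register and lookup are hypothetical names, and analyzers are represented by
plain name strings so the example runs without Lucene. The lookup drops
trailing subtags one at a time ("en-US", then "en") and returns null when
nothing matches, with no implicit StandardAnalyzer fallback.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch only: this is not an existing Lucene class.
final class LangTagFallback {
    private final Map<String, String> registry = new HashMap<>();

    // Tags are stored lower-cased so that lookups are case-insensitive.
    void register(String tag, String analyzerName) {
        registry.put(tag.toLowerCase(Locale.ROOT), analyzerName);
    }

    // Returns the entry for the tag itself or for its nearest shorter
    // prefix, or null if no prefix of the tag is registered.
    String lookup(String tag) {
        String candidate = tag.toLowerCase(Locale.ROOT);
        while (!candidate.isEmpty()) {
            String hit = registry.get(candidate);
            if (hit != null) {
                return hit;
            }
            int cut = candidate.lastIndexOf('-');
            if (cut < 0) {
                break; // no subtags left to drop
            }
            candidate = candidate.substring(0, cut);
        }
        return null;
    }

    public static void main(String[] args) {
        LangTagFallback tags = new LangTagFallback();
        tags.register("en", "EnglishAnalyzer");
        tags.register("it", "ItalianAnalyzer");
        System.out.println(tags.lookup("en-US")); // falls back to the "en" entry
        System.out.println(tags.lookup("it"));    // exact match
        System.out.println(tags.lookup("eo"));    // nothing registered: null
    }
}
```

Returning null keeps the decision with the caller, who can substitute
StandardAnalyzer or raise an error as the application requires.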

> And I agree with your idea of rfc3066/4646, e.g. you might want to specify
> subtags like "word" (SmartChineseAnalyzer) or "ngram" (CJKAnalyzer) for
> chinese somehow?

Yes, good idea.  Might be interesting to see if those kinds of subtags
can be registered with IANA, too.

Although, if one is smart enough about Lucene and one's application to
make these kinds of judgement calls, I think one is probably smart
enough to know which class to use without consulting a generic
mechanism.

> Shai Erera brought a similar idea up before, to use Locale, but my concerns
> are it would be limited by javas Locale mechanism... but we can figure this
> out.
> 
> Maybe you want to create a JIRA issue to pursue this idea further? See
> http://wiki.apache.org/lucene-java/HowToContribute
> 
> 
> > Right now I'm consulting a hand-compiled mapping of
> > langtag-to-Lucene-classname to figure out which Analyzer to use.
> > Wearisome, and it will be out-of-date for future releases of Lucenen
> > which will presumably support more languages.
> >
> 
> yes, but it also brings up interesting backwards compatibility challenges.
> Because if we add more analyzers, say EsperantoAnalyzer, if you upgrade
> lucene then suddenly your Esperanto queries are analyzed differently
> (whereas they were dealt with by StandardAnalyzer before).

Yes, presumably the Version would need to be used with this, too.
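
One way this could look, as a rough sketch under stated assumptions:
VersionedAnalyzerLookup is a hypothetical name, versions are plain integers
rather than Lucene's Version type, analyzers are represented by name strings,
and EsperantoAnalyzer is the made-up example from the quoted text. Each entry
records the release in which it appeared, and a lookup ignores entries newer
than the version the caller asks to stay compatible with.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: this is not an existing Lucene class.
final class VersionedAnalyzerLookup {
    private static final class Entry {
        final String lang;
        final int sinceVersion;
        final String analyzerName;

        Entry(String lang, int sinceVersion, String analyzerName) {
            this.lang = lang;
            this.sinceVersion = sinceVersion;
            this.analyzerName = analyzerName;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final String fallbackName;

    VersionedAnalyzerLookup(String fallbackName) {
        this.fallbackName = fallbackName;
    }

    void register(String lang, int sinceVersion, String analyzerName) {
        entries.add(new Entry(lang, sinceVersion, analyzerName));
    }

    // Entries introduced after matchVersion are skipped, so an index built
    // against an older release keeps resolving to the same analyzer.
    String lookup(String lang, int matchVersion) {
        for (Entry e : entries) {
            if (e.lang.equals(lang) && e.sinceVersion <= matchVersion) {
                return e.analyzerName;
            }
        }
        return fallbackName;
    }

    public static void main(String[] args) {
        VersionedAnalyzerLookup lookup = new VersionedAnalyzerLookup("StandardAnalyzer");
        // The version numbers 31 and 40 are invented for this example.
        lookup.register("eo", 40, "EsperantoAnalyzer");
        System.out.println(lookup.lookup("eo", 31)); // older behavior is preserved
        System.out.println(lookup.lookup("eo", 40)); // new analyzer once opted in
    }
}
```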

> But this becomes less of a problem as we work on modularizing lucene, so we
> can remove Version from analyzers,

Oh goody, another API change to cope with in my code.

> and so you can just use an old analyzers
> jar file (such as 4.1) but upgrade your lucene core jar to say version 4.3.
> 
> 
> >
> > Secondly, if I've got an instance of a SnowballAnalyzer, there's no way
> > to look "inside" it, and see what language it's for.  That's a problem
> > on the search side.  My QueryParser is a subclass of
> > MultiFieldQueryParser, and it looks for a "special" FieldQuery on the
> > field "_query_language", i.e., "_query_language:de" to tell the query
> > parser to use a German analyzer on this query.  What I'd like to be able
> > to do is interrogate the current analyzer attached to the query parser
> > instance, and throw an exception if it's not for the specified language.
> > I can do this for non-Snowball analyzers, because of the brittle
> > hand-compiled mapping mentioned above.  But if it's a SnowballAnalyzer,
> > there's no way to tell what the language inside it is.  So it would be
> > nice if SnowballAnalyzer grew a method
> >
> 
> SnowballAnalyzer had more problems. its actually deprecated in
> trunk/branch_3x and instead there is an Analyzer for each language (English,
> Italian, etc), which now has stopwords lists, and sometimes special behavior
> (e.g. Turkish lowercases differently).
> 
> Put more simply, its an implementation detail for ItalianAnalyzer that we
> implement the stemming with SnowballFilter. One day we might change it to
> use a less aggressive stemming algorithm (e.g. ItalianLightStemFilter) by
> default.

Ah, good.  That will suit my purposes nicely.

> > I'd really like to see the stopword work finished, so that a
> > SnowballAnalyzer for a particular language has a decent set of
> > stopwords.
> >
> 
> See above, I think this is finished? The remaining work is actually Solr
> integration.

Excellent.  I looked at the JIRA, but some discussions just seem to
peter out, and I'm having a hard time telling what the resolution is.

> In trunk and branch_3x, all the analyzers have their own package, here's
> Italian:
> 
> Source package: contains Analyzer that uses SnowballFilter(Italian) and
> loads Italian snowball stopwords by default. It also includes an
> alternative, less aggressive stemmer.
> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/it/
> 
> The snowball stopwords were all added to the resources directory. This is
> where ItalianAnalyzer loads its set of stopwords from:
> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/

I see there's also an explicit EnglishAnalyzer -- never thought it made
sense to call that StandardAnalyzer.  Great work!

Bill


Re: finding the analyzer for a language...

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen <ja...@parc.com> wrote:

> I thought that since I'm updating UpLib's Lucene code, I should tackle
> the issue of document languages, as well.  Right now I'm using an
> off-the-shelf language identifier, textcat, to figure out which language
> a Web page or PDF is (mainly) written in.  I then want to analyze that
> document with an appropriate analyzer.  I'd then like to map to the
> correct Lucene analyzer for that language, falling back to
> StandardAnalyzer if the installed Lucene library doesn't have an
> analyzer for that language.
>
> It would be *very* handy if Analyzer had a static method
>
>  static Analyzer getAnalyzerForLanguage(String rfc_4646_lang_tag);
>

I agree (not sure if it should be in Analyzer itself, maybe we could make an
Analyzer for this)...
I mean, it sounds like what you want is for it to work in a similar way to
ResourceBundle's fallback mechanism?

And I agree with your idea of rfc3066/4646, e.g. you might want to specify
subtags like "word" (SmartChineseAnalyzer) or "ngram" (CJKAnalyzer) for
Chinese somehow?

Shai Erera brought a similar idea up before, to use Locale, but my concern
is that it would be limited by Java's Locale mechanism... but we can figure
this out.

Maybe you want to create a JIRA issue to pursue this idea further? See
http://wiki.apache.org/lucene-java/HowToContribute

> Right now I'm consulting a hand-compiled mapping of
> langtag-to-Lucene-classname to figure out which Analyzer to use.
> Wearisome, and it will be out-of-date for future releases of Lucenen
> which will presumably support more languages.
>

Yes, but it also brings up interesting backwards compatibility challenges.
If we add more analyzers, say EsperantoAnalyzer, then when you upgrade
Lucene your Esperanto queries are suddenly analyzed differently
(whereas they were dealt with by StandardAnalyzer before).

But this becomes less of a problem as we work on modularizing Lucene, so we
can remove Version from analyzers, and so you can just use an old analyzers
jar file (such as 4.1) but upgrade your Lucene core jar to, say, version 4.3.

>
> Secondly, if I've got an instance of a SnowballAnalyzer, there's no way
> to look "inside" it, and see what language it's for.  That's a problem
> on the search side.  My QueryParser is a subclass of
> MultiFieldQueryParser, and it looks for a "special" FieldQuery on the
> field "_query_language", i.e., "_query_language:de" to tell the query
> parser to use a German analyzer on this query.  What I'd like to be able
> to do is interrogate the current analyzer attached to the query parser
> instance, and throw an exception if it's not for the specified language.
> I can do this for non-Snowball analyzers, because of the brittle
> hand-compiled mapping mentioned above.  But if it's a SnowballAnalyzer,
> there's no way to tell what the language inside it is.  So it would be
> nice if SnowballAnalyzer grew a method
>

SnowballAnalyzer had more problems. It's actually deprecated in
trunk/branch_3x, and instead there is an Analyzer for each language (English,
Italian, etc.), each of which now has a stopword list and sometimes special
behavior (e.g. Turkish lowercases differently).

Put more simply, it's an implementation detail of ItalianAnalyzer that we
implement the stemming with SnowballFilter. One day we might change it to
use a less aggressive stemming algorithm (e.g. ItalianLightStemFilter) by
default.

> I'd really like to see the stopword work finished, so that a
> SnowballAnalyzer for a particular language has a decent set of
> stopwords.
>

See above, I think this is finished? The remaining work is actually Solr
integration.

In trunk and branch_3x, all the analyzers have their own package, here's
Italian:

Source package: contains Analyzer that uses SnowballFilter(Italian) and
loads Italian snowball stopwords by default. It also includes an
alternative, less aggressive stemmer.
http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/it/

The snowball stopwords were all added to the resources directory. This is
where ItalianAnalyzer loads its set of stopwords from:
http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/
-- 
Robert Muir
rcmuir@gmail.com