Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2005/06/10 16:51:33 UTC
[Nutch Wiki] Update of "MultiLingualSupport" by JeromeCharron

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by JeromeCharron:
http://wiki.apache.org/nutch/MultiLingualSupport

The comment on the change is:
First proposal version

New page:
== Multi-Lingual Support in Nutch ==
'''Jérôme Charron'''

'''''10 June 2005'''''

'''DRAFT'''


== Introduction ==
The goal of this proposal is to provide a solution for multi-lingual support in Nutch. Multi-lingual support means being able to use a language-specific [http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Analyzer.html Analyzer] during both analysis (indexing) and searching.

== Configuration ==
This behaviour is configured using the `analyzis.analyzer.impl` and `searcher.analyzer.impl` configuration properties.
Each of these properties can take one of the following values:
 * `default`: means that the default `Analyzer` implementation (i.e. `NutchDocumentAnalyzer`) will be used (same as the current behaviour).
 * `auto`: means that the `Analyzer` implementation used will be determined by the `NutchAnalyzerFactory`.
 * `<classname>`: means that the `Analyzer` implementation used will be the one specified by `<classname>` (this adds the ability to tune Nutch for one specific language only).
__QUESTION__: Does it really make sense to have two configuration properties?

== NutchAnalyzerFactory ==
The `NutchAnalyzerFactory` class is responsible for instantiating the `Analyzer` implementation to use, depending on the Nutch configuration and a specified language code.

=== Implementation Finder ===

The `NutchAnalyzerFactory` policy for instantiating an `Analyzer` is as follows:
 * If the `*.analyzer.impl` configuration parameter equals `default`, then the `NutchAnalyzerFactory` simply returns the standard `NutchDocumentAnalyzer` implementation.
 * If the `*.analyzer.impl` configuration parameter equals `auto`, then the `NutchAnalyzerFactory`:
  * Returns the implementation specified in the `AnalyzerMap.properties` file for the specified language.
  * Returns the standard `NutchDocumentAnalyzer` implementation if the specified language is null, or if no mapping exists in the `AnalyzerMap.properties` file.
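
The policy above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: to stay self-contained it resolves the ''class name'' of the analyzer instead of instantiating a Lucene `Analyzer`. The names `AnalyzerResolver` and `resolve`, the fully qualified name used for `NutchDocumentAnalyzer`, and the treatment of a missing property as `default` are all hypothetical. The explicit class name case follows the description given in the Configuration section.
{{{
import java.util.Properties;

/** Hypothetical helper: decides which Analyzer class name to use. */
class AnalyzerResolver {

    // Assumed fully qualified name of the standard Nutch analyzer.
    static final String DEFAULT_ANALYZER =
        "org.apache.nutch.analysis.NutchDocumentAnalyzer";

    /**
     * @param impl value of the *.analyzer.impl property
     * @param lang language code of the document or query, may be null
     * @param map  contents of AnalyzerMap.properties
     * @return the class name of the Analyzer to use
     */
    static String resolve(String impl, String lang, Properties map) {
        // Assumption: an unset property behaves like "default".
        if (impl == null || "default".equals(impl)) {
            return DEFAULT_ANALYZER;
        }
        if ("auto".equals(impl)) {
            // Null language: fall back to the standard analyzer.
            if (lang == null) {
                return DEFAULT_ANALYZER;
            }
            // Missing mapping: getProperty returns the fallback value.
            return map.getProperty(lang + ".analyzer", DEFAULT_ANALYZER);
        }
        // Any other value is taken to be an explicit class name.
        return impl;
    }
}
}}}
A real factory would then load the returned class and create an instance of it. That step is left out here because it depends on how Nutch loads its classes.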

=== AnalyzerMap.properties ===
This properties file maintains the mapping between language codes and the `Analyzer` implementation to use:
{{{
fr.analyzer=org.apache.lucene.analysis.fr.FrenchAnalyzer
de.analyzer=org.apache.lucene.analysis.de.GermanAnalyzer
...
}}}
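
As an illustration of how such a file could be read, here is a minimal sketch using the standard `java.util.Properties` class. The class and method names (`AnalyzerMapExample`, `parse`, `analyzerFor`) are hypothetical, and the text is parsed from a string only to keep the example self-contained. In Nutch the file would presumably be loaded from the configuration directory or the classpath.
{{{
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical example of reading an AnalyzerMap.properties file. */
class AnalyzerMapExample {

    /** Parses properties-file text into a Properties object. */
    static Properties parse(String text) {
        Properties map = new Properties();
        try {
            map.load(new StringReader(text));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return map;
    }

    /** Returns the mapped class name, or null if there is no mapping. */
    static String analyzerFor(Properties map, String lang) {
        if (lang == null) {
            return null;
        }
        return map.getProperty(lang + ".analyzer");
    }
}
}}}
A null result here corresponds to the case where the factory falls back to the standard analyzer.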

== Analysis ==

The language-specific analysis is based on the result of the LanguageIdentifierPlugin.

The only impact on the analysis code is in the part of `IndexSegment` that adds a document to the index:
{{{
indexWriter.addDocument(doc);
}}}
should be replaced by
{{{
indexWriter.addDocument(doc, NutchAnalyzerFactory.get(doc.get("lang")));
}}}
so that the `IndexWriter` is called with the appropriate `Analyzer` implementation for the document's language.

== Searcher ==

The language-specific searcher will be based on a `lang` attribute, as for the analysis. In this case, however, the `lang` attribute must be retrieved from the front-end using the following policy:
 1. Use the optional `lang` attribute provided by the search interface.
 2. If no such attribute is provided by the search interface, use the browser language.
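
A minimal sketch of this selection policy is given below. The names `SearchLanguage` and `resolve` are hypothetical, as are the decisions to trim and lower-case the attribute and to return null when no language is known. In a servlet-based front-end, the browser locale would typically be obtained from `ServletRequest.getLocale()`.
{{{
import java.util.Locale;

/** Hypothetical helper: picks the language code for a search request. */
class SearchLanguage {

    /**
     * @param langParam     optional lang attribute from the search interface
     * @param browserLocale preferred locale of the browser, may be null
     * @return a language code, or null if none can be determined
     */
    static String resolve(String langParam, Locale browserLocale) {
        // 1. An explicit lang attribute takes precedence.
        if (langParam != null && !langParam.trim().isEmpty()) {
            return langParam.trim().toLowerCase(Locale.ROOT);
        }
        // 2. Otherwise use the browser language, if it is known.
        if (browserLocale != null && !browserLocale.getLanguage().isEmpty()) {
            return browserLocale.getLanguage();
        }
        // Assumption: null means "no language known".
        return null;
    }
}
}}}
Returning null in the last case fits the factory policy described earlier, where a null language leads to the standard analyzer.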

The only impact on the searcher code is the addition of the following method to the `Query` class:
{{{
public static Query parse(String queryString, String lang) throws IOException;
}}}
This method then uses the `NutchAnalyzerFactory` to retrieve the analyzer to use for parsing the specified query.