Posted to dev@nutch.apache.org by Teruhiko Kurosaka <Ku...@basistech.com> on 2006/05/19 21:14:19 UTC
Status of language plugin
Hello Jérôme,
Because of other issues at work, I was away from Nutch
for a while. Now I'm back, and I see from your notes in
JIRA that you are making progress.
Is there an API doc or design doc that I can read to
understand where you are? Is the language plugin architecture
already in the main trunk?
Here are some issues that I've been worried about:
* Support for a multilingual plugin?
** If one plugin can support more than one language,
the language needs to be passed in at each analysis.
** This assumes language identification is done before
analysis. Is that the case?
* Support for a different analyzer for queries than for indexing
** The analyzer for queries may need to behave differently
from the analyzer for indexing. Can your architecture
specify different analyzers for indexing and for queries?
Thanks.
-kuro
Re: Status of language plugin
Posted by Jérôme Charron <je...@gmail.com>.
> Is there an API doc or design doc that I can read to
> understand where you are? Is the language plugin architecture
> already in the main trunk?
The only available document is
http://wiki.apache.org/nutch/MultiLingualSupport
and I occasionally update this page:
http://wiki.apache.org/nutch/JeromeCharron
> Here are some issues that I've been worried about:
> * Support for a multilingual plugin?
> ** If one plugin can support more than one language,
> the language needs to be passed in at each analysis.
I don't understand your need.
But if you have an analysis plugin that can handle several languages, you
can simply define several implementations in your plugin XML, e.g.
<extension id="org.apache.nutch.analysis.cjk"
           name="CJKAnalyzer"
           point="org.apache.nutch.analysis.NutchAnalyzer">
  <implementation id="org.apache.nutch.analysis.cn.ChineseAnalyzer"
                  class="org.apache.nutch.analysis.cjk.CJKAnalyzer">
    <parameter name="lang" value="cn"/>
  </implementation>
  <implementation id="org.apache.nutch.analysis.kr.KoreanAnalyzer"
                  class="org.apache.nutch.analysis.cjk.CJKAnalyzer">
    <parameter name="lang" value="kr"/>
  </implementation>
  <implementation id="org.apache.nutch.analysis.jp.JapaneseAnalyzer"
                  class="org.apache.nutch.analysis.cjk.CJKAnalyzer">
    <parameter name="lang" value="jp"/>
  </implementation>
</extension>
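To illustrate the idea behind such a descriptor, here is a minimal Java sketch of one analyzer class registered once per language and looked up by the identified language code. It assumes the framework matches the document's language against the "lang" parameter. All names in it (TextAnalyzer, MultiLangAnalyzer, AnalyzerRegistry) are hypothetical stand-ins and not the Nutch API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for an analyzer; NOT the real NutchAnalyzer API. */
interface TextAnalyzer {
    String describe();
}

/** One class serving several languages, configured by a "lang" value. */
final class MultiLangAnalyzer implements TextAnalyzer {
    private final String lang;

    MultiLangAnalyzer(String lang) {
        this.lang = lang;
    }

    @Override
    public String describe() {
        return "MultiLangAnalyzer[" + lang + "]";
    }
}

/** Hypothetical registry: maps a language code to an analyzer instance. */
final class AnalyzerRegistry {
    private final Map<String, TextAnalyzer> byLang = new HashMap<>();

    /** Mirrors one <implementation> entry with its "lang" parameter. */
    void register(String lang, TextAnalyzer analyzer) {
        byLang.put(lang.toLowerCase(Locale.ROOT), analyzer);
    }

    /** Returns the analyzer for the identified language, or the fallback. */
    TextAnalyzer lookup(String lang, TextAnalyzer fallback) {
        if (lang == null) {
            return fallback;
        }
        return byLang.getOrDefault(lang.toLowerCase(Locale.ROOT), fallback);
    }
}

final class AnalyzerRegistryDemo {
    public static void main(String[] args) {
        AnalyzerRegistry registry = new AnalyzerRegistry();
        // Same class registered three times, once per language code.
        for (String lang : new String[] {"cn", "kr", "jp"}) {
            registry.register(lang, new MultiLangAnalyzer(lang));
        }
        TextAnalyzer fallback = new MultiLangAnalyzer("default");
        System.out.println(registry.lookup("kr", fallback).describe());
        System.out.println(registry.lookup("fr", fallback).describe());
    }
}
```

Because the language is fixed when each instance is created, it does not have to be passed again on every analysis call, which is the point of declaring one implementation per language.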
> ** This assumes language identification is done before
> analysis. Is that the case?
Yes.
> * Support for a different analyzer for queries than for indexing
> ** The analyzer for queries may need to behave differently
> from the analyzer for indexing. Can your architecture
> specify different analyzers for indexing and for queries?
No. To avoid adding a QueryAnalyser extension point,
the Query uses the same Analyzer implementation as the one
used for document analysis.
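
As a rough illustration of what sharing one analyzer means, the toy Java sketch below runs the same analyze() step over documents at index time and over the query at search time, so both sides produce comparable tokens. The class and method names are hypothetical, and the lowercase-and-split analysis is an assumption made for the example, not what Nutch or Lucene actually does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy example only: one analysis function shared by indexing and querying. */
final class SharedAnalyzerDemo {

    /** Toy analysis: lowercase, then split on anything that is not a letter or digit. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Builds token -> document ids, using analyze() on each document. */
    static Map<String, Set<Integer>> index(List<String> docs) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : analyze(docs.get(id))) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    /** Returns ids of documents containing every query token, using the same analyze(). */
    static Set<Integer> search(Map<String, Set<Integer>> postings, String query) {
        Set<Integer> result = null;
        for (String token : analyze(query)) {
            Set<Integer> hits = postings.getOrDefault(token, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? Collections.<Integer>emptySet() : result;
    }
}
```

A query analyzer that tokenized or normalized differently from the index analyzer could produce tokens that never appear in the index, which is the trade-off to keep in mind if a separate query analyzer is ever added.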
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/