Posted to dev@nutch.apache.org by lucuser4851 <lu...@log1.net> on 2005/06/10 19:43:00 UTC

Re: [Nutch-dev] Multi-Lingual support

Hi Jerome,
This is an interesting proposal. One thought that comes to mind is that
analysis involving stemming or stop-word removal does represent a loss
of information.

If you store only the language-specific stemmed form of the document,
you might not be able to search for specific word forms, and you might get
false positives for an exact quoted search that contains stop words.
Nor could you do a language-neutral search.
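
To make the loss of information concrete, here is a rough sketch. The
suffix-stripping stemmer and the stop list are toy stand-ins invented for
illustration; they are not the analysers Nutch or Lucene actually use.

```python
# Toy illustration only: a crude suffix stripper and a tiny stop list,
# both hypothetical, standing in for a real language-specific analyser.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of"}


def toy_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyse(text):
    """Lower-case, drop stop words, and stem what is left."""
    return [toy_stem(w) for w in text.lower().split() if w not in STOP_WORDS]


# Distinct word forms collapse to the same token.
print(analyse("searching"), analyse("searched"))

# Two different phrases become indistinguishable once stop words are gone.
print(analyse("the house of cards"), analyse("house cards"))

# A query made only of stop words leaves nothing to match on.
print(analyse("to be or not to be"))
```

With only the analysed tokens stored, an exact search for "searched" or
for the quoted phrase "the house of cards" has nothing left to tell it
apart from its neighbours.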

There might be two ways you could tackle this. One would be to store
both a version of the document analysed in a language-specific way and a
version analysed with the standard analyser, but this would make the index
a lot bigger.
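
A minimal sketch of the two-versions idea, assuming a toy in-memory
inverted index with one "exact" field and one "stemmed" field. The field
names, the `build_index` helper and the stemmer are all hypothetical, not
Nutch or Lucene APIs.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a language-specific stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Index every document twice: once as-is, once stemmed."""
    index = {"exact": defaultdict(set), "stemmed": defaultdict(set)}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index["exact"][word].add(doc_id)
            index["stemmed"][toy_stem(word)].add(doc_id)
    return index


docs = {1: "searching the web", 2: "he searched the index"}
index = build_index(docs)

# Exact field: only the document containing this precise word form.
print(index["exact"].get("searching"))

# Stemmed field: both word forms are found through the shared root.
print(index["stemmed"].get("search"))

# The cost: every token is posted twice, once per field.
postings = {f: sum(len(ids) for ids in terms.values()) for f, terms in index.items()}
print(postings)
```

Exact and quoted searches would go to the first field and stemmed
searches to the second, at the price of roughly doubling the postings.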

The other alternative might be to make no change to the index, storing
just the standard-analysed document, and to do some kind of
query expansion at query time: stem the query term(s) with the
language-specific analyser, look up the terms in the index that stem to
the same root, and then query using those terms rather than the original
query. I am not sure if I am being totally clear...
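
In case it helps, the query-expansion idea could be sketched roughly as
follows. This assumes the list of index terms can be enumerated and
grouped by stem; `build_stem_map`, `expand_query` and the toy stemmer are
hypothetical names made up for this example, not anything in Nutch.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a language-specific stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_map(index_terms):
    """Group the unstemmed terms found in the index by their stem."""
    stem_map = defaultdict(set)
    for term in index_terms:
        stem_map[toy_stem(term)].add(term)
    return stem_map


def expand_query(term, stem_map):
    """Return every index term sharing the query term's stem (to be OR-ed)."""
    term = term.lower()
    variants = stem_map.get(toy_stem(term), set())
    return sorted(variants | {term})


# Terms as they appear in the unchanged, standard-analysed index.
index_terms = ["search", "searched", "searching", "searches", "index"]
stem_map = build_stem_map(index_terms)

print(expand_query("searching", stem_map))
print(expand_query("index", stem_map))
```

The index itself stays untouched, so exact and language-neutral searches
keep working; the extra work moves to query time, and the stem map would
need rebuilding whenever the set of index terms changes.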

Just my two cents....

On Fri, 2005-06-10 at 17:02 +0200, Jérôme Charron wrote:
> I was thinking about it for a while: Multi-Lingual support in Nutch.
> After looking at the Nutch code, I wrote a proposal on the Wiki (
> http://wiki.apache.org/nutch/MultiLingualSupport).
> Since I'm not yet an expert of the Nutch core (hope to become one), this 
> mail is a kind of request for comments about the proposal.
> 
> Thanks,
> 
> Jerome

-- 
lucuser4851 <lu...@log1.net>