Posted to dev@nutch.apache.org by Jérôme Charron <je...@gmail.com> on 2005/08/17 15:29:02 UTC

Re: Language Identifier in Nutch

Hi Olena

> I'm currently starting my work with Nutch. My goal is to have a topic
> specific (or at least language specific) crawler tool. Is it possible
> to apply the LanguageIdentifier plugin on webpages that are not yet
> fetched, so that e.g. only French or German pages are crawled?

No. The reason is simple: the LanguageIdentifier needs to analyze the
fetched content of a document to determine the document's language.

> Do you
> know, where I can find information on how to create and use my own
> n-gram models from my training corpus?

There is a lot of material on this subject on the Web (feel free to use
Google).
If you are fluent in French, here's a short presentation I wrote on my blog:
http://motrech.blogspot.com/2005/02/science-do-you-habla-franzsisch.html

> If you have any source codes in Java, which I could use as examples -
> it would be a great help for me.

For instance, look at Nutch's LanguageIdentifier plugin:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/
The NGramProfile class can be run from the command line to create
NGramProfiles (ngp files).
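
To give a rough idea of what building an n-gram profile from a training
text involves, here is a minimal Java sketch. It is not the code of Nutch's
NGramProfile class: the class name NGramProfileSketch, the method names, the
'_' word padding and the 1-to-3 character n-gram range are all assumptions
made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class, not part of Nutch.
public class NGramProfileSketch {

    /**
     * Counts character n-grams of length minN..maxN in the given text.
     * Each word is padded with '_' on both sides (an assumed convention)
     * so that word beginnings and endings are counted as well.
     */
    static Map<String, Integer> countNGrams(String text, int minN, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a Unicode letter.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            String padded = "_" + word + "_";
            for (int n = minN; n <= maxN; n++) {
                for (int i = 0; i + n <= padded.length(); i++) {
                    counts.merge(padded.substring(i, i + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /**
     * Returns the topK most frequent n-grams, most frequent first.
     * Ties are broken alphabetically so the result is deterministic.
     */
    static List<String> topNGrams(Map<String, Integer> counts, int topK) {
        List<String> grams = new ArrayList<>(counts.keySet());
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return grams.subList(0, Math.min(topK, grams.size()));
    }

    public static void main(String[] args) {
        // Tiny stand-in for a real training corpus.
        Map<String, Integer> counts = countNGrams("le chat et le chien", 1, 3);
        System.out.println(topNGrams(counts, 10));
    }
}
```

A real profile would be built from a much larger corpus per language, and
an unknown document would then be compared against each language's ranked
list of n-grams.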

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/