Posted to dev@nutch.apache.org by Jérôme Charron <je...@gmail.com> on 2005/08/17 15:29:02 UTC
Re: Language Identifier in Nutch
Hi Olena,
> I'm currently starting my work with Nutch. My goal is to have a topic
> specific (or at least language specific) crawler tool. Is it possible
> to apply the LanguageIdentifier plugin on webpages that are not yet
> fetched, so that e.g. only French or German pages are crawled?
No. The reason is simple: the LanguageIdentifier has to analyze the fetched
content of a document to determine the document's language.
> Do you
> know, where I can find information on how to create and use my own
> n-gram models from my training corpus?
There is a lot of material on this subject on the Web (feel free to use
Google).
If you are fluent in French, here is a short presentation I wrote on my blog:
http://motrech.blogspot.com/2005/02/science-do-you-habla-franzsisch.html
> If you have any source codes in Java, which I could use as examples -
> it would be a great help for me.
You can look at Nutch's LanguageIdentifier plugin, for instance:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/
The NGramProfile class can be used on the command line to create some
NGramProfiles (ngp files)
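To give a rough idea of what building and using an n-gram model involves, here is a minimal sketch in plain Java. It is not the code of Nutch's NGramProfile class. The class name NGramProfileSketch, its method names, the '_' word-boundary marker, the tie-breaking rule, the profile size and the two tiny training sentences are all assumptions made for illustration. It ranks character n-grams by frequency and compares two ranked profiles with a simple "out-of-place" rank distance, which is one common way to do n-gram language guessing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class NGramProfileSketch {

    /** Counts every character n-gram of length minLen..maxLen in the text. */
    static Map<String, Integer> countNGrams(String text, int minLen, int maxLen) {
        // Assumption: lower-case the text and collapse every run of
        // non-letters (including the added padding) into one '_' marker.
        String normalized = ("_" + text.toLowerCase(Locale.ROOT) + "_")
                .replaceAll("[^\\p{L}]+", "_");
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minLen; n <= maxLen; n++) {
            for (int i = 0; i + n <= normalized.length(); i++) {
                counts.merge(normalized.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the 'size' most frequent n-grams, most frequent first. */
    static List<String> rankedProfile(String text, int minLen, int maxLen, int size) {
        Map<String, Integer> counts = countNGrams(text, minLen, maxLen);
        List<String> ngrams = new ArrayList<>(counts.keySet());
        // Sort by descending count; break ties alphabetically so the
        // ranking is deterministic.
        ngrams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(ngrams.subList(0, Math.min(size, ngrams.size())));
    }

    /** Sums the rank differences; an n-gram missing from langProfile costs its full size. */
    static int outOfPlace(List<String> docProfile, List<String> langProfile) {
        int maxPenalty = langProfile.size();
        int distance = 0;
        for (int i = 0; i < docProfile.size(); i++) {
            int j = langProfile.indexOf(docProfile.get(i));
            distance += (j < 0) ? maxPenalty : Math.abs(i - j);
        }
        return distance;
    }

    public static void main(String[] args) {
        // Toy training data; a real model needs a much larger corpus per language.
        List<String> french = rankedProfile(
                "le chat est sur la table et le chien est dans la maison", 1, 3, 50);
        List<String> german = rankedProfile(
                "die katze ist auf dem tisch und der hund ist in dem haus", 1, 3, 50);

        // The document text is required here, which is why the page
        // has to be fetched before its language can be guessed.
        List<String> doc = rankedProfile("la maison est sur la colline", 1, 3, 50);

        System.out.println("distance to French profile: " + outOfPlace(doc, french));
        System.out.println("distance to German profile: " + outOfPlace(doc, german));
    }
}
```

The language whose profile gives the smallest distance is the guess. With training texts this small the result is not reliable; the sketch only shows the mechanics.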
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/