Posted to user@nutch.apache.org by KishoreKumar Bairi <pr...@gmail.com> on 2008/05/30 10:22:09 UTC

Does nutch serve my purpose?

Hi,


I'm new to Nutch, and I don't yet know whether it serves my purpose.

I'm working on a machine learning problem for which I need a corpus. The
required dataset is not available, but it can be obtained by crawling the
web. My requirements are as follows:

The crawler should follow only links that match a certain pattern
(www.domain.com/id), and it should fetch only specific data from each
crawled page instead of the entire page content: say
<div id="reqd1"></div> from a page of pattern1 and <div id="reqd2"></div>
from a page of pattern2. The two pieces should then be merged, and the
result is one example of the data I require. Likewise, I need a few
hundred to a few thousand such pages (examples).
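
To make the requirement concrete, here is a rough sketch of the filter,
extract and merge steps in plain Python (standard library only). It is not
Nutch code. The URL regex assumes that "id" is numeric, and the names
should_crawl, extract_div_text and build_example are made up for
illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: "id" in www.domain.com/id is a number.
URL_PATTERN = re.compile(r"^https?://www\.domain\.com/\d+/?$")


def should_crawl(url):
    """Return True only for links that match the wanted pattern."""
    return URL_PATTERN.match(url) is not None


class DivTextExtractor(HTMLParser):
    """Collects the text inside the first <div> with the given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0       # 0 = outside the target div, >0 = inside it
        self.done = False    # True once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1  # nested <div> inside the target
        elif dict(attrs).get("id") == self.div_id:
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_div_text(html, div_id):
    """Return the whitespace-normalised text of <div id=div_id>."""
    parser = DivTextExtractor(div_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def build_example(page1_html, page2_html):
    """Merge the two required fragments into one training example."""
    return {
        "reqd1": extract_div_text(page1_html, "reqd1"),
        "reqd2": extract_div_text(page2_html, "reqd2"),
    }
```

The fetching itself is left out on purpose; whatever crawler is used would
only need to call should_crawl on each link and build_example on each pair
of downloaded pages.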

Finally, all the fetched text should be stored in some kind of database
or in XML files, so that I can use it to train my program.
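
For the XML option, something along these lines would be enough for me. It
is only a sketch using Python's xml.etree.ElementTree; the element names
"corpus", "example" and "field" and the function examples_to_xml are my
own assumptions, not an existing format.

```python
import xml.etree.ElementTree as ET


def examples_to_xml(examples):
    """Serialise a list of {field name: text} dicts into one XML string."""
    root = ET.Element("corpus")
    for number, example in enumerate(examples, start=1):
        node = ET.SubElement(root, "example", {"n": str(number)})
        for name, text in example.items():
            # Field names go into an attribute so that any name is allowed.
            ET.SubElement(node, "field", {"name": name}).text = text
    return ET.tostring(root, encoding="unicode")
```

Each merged example becomes one <example> element, so the training program
can read the file back with ET.fromstring or ET.parse.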

Could anyone please tell me whether Nutch is the right choice for me? If
not, what would be the best way to accomplish my task?


Regards,
KishoreKumar.