Posted to user@nutch.apache.org by bruce <be...@earthlink.net> on 2006/09/19 16:47:16 UTC

nutch questions...

hi...

the docs/sites i've seen tell me that nutch is a crawling app, that i can
specify sites, as well as constraints on the sites (to contain the crawling
within the site). however, i haven't seen any docs that state how to
actually parse information from sites that are behind forms...

here's my targeted goal:
 to have an app that i can point to a section of a site
 to iteratively process through the site (and the descendant
   sections of the site)
 to handle form processing
 to be able to then use XPath queries (or something similar)
   to extract the information that i need from the given
   pages
 each targeted site will be different regarding the
   layout/structure, so i'm going to need to be able to have
   some kind of "plugin" approach to handle the fine-grained
   data processing/extraction.
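
to make the "plugin" idea concrete, here is a rough sketch of per-site
extraction rules applied with XPath. it is plain Python and uses no nutch
APIs at all. the site names, field names, and XPath expressions are made up
for illustration, and it assumes each fetched page has already been turned
into well-formed XHTML (real-world HTML usually needs a cleanup step first).
crawling and form submission are left out.

```python
# Sketch only: plain Python standard library, no Nutch APIs.
# Hosts, field names, and XPath expressions are hypothetical.
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# One "plugin" per target site: field name -> XPath expression
# (ElementTree supports only a limited XPath subset).
SITE_RULES = {
    "example.com": {
        "title": ".//h1",
        "price": ".//span[@class='price']",
    },
    "example.org": {
        "title": ".//div[@id='name']",
        "price": ".//td[@class='cost']",
    },
}


def extract(url, xhtml):
    """Pick the rule set for the URL's host and apply each XPath to the page."""
    host = urlparse(url).hostname
    rules = SITE_RULES.get(host)
    if rules is None:
        raise KeyError(f"no extraction rules registered for {host}")
    root = ET.fromstring(xhtml)  # assumes well-formed XHTML
    result = {}
    for field, path in rules.items():
        node = root.find(path)  # first matching element, or None
        has_text = node is not None and node.text
        result[field] = node.text.strip() if has_text else None
    return result


page = "<html><body><h1> Widget </h1><span class='price'>9.99</span></body></html>"
print(extract("http://example.com/catalog/1", page))
```

supporting a new site would then mean adding one more entry to SITE_RULES
rather than changing the extraction code, which is the kind of separation
i'm after.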

is there anyone who's doing anything close to this with nutch that i can
talk to to get a feel for the difficulty of using nutch in this regard?

thanks