Posted to user@nutch.apache.org by bruce <be...@earthlink.net> on 2006/09/22 00:06:59 UTC

basic nutch questions...

hi...

this is a continuation of some basic questions i have about nutch that i've
never really been able to get answered. perhaps these questions will
stimulate a dialogue/answers that might be useful to others.

i'm considering a crawling kind of application that will be used to extract
information from various sites. the goal is a screen scraping/data
extraction app that will target ~5000 sites. each site has a different
layout, which seems to imply that there is no really generalizable solution
to extracting/parsing the information.

my question is whether nutch can play a role in creating a solution to this
issue.

the needs as i see them are to be able to:

-crawl through forms
-handle sites/forms that require a user/passwd login
-handle the unique layout of the various sites
-possibly handle/incorporate XPath/Dom extraction functionality
-interface with database for storing/maintaining extracted data
-potentially handle/use plugins for various sites in order to
 handle the data parsing/extraction
-somehow be able to allow for testing of the data that's
 parsed/extracted from the sites
 (we need to know that we actually have the correct data!!!)
-be able to reparse/extract the data from the sites...
-be able to run in a parallel/distributed manner
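
to make a few of these needs concrete (per-site plugins, xpath-style
extraction, testing the extracted data, storing it in a db), here's a rough
sketch in plain python. it is not nutch code and doesn't use any nutch api.
the registry, the site name "example-site", the table layout and the
name/price fields are all made up for illustration, and it assumes the page
is well-formed xhtml so the standard library's limited xpath support can be
used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical registry: maps a site name to its extraction function.
EXTRACTORS = {}

def extractor(site):
    """Register a per-site extraction function under a site name."""
    def register(func):
        EXTRACTORS[site] = func
        return func
    return register

@extractor("example-site")
def extract_example(page):
    # Assumption: the page is well-formed XHTML. Real-world HTML would need
    # a tolerant parser before XPath-style queries could be applied.
    root = ET.fromstring(page)
    return [{"name": row.findtext("td[1]"), "price": row.findtext("td[2]")}
            for row in root.findall(".//table[@id='items']/tr")]

def validate(record):
    """Return a list of problems found in one extracted record."""
    problems = []
    if not record.get("name"):
        problems.append("missing name")
    try:
        float(record.get("price") or "")
    except ValueError:
        problems.append("price is not a number")
    return problems

def store(conn, site, records):
    """Write validated records to a (made-up) items table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (site TEXT, name TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?, ?)",
        [(site, r["name"], float(r["price"])) for r in records])
    conn.commit()

def run(site, page, conn):
    """Extract, validate, and store; return (good, rejected) records."""
    good, rejected = [], []
    for record in EXTRACTORS[site](page):
        problems = validate(record)
        (rejected if problems else good).append((record, problems))
    store(conn, site, [record for record, _ in good])
    return good, rejected

# Made-up sample page: the second row carries a bad price on purpose.
SAMPLE_PAGE = """<html><body><table id="items">
<tr><td>widget</td><td>9.50</td></tr>
<tr><td>gadget</td><td>n/a</td></tr>
</table></body></html>"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    good, rejected = run("example-site", SAMPLE_PAGE, conn)
    print("stored:", good)
    print("rejected:", rejected)
```

keeping the raw page around and re-running run() on it would be one way to
cover the reparse/re-extract need without refetching.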

an example run against one of the targeted sites would be:
 -go to main site...
  -parse main site, extracting the list of items(1) that form
   the child urls to parse
   -write the list of items(1) to the DB
   -create the urls for the items
   -fetch the pages for the urls, and parse the page content
    -for each page, get the 2nd list of items(2)....
     -repeat the process...

in this example, the items are derived from drop down lists
 that are used to generate the url for the post/form in the
 site...
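
the walk described above could be sketched roughly like this, again in
plain python rather than nutch. everything named here is an assumption for
illustration: the urls, the "state"/"city" field names, and the fake_fetch
stand-in for a real http fetch. it also assumes the form can be driven with
a GET query string; a site that only accepts POST would need the same
values sent in the request body instead.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

class SelectOptions(HTMLParser):
    """Collect the option values of one named <select> element."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.in_select = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self.in_select = attrs.get("name") == self.select_name
        elif tag == "option" and self.in_select and attrs.get("value"):
            # Options with an empty value (e.g. a "pick one" prompt) are skipped.
            self.values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_select = False

def option_values(html, select_name):
    """Return the non-empty option values of the named drop-down list."""
    parser = SelectOptions(select_name)
    parser.feed(html)
    return parser.values

def crawl(fetch, base_url, levels, params=None, saved=None):
    """Walk nested drop-down lists depth-first.

    levels: names of the select fields, outermost first.
    saved:  list standing in for the database of extracted items.
    """
    params = params or {}
    saved = [] if saved is None else saved
    url = base_url + ("?" + urlencode(params) if params else "")
    page = fetch(url)
    if not levels:
        return saved  # leaf page: real content parsing would happen here
    for value in option_values(page, levels[0]):
        saved.append((levels[0], value))  # stand-in for "write to the DB"
        crawl(fetch, base_url, levels[1:], {**params, levels[0]: value}, saved)
    return saved

# Made-up pages keyed by url, so the sketch runs without network access.
PAGES = {
    "http://site.example/search":
        '<select name="state"><option value="">pick</option>'
        '<option value="CA">California</option>'
        '<option value="NY">New York</option></select>',
    "http://site.example/search?state=CA":
        '<select name="city"><option value="LA">Los Angeles</option></select>',
    "http://site.example/search?state=NY":
        '<select name="city"><option value="NYC">New York City</option>'
        '<option value="BUF">Buffalo</option></select>',
}

def fake_fetch(url):
    return PAGES.get(url, "")

if __name__ == "__main__":
    for item in crawl(fake_fetch, "http://site.example/search", ["state", "city"]):
        print(item)
```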

so how can nutch play a role in something like this?

i've looked at various articles/forums, and thought it might make sense to
repost the questions here.

thanks