Posted to user@nutch.apache.org by bruce <be...@earthlink.net> on 2006/09/22 00:06:59 UTC
basic nutch questions...
hi...
this is a continuation of some basic questions that i have regarding nutch
that i've never been able to really get answered. perhaps the question will
stimulate a dialogue/answers that might be useful to others.
i'm considering a crawling application that will be used to extract
information from various sites. the goal is a screen scraping/data
extraction app that will target ~5000 sites. each site has a different
layout, which seems to imply that there is no really generalizable solution
to extracting/parsing the information.
my question is whether nutch can play a role in solving this problem.
the needs as i see them are to be able to:
-crawl through forms
-handle sites/forms that require a user/passwd login
-handle the unique layout of the various sites
-possibly handle/incorporate XPath/DOM extraction functionality
-interface with database for storing/maintaining extracted data
-potentially handle/use plugins for various sites in order to
handle the data parsing/extraction
-somehow be able to allow for testing of the data that's
parsed/extracted from the sites
(we need to know that we actually have the correct data!!!)
-be able to reparse/extract the data from the sites...
-be able to run in a parallel/distributed manner
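to make the per-site plugin and testing points above a bit more concrete, here is a rough sketch in python. it is not nutch code and does not use nutch's plugin api. every name in it (OptionCollector, EXTRACTORS, extract_for, check_extractor, the example.com host and the "state" field) is made up for illustration. the idea is one extractor per site, plus a check that re-runs the extractor on a saved copy of a page and compares against known-good values:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class OptionCollector(HTMLParser):
    """Collect the value attribute of every <option> inside one named <select>."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.in_target = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self.in_target = attrs.get("name") == self.select_name
        elif tag == "option" and self.in_target and attrs.get("value"):
            # Skip placeholder options that have an empty value.
            self.values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_target = False


def extract_options(html, select_name):
    collector = OptionCollector(select_name)
    collector.feed(html)
    return collector.values


# Hypothetical registry: one extractor function per site host.
EXTRACTORS = {
    "example.com": lambda html: extract_options(html, "state"),
}


def extract_for(url, html):
    """Pick the extractor registered for the url's host and run it."""
    host = urlparse(url).hostname
    extractor = EXTRACTORS.get(host)
    if extractor is None:
        raise KeyError(f"no extractor registered for {host}")
    return extractor(html)


def check_extractor(url, saved_html, expected):
    """Re-run an extractor on a saved page and compare with known-good values."""
    return extract_for(url, saved_html) == expected
```

with ~5000 sites the registry would presumably be loaded from config or from one module per site rather than written inline, and the saved pages would double as the input for reparsing later.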
an example run against one of the targeted sites would be:
-go to main site...
-parse main site, extracting the list of items(1) that form
the child urls to parse
-write the list of items(1) to the DB
-create the urls for the items
-fetch the pages for the urls, and parse the page content
-for each page, get the 2nd list of items(2)....
-repeat the process...
in this example, the items are derived from drop down lists
that are used to generate the url for the post/form in the
site...
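the run described above could be sketched roughly as follows, again in plain python rather than anything nutch-specific. the names crawl and form_url, the items table and the fetch/extract/build_url callables are all hypothetical. fetch is passed in so the loop can be tried without a network, and form_url builds a GET-style url as a simplification (a real POST form would need a different fetch):

```python
import sqlite3
from collections import deque
from urllib.parse import urlencode


def form_url(base_url, field, value):
    """Build a GET-style url for one drop-down value (simplification of a form post)."""
    return base_url.split("?")[0] + "?" + urlencode({field: value})


def crawl(start_url, fetch, extract, build_url, db, max_depth=2):
    """Breadth-first walk: fetch a page, extract its items, store them,
    then turn each item into a child url and repeat up to max_depth levels."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(depth INTEGER, parent_url TEXT, item TEXT)"
    )
    queue = deque([(start_url, 1)])
    seen = {start_url}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        for item in extract(html, depth):
            # Write each extracted item to the DB before following it.
            db.execute("INSERT INTO items VALUES (?, ?, ?)", (depth, url, item))
            child = build_url(url, item, depth)
            if depth < max_depth and child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    db.commit()
```

calling it with an in-memory database (sqlite3.connect(":memory:")) and a fetch that reads from a dict of saved pages is enough to exercise the loop. running it in a parallel/distributed manner is not covered by this sketch.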
so how can nutch play a role in something like this?
i've looked at various articles/forums, and thought it might make sense to
repost the questions here.
thanks