Posted to user@nutch.apache.org by Marko Asplund <ma...@gmail.com> on 2015/03/25 08:54:00 UTC

Custom crawling application design questions

Hi,

I'm designing an application that needs to extract and analyse content and
metadata for a selected set of web pages. The page URLs are generated by an
existing application component based on information collected in a
database. After initial page analysis the application should be able to
detect changes in the pages and redo the analysis, as needed. The number of
pages is expected to grow gradually up to about 20 million.

The application considers certain subsets of URLs to be related and these
sets should be analysed and processed together when fetched content is
available. Typically, the number of URLs per subset will be from 10 to 20,
and they'll often be located on the same website. For each URL subset the
application will decide which links to follow and fetch. The application
needs to extract information from the content; it could do this either in
the same parsing pass as Nutch or in a separate parsing step.
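
One idea I've had for representing the subsets (please tell me if there is
a better way) is to tag each injected URL with a subset identifier, using
the per-URL metadata that the 1.x injector accepts as tab-separated
key=value pairs after the URL. The subsetId key below is just a placeholder
name I made up:

http://site-a.example.com/page1    subsetId=42
http://site-a.example.com/page2    subsetId=42
http://site-b.example.com/page3    subsetId=43

As far as I understand, this metadata ends up in the CrawlDatum for each
URL, so a custom plugin could read it later during the crawl.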

Being new to Nutch, I'd like to ask this mailing list for input on the best
way to use and customise Nutch in this kind of scenario. Can you point me to
extension point documentation and other resources that would be relevant in
this case?

So far, I've identified the following extension points that could
potentially be useful:
* URLFilter, SegmentMergeFilter or ScoringFilter: could be used for
selecting new links to fetch
* IndexingFilter: analysing and processing page content
* IndexWriter: post-processing a segment and writing out the results

Do you think these are valid in this case?
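
To make the URLFilter option more concrete, below is a rough, untested
sketch of the kind of filter I have in mind. SubsetURLFilter and the
mycrawler.allowed.hosts property are just names I made up for illustration:

package org.example.crawl;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

/** Sketch: keep only URLs whose host appears in a configured whitelist. */
public class SubsetURLFilter implements URLFilter {

  private Configuration conf;
  private Set<String> allowedHosts = new HashSet<String>();

  // Return the URL unchanged to accept it, or null to reject it.
  public String filter(String urlString) {
    try {
      String host = new URL(urlString).getHost();
      return allowedHosts.contains(host) ? urlString : null;
    } catch (MalformedURLException e) {
      return null;
    }
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // "mycrawler.allowed.hosts" is a made-up property holding a
    // comma-separated host whitelist
    String hosts = conf.get("mycrawler.allowed.hosts", "");
    allowedHosts.addAll(Arrays.asList(hosts.split("\\s*,\\s*")));
  }

  public Configuration getConf() {
    return conf;
  }
}

My understanding is that the plugin would also need a plugin.xml declaring
the org.apache.nutch.net.URLFilter extension point, and that the plugin id
would have to be added to the plugin.includes property. Is that right?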

Which Nutch version would you recommend using for new development projects?
Version 1.9 or 2.3?

Can the fetch list be dynamically changed during a crawl or can new URLs
only be added for the next crawl?
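
For context, my current understanding of the 1.x cycle is roughly the
following (please correct me if I've got it wrong):

bin/nutch inject crawl/crawldb urls/
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/<segment>
bin/nutch parse crawl/segments/<segment>
bin/nutch updatedb crawl/crawldb crawl/segments/<segment>

So as far as I can tell, new URLs discovered during parsing only enter the
crawldb at the updatedb step and become fetch candidates in the next
generate round.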

How should the URL subset concept relate to the segment concept in Nutch?
From the application's point of view, it would be simpler to have Nutch
process each subset of URLs as one segment, but this design would probably
sacrifice fetch performance if per-host throttling is used. Does Nutch
typically process just one segment at a time, or can multiple segments be
processed concurrently with throttling applied across all segments? Or
should the URL subset concept be implemented in the application instead,
with the application tracking when content for a particular URL subset
becomes available?

best regards,

marko