Posted to dev@nutch.apache.org by Jake Dodd <ja...@ontopic.io> on 2014/06/17 19:30:07 UTC

Nutch Extension for realtime processing

Hi all,

My organization is considering the creation of a Nutch extension point that would enable realtime processing of Nutch documents as they're fetched. We want to pass Nutch-fetched documents to a realtime framework such as Storm or Spark. Currently, it's trivial to implement a custom Indexer plugin that sort of gets the job done. However, this doesn't really meet the realtime requirement: you must wait for the fetch, parse, updatedb, index cycle to complete.

Our idea is to create a FetcherDisseminator extension point. A FetcherDisseminator would implement a disseminate() method that would take care of serializing the fetched data (JSON, Avro, etc.) and sending it to an external entity (for example, a REST interface or a Kafka broker).
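
To make the idea concrete, here is a minimal sketch of what such an extension point might look like. Only the names FetcherDisseminator and disseminate() come from the proposal; the method parameters, the JsonDisseminator class, and the Consumer-based sink (standing in for a REST client or Kafka producer) are assumptions for illustration, not existing Nutch APIs.

```java
import java.io.IOException;
import java.util.Base64;
import java.util.function.Consumer;

class DisseminatorSketch {

    /** Hypothetical extension point; the parameter list is an assumption. */
    interface FetcherDisseminator {
        void disseminate(String url, String contentType, byte[] content) throws IOException;
    }

    /** Hypothetical implementation: serializes to JSON and hands the result to a sink. */
    static class JsonDisseminator implements FetcherDisseminator {
        // Stand-in for a REST client, Kafka producer, etc.
        private final Consumer<String> sink;

        JsonDisseminator(Consumer<String> sink) {
            this.sink = sink;
        }

        @Override
        public void disseminate(String url, String contentType, byte[] content) {
            sink.accept(toJson(url, contentType, content));
        }

        /** Builds a small JSON object; raw bytes are Base64-encoded. */
        static String toJson(String url, String contentType, byte[] content) {
            return "{\"url\":\"" + escape(url) + "\","
                 + "\"contentType\":\"" + escape(contentType) + "\","
                 + "\"content\":\"" + Base64.getEncoder().encodeToString(content) + "\"}";
        }

        /** Escapes quotes, backslashes, and control characters for a JSON string. */
        static String escape(String s) {
            StringBuilder sb = new StringBuilder();
            for (char c : s.toCharArray()) {
                switch (c) {
                    case '"':  sb.append("\\\""); break;
                    case '\\': sb.append("\\\\"); break;
                    default:
                        if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                        else sb.append(c);
                }
            }
            return sb.toString();
        }
    }
}
```

Keeping serialization and transport behind a single method would let one plugin emit JSON over REST and another emit Avro to Kafka without any change to the fetcher itself.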

The FetcherDisseminators would be called from within the org.apache.nutch.fetcher.Fetcher.FetcherThread class. The implementation would ensure that the normal fetch-parse-updatedb-index cycle is unaffected, even if a disseminator fails.
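
The failure isolation described above could be sketched roughly as follows. This is an illustration under assumptions, not Nutch code: the SafeDissemination class, the disseminateAll() helper, and the simplified disseminate() signature are all hypothetical, and the real FetcherThread is not shown.

```java
import java.util.List;

class SafeDissemination {

    /** Hypothetical extension point with a simplified, assumed signature. */
    interface FetcherDisseminator {
        void disseminate(String url, byte[] content) throws Exception;
    }

    /**
     * Calls every disseminator and never throws, so the caller (the fetcher
     * thread in the proposal) can carry on with its normal cycle.
     * Returns the number of disseminators that failed.
     */
    static int disseminateAll(List<FetcherDisseminator> disseminators,
                              String url, byte[] content) {
        int failures = 0;
        for (FetcherDisseminator d : disseminators) {
            try {
                d.disseminate(url, content);
            } catch (Exception e) {
                // Log and continue: one broken endpoint must not stop the fetch.
                failures++;
                System.err.println("Disseminator failed for " + url + ": " + e);
            }
        }
        return failures;
    }
}
```

Catching exceptions only covers disseminators that fail fast. A call that blocks (say, on an unreachable broker) would still stall the calling thread, so a real implementation would probably also need timeouts or a hand-off to a separate queue.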

My first question is whether something like this has been discussed before by the Nutch developers and, if so, whether there is any current work on it.

My second question is whether there is any interest from the community in such a feature. If so, we’d love your input on how to go about contributing to the Nutch project.

Cheers

Jake