Posted to common-dev@hadoop.apache.org by Doug Judd <do...@zvents.com> on 2007/01/26 02:23:55 UTC

Heritrix-Hadoop connector (hdfs-writer-processor)

Hello,

I've written an extension to the Internet Archive's open source "Heritrix"
crawler that lets it write into HDFS in SequenceFile format.  The key
is the URL and the value is the HTTP response with some additional
metadata.  It's quite simple to use: just drop a few jar files into
the Heritrix lib/ directory and you're good to go.  Here's a link to the
download page:  http://www.zvents.com/labs/hdfs_writer_processor .  For
those of you who are interested, give it a whirl and feel free to send me
feedback.

- Doug Judd
  doug@zvents.com