Posted to common-dev@hadoop.apache.org by Doug Judd <do...@zvents.com> on 2007/01/16 01:43:29 UTC

Heritrix / Hadoop integration

Hello,

I've developed an extension to Heritrix (the Internet Archive's open source
crawler) that allows it to write directly into HDFS.  It looks like the
developers over there are interested in including it in their project.
I've designed it to write SequenceFiles, using the URL as the key and the
HTTP response as the value.  I have a couple of questions that I could use
a little help with:

1. I can't seem to set the replication factor on a SequenceFile.  There's no
way to pass it in, and when I call the createWriter factory method and then
call FileSystem.setReplication, the file still seems to use the default
value.  Is there any way to do this, or should I file an enhancement request?

2. It appears that the Configuration class looks for the conf/ directory on
the CLASSPATH.  This makes it difficult to integrate with Heritrix.  For
now, I've modified the Heritrix launch script by hardcoding the Hadoop
configuration directory into the CLASSPATH.  It seems like a better approach
would be to provide a text box on the Heritrix settings page that allows
the user to enter the path to the Hadoop configuration directory.
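
One possible way to support a user-supplied configuration directory without
editing the launch script is to wrap that directory in a class loader at
runtime, so that files in it become visible as classpath resources.  The
sketch below uses only the JDK.  The class name ConfDirLoader, the method
forDirectory, and the file name hadoop-site.xml are illustrative assumptions,
and whether Hadoop's Configuration would actually pick up resources from the
thread context class loader has not been verified here.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: exposes a directory chosen at runtime (for example,
// one typed into a settings page) as a source of classpath resources.
class ConfDirLoader {

    /** Builds a class loader that resolves resource names against confDir. */
    static ClassLoader forDirectory(String confDir, ClassLoader parent)
            throws MalformedURLException {
        Path dir = Paths.get(confDir).toAbsolutePath();
        if (!Files.isDirectory(dir)) {
            throw new IllegalArgumentException("Not a directory: " + dir);
        }
        // For an existing directory, toUri() yields a URI ending in '/',
        // which URLClassLoader treats as a directory rather than a JAR.
        URL url = dir.toUri().toURL();
        return new URLClassLoader(new URL[] { url }, parent);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the directory is passed as the first argument.
        String confDir = args.length > 0 ? args[0] : "conf";
        ClassLoader loader =
                forDirectory(confDir, ConfDirLoader.class.getClassLoader());

        // Code that looks resources up through the context class loader
        // will now also search confDir.
        Thread.currentThread().setContextClassLoader(loader);

        // Prints the resolved URL, or null if the file is not present.
        System.out.println(loader.getResource("hadoop-site.xml"));
    }
}
```

This keeps the CLASSPATH set by the launch script untouched, at the cost of
relying on the library to consult the context class loader when it searches
for its configuration files.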

- Doug Judd
  doug@zvents.com
  http://www.zvents.com/