Posted to common-user@hadoop.apache.org by Phil Hagelberg <ph...@hagelb.org> on 2009/08/18 02:25:28 UTC

Re-using output directories

I'm trying to write a Hadoop job that will add documents to an existing
lucene index. My initial idea was to set the index as the output
directory and create an IndexWriter based on
FileOutputFormat.getOutputPath(context), but this requires that the
output path not exist when the job begins. I also had the idea to use
the job's working directory instead, but it appears the job _must_ be
configured with an output path; it can't be left unset.

I'm thinking the answer would be to set it to a bogus tempfile and
delete that, but that seems awfully hacky. There's got to be a better way
to handle this, right?
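To make it concrete, the hack would look roughly like this (an untested
sketch; the "index.dir" key and the paths are made up, and the tasks
would read the real index location back out of the config):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Pass the real index location to the tasks out-of-band.
    conf.set("index.dir", "/indexes/current");  // hypothetical key and path
    Job job = new Job(conf, "add-to-index");
    job.setJarByClass(IndexJobDriver.class);
    // Satisfy the "output path must not exist" check with a scratch dir...
    Path scratch = new Path("/tmp/index-job-" + System.currentTimeMillis());
    FileOutputFormat.setOutputPath(job, scratch);
    boolean ok = job.waitForCompletion(true);
    // ...then throw the scratch output away once the job finishes.
    FileSystem.get(conf).delete(scratch, true);
    System.exit(ok ? 0 : 1);
  }
}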

-Phil

Re: Re-using output directories

Posted by Enis Soztutar <en...@gmail.com>.
Phil Hagelberg wrote:
> I'm trying to write a Hadoop job that will add documents to an existing
> lucene index. My initial idea was to set the index as the output
> directory and create an IndexWriter based on
> FileOutputFormat.getOutputPath(context), but this requires that the
> output path not exist when the job begins. I also had the idea to use
> the job's working directory instead, but it appears the job _must_ be
> configured with an output path; it can't be left unset.
>
> I'm thinking the answer would be to set it to a bogus tempfile and
> delete that, but that seems awfully hacky. There's got to be a better way
> to handle this, right?
>
> -Phil
>   
You may want to write your own OutputFormat for this. Please check the index
contrib module in hadoop/src/contrib and the Indexer class in Nutch.
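The key part is overriding checkOutputSpecs(), which is where
FileOutputFormat refuses to run when the output directory already
exists. Roughly like this (untested, and the class name is made up;
extend whichever FileOutputFormat subclass you actually use):

import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ExistingDirOutputFormat<K, V> extends TextOutputFormat<K, V> {
  @Override
  public void checkOutputSpecs(JobContext job) throws IOException {
    // FileOutputFormat.checkOutputSpecs() throws FileAlreadyExistsException
    // when the output path exists; skipping the check lets the job write
    // into an existing index directory.
  }
}

The real work then goes into a custom RecordWriter that opens an
IndexWriter on the output path and adds documents to it; the contrib
index code shows one way to structure that.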