Posted to common-user@hadoop.apache.org by Huy Phan <da...@gmail.com> on 2009/09/18 04:17:09 UTC

Prepare input data for Hadoop

Hi all,

I have a question about the strategy for preparing input data for a Hadoop 
MapReduce job. We have to (somehow) copy input files from our local 
filesystem to HDFS. How can we make sure that one input file is not 
processed twice by different executions of the same MapReduce job (let's 
say my MapReduce job runs once every 30 minutes)?
I don't want to delete my input files after the MR job finishes, because 
I may want to re-use them later.




Re: Prepare input data for Hadoop

Posted by Aaron Kimball <aa...@cloudera.com>.
Use an external database (e.g., MySQL) or some other transactional
bookkeeping system to record the state of all your datasets (STAGING,
UPLOADED, PROCESSED).
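
A minimal sketch of that idea, assuming a MySQL table called
dataset_state accessed over plain JDBC; the table, its columns, and the
helper names here are made up for illustration, not anything Hadoop
itself provides:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical bookkeeping around a single MySQL table:
    //
    //   CREATE TABLE dataset_state (
    //     path  VARCHAR(512) PRIMARY KEY,
    //     state ENUM('STAGING','UPLOADED','PROCESSED') NOT NULL
    //   );
    //
    // A file moves STAGING -> UPLOADED when its copy to HDFS finishes,
    // and UPLOADED -> PROCESSED when a MapReduce run consumes it, so a
    // job that fires every 30 minutes never picks up the same file twice.
    public class DatasetBookkeeping {

        private final Connection conn;

        public DatasetBookkeeping(String jdbcUrl, String user, String pass)
                throws Exception {
            this.conn = DriverManager.getConnection(jdbcUrl, user, pass);
        }

        /** HDFS paths that are uploaded but not yet processed. */
        public List<String> uploadedPaths() throws Exception {
            List<String> paths = new ArrayList<String>();
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT path FROM dataset_state WHERE state = 'UPLOADED'");
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                paths.add(rs.getString(1));
            }
            return paths;
        }

        /** Record that a staged file has landed in HDFS. */
        public void markUploaded(String path) throws Exception {
            PreparedStatement ps = conn.prepareStatement(
                    "UPDATE dataset_state SET state = 'UPLOADED'"
                    + " WHERE path = ? AND state = 'STAGING'");
            ps.setString(1, path);
            ps.executeUpdate();
        }

        /** Mark a file processed only after the MR job reports success. */
        public void markProcessed(String path) throws Exception {
            PreparedStatement ps = conn.prepareStatement(
                    "UPDATE dataset_state SET state = 'PROCESSED'"
                    + " WHERE path = ? AND state = 'UPLOADED'");
            ps.setString(1, path);
            ps.executeUpdate();
        }
    }

Your job driver would call uploadedPaths(), feed each path to
FileInputFormat.addInputPath(), and call markProcessed() only after
job.waitForCompletion() returns true. A failed run leaves the rows in
UPLOADED, so the next 30-minute cycle simply retries them, and the
files themselves never have to be deleted from HDFS.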

- Aaron


On Thu, Sep 17, 2009 at 7:17 PM, Huy Phan <da...@gmail.com> wrote:

> Hi all,
>
> I have a question about the strategy for preparing input data for a Hadoop
> MapReduce job. We have to (somehow) copy input files from our local
> filesystem to HDFS. How can we make sure that one input file is not
> processed twice by different executions of the same MapReduce job (let's
> say my MapReduce job runs once every 30 minutes)?
> I don't want to delete my input files after the MR job finishes, because
> I may want to re-use them later.