Posted to common-user@hadoop.apache.org by Huy Phan <da...@gmail.com> on 2009/09/18 04:17:09 UTC
Prepare input data for Hadoop
Hi all,
I have a question about the strategy for preparing data for Hadoop to run a
MapReduce job. We have to (somehow) copy input files from our local
filesystem to HDFS. How can we make sure that one input file is not
processed twice in different executions of the same MapReduce job (let's
say my MapReduce job runs once every 30 minutes)?
I don't want to delete my input files after the MR job finishes, because
I may want to re-use them later.
Re: Prepare input data for Hadoop
Posted by Aaron Kimball <aa...@cloudera.com>.
Use an external database (e.g., MySQL) or some other transactional
bookkeeping system to record the state of all your datasets (STAGING,
UPLOADED, PROCESSED).
- Aaron
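The bookkeeping Aaron suggests could be sketched roughly as below, using SQLite for brevity (MySQL would work the same way); the table, column, and function names are illustrative assumptions, not anything from the thread:

```python
# Sketch of a transactional bookkeeping store that records each input
# file's state (STAGING -> UPLOADED -> PROCESSED), so a scheduled job
# run only picks up files that have not been processed yet.
# All names here are illustrative assumptions.
import sqlite3


def init_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS datasets (
               filename TEXT PRIMARY KEY,
               state    TEXT NOT NULL
                        CHECK (state IN ('STAGING', 'UPLOADED', 'PROCESSED'))
           )"""
    )
    return conn


def register(conn, filename):
    # A newly discovered local file enters in STAGING.
    # INSERT OR IGNORE makes repeated discovery of the same file harmless.
    conn.execute(
        "INSERT OR IGNORE INTO datasets VALUES (?, 'STAGING')", (filename,)
    )
    conn.commit()


def advance(conn, filename, new_state):
    # Called after the copy to HDFS (-> UPLOADED) and after the
    # MapReduce job succeeds (-> PROCESSED).
    conn.execute(
        "UPDATE datasets SET state = ? WHERE filename = ?",
        (new_state, filename),
    )
    conn.commit()


def files_to_process(conn):
    # Each 30-minute run feeds the job only files uploaded but not
    # yet processed; files stay in the table for later re-use.
    rows = conn.execute(
        "SELECT filename FROM datasets WHERE state = 'UPLOADED'"
    ).fetchall()
    return [r[0] for r in rows]
```

Because the state lives in one transactional store rather than in HDFS itself, the input files can stay in place indefinitely without ever being picked up twice.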
On Thu, Sep 17, 2009 at 7:17 PM, Huy Phan <da...@gmail.com> wrote:
> Hi all,
>
> I have a question about the strategy for preparing data for Hadoop to run a
> MapReduce job. We have to (somehow) copy input files from our local
> filesystem to HDFS. How can we make sure that one input file is not
> processed twice in different executions of the same MapReduce job (let's say
> my MapReduce job runs once every 30 minutes)?
> I don't want to delete my input files after the MR job finishes, because I
> may want to re-use them later.