Posted to user@hadoop.apache.org by Ivan Ryndin <ir...@gmail.com> on 2012/12/11 06:59:51 UTC

Best practices for working with files in Hadoop MR jobs

Hi all,

I have the following question:
What are the best practices for working with files in Hadoop?

I need to process a lot of log files that arrive in Hadoop every minute,
and I have multiple jobs for each file.
Each file has a unique name which includes the name of the front-end node
and a timestamp.
A file is considered fully processed if and only if all jobs have completed
successfully.

Currently I put all files into a single HDFS input directory, e.g.
/user/logp/input
Then I run a bunch of jobs against those files.
After successful completion I need to move the processed files out of the
HDFS input directory /user/logp/input to somewhere else (e.g. AWS S3,
Glacier, or something similar).

How should I approach this problem?
Currently I see two approaches:

1) Each job copies the files into its own HDFS input directory (e.g.
/user/logp/job/Job1/input/{timestamp}) and then reads the files from there.
When it has processed the files successfully, it removes them from that
directory. The main jobs driver removes the files from the shared input
directory once all jobs have their own copies (see the first sketch after
this list).

2) Each job has a table in HBase where it records the files it has
processed. Once a job has completed successfully against a file, it writes
the filename into its HBase table. The main jobs driver then checks the
files in the input directory and removes those that have been processed by
all jobs (see the second sketch after this list).
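
To make the first approach concrete, I imagine the staging and cleanup
steps would look roughly like this, using the FileSystem API (the class and
method names below are just illustrative, not anything I actually have):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class JobInputStager {

    // Copy every file from the shared drop directory into a job-private
    // input directory, so the job can process and delete its own copy.
    public static void stage(Configuration conf, String jobName, long timestamp)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path sharedInput = new Path("/user/logp/input");
        Path jobInput = new Path("/user/logp/job/" + jobName + "/input/" + timestamp);
        fs.mkdirs(jobInput);
        for (FileStatus stat : fs.listStatus(sharedInput)) {
            // copy, not move: the other jobs still need to read the original
            FileUtil.copy(fs, stat.getPath(),
                          fs, new Path(jobInput, stat.getPath().getName()),
                          false /* don't delete source */, conf);
        }
    }

    // After the job has processed its copy successfully, drop the whole
    // staging directory for that timestamp.
    public static void cleanup(Configuration conf, String jobName, long timestamp)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        fs.delete(new Path("/user/logp/job/" + jobName + "/input/" + timestamp), true);
    }
}

The driver would call stage(...) for every job before kicking them off, and
only delete the files in /user/logp/input once every stage(...) call has
succeeded.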
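
And for the second approach, the HBase bookkeeping could be sketched
roughly like this (written against the old 0.94-era HTable client API; the
table, column family and qualifier names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ProcessedFileTracker {

    private static final byte[] FAMILY = Bytes.toBytes("f");
    private static final byte[] DONE = Bytes.toBytes("done");

    // A job calls this after it has successfully processed a file;
    // the row key is simply the file name.
    public static void markProcessed(Configuration conf, String jobName, String fileName)
            throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(conf), "processed_" + jobName);
        try {
            Put put = new Put(Bytes.toBytes(fileName));
            put.add(FAMILY, DONE, Bytes.toBytes(System.currentTimeMillis()));
            table.put(put);
        } finally {
            table.close();
        }
    }

    // The driver asks every job's table before removing a file from
    // /user/logp/input.
    public static boolean processedByAllJobs(Configuration conf, String[] jobNames,
            String fileName) throws Exception {
        for (String jobName : jobNames) {
            HTable table = new HTable(HBaseConfiguration.create(conf), "processed_" + jobName);
            try {
                if (!table.exists(new Get(Bytes.toBytes(fileName)))) {
                    return false;
                }
            } finally {
                table.close();
            }
        }
        return true;
    }
}

The driver would then loop over the files in /user/logp/input and remove
(or archive) each one for which processedByAllJobs(...) returns true.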

Perhaps I am missing another approach that is considered best practice?
Can you please tell me what you think about all this?

Thank you in advance!!


-- 
Best regards,
Ivan

Re: Best practices for working with files in Hadoop MR jobs

Posted by Mahesh Balija <ba...@gmail.com>.
One more approach I would prefer:

3) Once a job finishes processing an input file, move the file to another
path (say /input/processed), and then delete the files in that path after
all the jobs have finished execution.
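
A minimal sketch of that move, using the /input/processed path above (the
helper name is mine, just for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MoveWhenDone {

    // In HDFS a rename is a cheap metadata-only operation, so "moving" a
    // finished file out of the live input directory costs almost nothing.
    public static void moveToProcessed(Configuration conf, Path inputFile)
            throws java.io.IOException {
        FileSystem fs = FileSystem.get(conf);
        Path processedDir = new Path("/input/processed");
        fs.mkdirs(processedDir);
        if (!fs.rename(inputFile, new Path(processedDir, inputFile.getName()))) {
            throw new java.io.IOException("could not move " + inputFile + " to " + processedDir);
        }
    }
}

The driver can then delete everything under /input/processed once all the
jobs for that batch have finished.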

If this solution doesn't work for you, just stick to the first one.

Best,
Mahesh Balija,
CalSoft Labs.

On Tue, Dec 11, 2012 at 11:29 AM, Ivan Ryndin <ir...@gmail.com> wrote:

> Hi all,
>
> I have the following question:
> What are the best practices for working with files in Hadoop?
>
> I need to process a lot of log files that arrive in Hadoop every minute,
> and I have multiple jobs for each file.
> Each file has a unique name which includes the name of the front-end node
> and a timestamp.
> A file is considered fully processed if and only if all jobs have
> completed successfully.
>
> Currently I put all files into a single HDFS input directory, e.g.
> /user/logp/input
> Then I run a bunch of jobs against those files.
> After successful completion I need to move the processed files out of the
> HDFS input directory /user/logp/input to somewhere else (e.g. AWS S3,
> Glacier, or something similar).
>
> How should I approach this problem?
> Currently I see two approaches:
>
> 1) Each job copies the files into its own HDFS input directory (e.g.
> /user/logp/job/Job1/input/{timestamp}) and then reads the files from
> there. When it has processed the files successfully, it removes them from
> that directory. The main jobs driver removes the files from the shared
> input directory once all jobs have their own copies.
>
> 2) Each job has a table in HBase where it records the files it has
> processed. Once a job has completed successfully against a file, it
> writes the filename into its HBase table. The main jobs driver then
> checks the files in the input directory and removes those that have been
> processed by all jobs.
>
> Perhaps I am missing another approach that is considered best practice?
> Can you please tell me what you think about all this?
>
> Thank you in advance!!
>
>
> --
> Best regards,
> Ivan
>
>
