Posted to user@oozie.apache.org by Harshal Vora <ha...@komli.com> on 2011/11/23 15:09:42 UTC

Versioning of file during re-run

Hi,

We have a job for processing logs. Multiple log servers dump files into HDFS, and MapReduce jobs process these files.
This process runs every half hour.
Sometimes one of the log servers is down and a few files are missing. When that happens, we go ahead and process whatever files are available. But when the missing files become available, say after 5 hours, we want to re-run all the jobs that ran during the past 5 hours.

We want to do this because each job's output depends on the output of the previous instance of the job, and we keep a running count both within time intervals and across time intervals.

From what I understand, I will have to re-run each co-ordinator or bundle instance within the last 5 hours. At the same time, I will have to stop any new instances from running until the last 5 hours of files are processed and the jobs have caught up with all new files.

But the issue we are facing is that, for the previous co-ordinator instances to re-run, we have to delete the previous output files of those co-ordinator instances in HDFS, and the re-run will produce new files. We don't want to do that. We want to have something like {path}/{timestamp}/{rev-1}, and for the re-run we want {path}/{timestamp}/{rev-2}. And when the following job looks for coord:current(-1), it should pick up the rev-2 file.

From what I understand, this is not possible with Oozie, i.e. there is no revisioning of files when a job is re-run. Or is there any possibility?

Or is there a better approach to do this, such as using coord:latest(-1)?

Regards,

Re: Versioning of file during re-run

Posted by Mayank Bansal <ma...@gmail.com>.
Hi Harshal,

Please find the answers below.

Thanks,
Mayank

On Wed, Nov 23, 2011 at 6:09 AM, Harshal Vora <ha...@komli.com> wrote:

>
> Hi,
>
> We have job for processing logs. There are multiple log servers that dump
> file into hdfs and map reduce jobs process these files.
> This process happens every half hour.
> Sometimes it may happen that one of the log servers is down and few files
> are missing. At that moment, we will go ahead with processing of whatever
> files are available. But when the missing files are available say after 5
> hours, we want to re run all the jobs that ran for the past 5 hours.
>

> We want to do this, because the output depends on the output of previous
> instance of the job and we are keeping a running count in between time
> intervals and also across time intervals.
>
> From what I understand, I will have to re-run each co-ordinator or bundle
> instance within the last 5 hours. At the same time I will have to stop any
> new instances from running until the last 5 hours files are processed and
> they catch up till all new files are processed.
>
[MAYANK] You can use a bundle if you do not want to manage multiple
coordinators; the bundle would then be a single point of entry. I do not
think you need to stop any new instances, as long as those instances do not
fall within the 5-hour window.
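
To make the 5-hour window concrete, here is a minimal sketch that lists the nominal times of the half-hourly instances that would need a re-run. It is plain Python and does not call Oozie; the function names, the window end time, and the timestamp format are illustrative assumptions, not anything taken from the job described in this thread. The list is ordered oldest first because each instance depends on the output of the previous one.

```python
from datetime import datetime, timedelta


def instances_to_rerun(window_end, hours=5, frequency_minutes=30):
    """Return nominal times inside the re-run window, oldest first.

    window_end is exclusive: the instance that starts at window_end
    is treated as a "new" instance, not part of the re-run.
    """
    step = timedelta(minutes=frequency_minutes)
    current = window_end - timedelta(hours=hours)
    times = []
    while current < window_end:
        times.append(current)
        current += step
    return times


def format_nominal(t):
    """Format a nominal time as yyyy-MM-ddTHH:mmZ (assumed to be UTC)."""
    return t.strftime("%Y-%m-%dT%H:%MZ")


if __name__ == "__main__":
    # Hypothetical window end, used only for illustration.
    for t in instances_to_rerun(datetime(2011, 11, 23, 15, 0)):
        print(format_nominal(t))
```

With a 30-minute frequency, a 5-hour window covers ten instances, and these are the ones that have to be re-run in order before later instances can rely on their output.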

>
> But the issue that we are facing is, for the previous co-ordinator
> instances to re-run we have to delete the previous output files of those
> co-ordinator instances in hdfs and the re-run will produce new files. We
> don't want to do that. We want to have something like
> {path}/{timestamp}/{rev-1} and for the re-run we want
> {path}/{timestamp}/{rev-2}. And in the following job when it looks for
> coord:current(-1) it should pick up rev2 file.
>
[MAYANK] Unfortunately, Oozie does not support data versioning as of now,
but in the future, after HCatalog integration, we should be able to achieve
that.
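
Since Oozie does not version the data itself, one possible workaround is to let the job code choose the revision directory. The sketch below is only an illustration of the {path}/{timestamp}/{rev-N} layout from the question. It assumes the caller has already listed the entries under {path}/{timestamp} with whatever HDFS client is in use; the names latest_revision and revision_paths are hypothetical helpers, not Oozie functions.

```python
import re

# Matches directory names such as "rev-1", "rev-2", "rev-10".
REV_PATTERN = re.compile(r"^rev-(\d+)$")


def latest_revision(dir_names):
    """Return the highest revision number found, or None if there is none."""
    revisions = []
    for name in dir_names:
        match = REV_PATTERN.match(name)
        if match:
            revisions.append(int(match.group(1)))
    return max(revisions) if revisions else None


def revision_paths(base, timestamp, dir_names):
    """Return (read_path, write_path) for one timestamp directory.

    read_path  : latest existing revision, for a downstream job to consume
                 (None if no revision exists yet).
    write_path : next revision, for a run or re-run to write into.
    """
    latest = latest_revision(dir_names)
    next_rev = 1 if latest is None else latest + 1
    read_path = None if latest is None else f"{base}/{timestamp}/rev-{latest}"
    write_path = f"{base}/{timestamp}/rev-{next_rev}"
    return read_path, write_path
```

Revisions are compared as integers, so rev-10 sorts after rev-9. The coordinator dataset would still point at {path}/{timestamp}; the choice of revision happens inside the workflow, which is a design assumption of this sketch rather than an Oozie feature.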

>
> From what I understand this is not possible with oozie. i.e. there is no
> revisioning of files if the job has re-run. Or is there any possibility?
>
> Or is there a better approach to do this? using coord:latest(-1)?
>
[MAYANK] If you use latest, then it would be hard to manage the inputs for
the subsequent jobs.

>
> Regards,
>


