Posted to user@flink.apache.org by Paulo Cezar <pa...@gogeo.io> on 2016/09/27 20:42:55 UTC

Failures on DataSet programs

Hi Folks,

I was wondering if it's possible to keep partial outputs from DataSet
programs.
I have a batch pipeline that writes its output to HDFS using
writeAsFormattedText. When it fails, the output file is deleted, but I would
like to keep it so that I can generate new inputs for the pipeline and avoid
reprocessing.

[]'s
Paulo Cezar

Re: Failures on DataSet programs

Posted by Ufuk Celebi <uc...@apache.org>.
Hey Paulo! I think it's not possible out of the box at the moment, but
you can try the following as a workaround:

1) Create a custom OutputFormat that extends TextOutputFormat and
overrides the cleanup method:

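import org.apache.flink.api.java.io.TextOutputFormat;
import org.apache.flink.core.fs.Path;

public class NoCleanupTextOutputFormat<T> extends TextOutputFormat<T> {

    // TextOutputFormat has no no-arg constructor, so pass the path through
    public NoCleanupTextOutputFormat(Path outputPath) {
        super(outputPath);
    }

    @Override
    public void tryCleanupOnError() {
        // ignore cleanup on error, so the partial output file is kept
    }

}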

2) writeAsFormattedText is actually a map + writeAsText (if you look
into DataSet.java). Instead of that you should manually do:

dataSet.map(new FormattingMapper<>(dataSet.clean(formatter)))
       .output(new NoCleanupTextOutputFormat<>(..))


This should work as expected. You can furthermore open an issue with a
feature request to allow configuring Flink's TextOutputFormat to
ignore cleanup.

Best,

Ufuk
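
Note: to illustrate why overriding the cleanup hook keeps the partial
file, here is a minimal plain-Java sketch of the pattern. It does not use
Flink at all. FileWriterFormat, NoCleanupFileWriterFormat and
survivesFailure are hypothetical stand-ins invented for this example, and
the "FAIL" record merely simulates a task failure. Flink's actual failure
handling is more involved, so treat this as a model of the idea only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

public class PartialOutputDemo {

    /** Hypothetical stand-in for a file-based output format (not a Flink class). */
    public static class FileWriterFormat {
        public final Path target;

        public FileWriterFormat(Path target) {
            this.target = target;
        }

        /** Appends one record as a line, creating the file if needed. */
        public void writeRecord(String record) throws IOException {
            Files.write(target, (record + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        /** Default behavior on failure: delete whatever was written so far. */
        public void tryCleanupOnError() throws IOException {
            Files.deleteIfExists(target);
        }
    }

    /** Same writer, but the cleanup hook is overridden to do nothing. */
    public static class NoCleanupFileWriterFormat extends FileWriterFormat {
        public NoCleanupFileWriterFormat(Path target) {
            super(target);
        }

        @Override
        public void tryCleanupOnError() {
            // intentionally empty: keep the partial output
        }
    }

    /**
     * Writes records until the marker "FAIL" is seen, then invokes the
     * cleanup hook. Returns true if the output file still exists afterwards.
     */
    public static boolean survivesFailure(FileWriterFormat format, List<String> records)
            throws IOException {
        try {
            for (String record : records) {
                if (record.equals("FAIL")) {
                    throw new IllegalStateException("simulated task failure");
                }
                format.writeRecord(record);
            }
        } catch (IllegalStateException e) {
            format.tryCleanupOnError();
        }
        return Files.exists(format.target);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("partial-output-demo");
        List<String> records = Arrays.asList("a", "b", "FAIL", "c");

        System.out.println("default cleanup keeps file: "
                + survivesFailure(new FileWriterFormat(dir.resolve("default.txt")), records));
        System.out.println("no-op cleanup keeps file:   "
                + survivesFailure(new NoCleanupFileWriterFormat(dir.resolve("kept.txt")), records));
    }
}
```

With the default hook the file is removed after the simulated failure.
With the no-op override the records written before the failure ("a" and
"b") stay on disk, which is what you would use to work out which inputs
still need to be processed.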

