Posted to dev@flink.apache.org by "Jakub Nowacki (JIRA)" <ji...@apache.org> on 2017/04/06 12:03:41 UTC

[jira] [Created] (FLINK-6272) Rolling file sink saves incomplete lines on failure

Jakub Nowacki created FLINK-6272:
------------------------------------

             Summary: Rolling file sink saves incomplete lines on failure
                 Key: FLINK-6272
                 URL: https://issues.apache.org/jira/browse/FLINK-6272
             Project: Flink
          Issue Type: Bug
          Components: filesystem-connector, Streaming Connectors
    Affects Versions: 1.2.0
         Environment: Flink 1.2.0, Scala 2.11, Debian GNU/Linux 8.7 (jessie), CDH 5.8, YARN
            Reporter: Jakub Nowacki


We have a simple pipeline with a Kafka source (0.9), which transforms data and writes to a rolling file sink; the job runs on YARN. The sink is a plain HDFS {{BucketingSink}} with a {{StringWriter}}, configured as follows:
{code:java}
val fileSink = new BucketingSink[String]("some_path")
fileSink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd"))
fileSink.setWriter(new StringWriter())
fileSink.setBatchSize(1024 * 1024 * 1024) // 1 GB
{code}
Checkpointing is on. Both the Kafka source and the file sink should in theory provide an [exactly-once guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].
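For reference, the report does not show how checkpointing was enabled; a minimal sketch using the Flink 1.2 streaming API might look like the following (the 60-second interval is a hypothetical placeholder, not taken from our job):
{code:java}
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Take a checkpoint every 60 s (interval is a placeholder).
env.enableCheckpointing(60000)
// Exactly-once is the default checkpointing mode; made explicit here.
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
{code}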

On failure, some files which appear to be complete (not {{in_progress}} files or similar, but under 1 GB and confirmed to have been created at the time of the failure) turn out to have their last line cut off. In our case this is visible because we save the data as line-by-line JSON, so the truncation produces an invalid JSON line. It does not happen on every failure, but I have noticed at least 3 incidents like this.
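The truncation can be spotted with a crude check on the last line of each output file; a self-contained sketch (the heuristic of balanced braces is an illustration, not a full JSON parser):
{code:java}
object TruncationCheck {
  // A line is "plausibly complete" if it starts with '{', ends with '}',
  // and its braces balance; a record cut mid-write fails this.
  def looksComplete(line: String): Boolean =
    line.startsWith("{") && line.endsWith("}") &&
      line.count(_ == '{') == line.count(_ == '}')

  def main(args: Array[String]): Unit = {
    val good = """{"id": 1, "ts": "2017-04-06T12:00:00"}"""
    val cut  = """{"id": 2, "ts": "2017-04-""" // truncated mid-record
    println(looksComplete(good)) // true
    println(looksComplete(cut))  // false
  }
}
{code}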

Also, I am not sure if it is a separate bug, but in this case we also see some data duplication coming from Kafka: after the pipeline is restarted, a number of messages come out of the Kafka source which had already been saved in the previous file. We can tell the messages are duplicates because they have the same data but different timestamps, the timestamp being added within the Flink pipeline. In theory this should not happen, as the sink and source have an [exactly-once guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].
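The duplication pattern described above (same payload, differing only in the pipeline-added timestamp) can be confirmed by comparing records with the timestamp field stripped; a self-contained sketch, where the {{ts}} field name and the naive regex are hypothetical:
{code:java}
object DedupCheck {
  // Drop a top-level "ts" field so only the payload is compared
  // (naive regex, for illustration only).
  def payloadKey(line: String): String =
    line.replaceAll(""""ts"\s*:\s*"[^"]*",?""", "")

  def main(args: Array[String]): Unit = {
    val before = """{"id": 1, "ts": "2017-04-06T12:00:00", "v": "a"}"""
    val after  = """{"id": 1, "ts": "2017-04-06T12:05:17", "v": "a"}"""
    // Same record emitted twice, differing only in the added timestamp.
    println(payloadKey(before) == payloadKey(after)) // true
  }
}
{code}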



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)