Posted to issues@flink.apache.org by "Flink Jira Bot (Jira)" <ji...@apache.org> on 2022/01/03 22:39:00 UTC

[jira] [Updated] (FLINK-6272) Rolling file sink saves incomplete lines on failure

     [ https://issues.apache.org/jira/browse/FLINK-6272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-6272:
----------------------------------
      Labels: auto-deprioritized-major auto-deprioritized-minor  (was: auto-deprioritized-major stale-minor)
    Priority: Not a Priority  (was: Minor)

This issue was labeled "stale-minor" 7 days ago and has not received any updates, so it is being deprioritized. If this ticket is actually Minor, please raise the priority and ask a committer to assign you the issue or revive the public discussion.


> Rolling file sink saves incomplete lines on failure
> ---------------------------------------------------
>
>                 Key: FLINK-6272
>                 URL: https://issues.apache.org/jira/browse/FLINK-6272
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Common, Connectors / FileSystem
>    Affects Versions: 1.2.0
>         Environment: Flink 1.2.0, Scala 2.11, Debian GNU/Linux 8.7 (jessie), CDH 5.8, YARN
>            Reporter: Jakub Nowacki
>            Priority: Not a Priority
>              Labels: auto-deprioritized-major, auto-deprioritized-minor
>
> We have a simple pipeline with a Kafka source (0.9) which transforms the data and writes it to a rolling file sink; the job runs on YARN. The sink is a plain HDFS {{BucketingSink}} with a {{StringWriter}}, configured as follows:
> {code:java}
> import org.apache.flink.streaming.connectors.fs.StringWriter
> import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}
>
> val fileSink = new BucketingSink[String]("some_path")
> fileSink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd"))
> fileSink.setWriter(new StringWriter())
> fileSink.setBatchSize(1024 * 1024 * 1024) // roll after 1 GB
> {code}
> Checkpointing is on. Both the Kafka source and the file sink should in theory provide an [exactly-once guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].
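> For reference, a minimal sketch of how the job is wired up; the topic name, group id, broker address, checkpoint interval, and job name below are illustrative, not our exact values:
> {code:java}
> import java.util.Properties
>
> import org.apache.flink.streaming.api.scala._
> import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
> import org.apache.flink.streaming.util.serialization.SimpleStringSchema
>
> val env = StreamExecutionEnvironment.getExecutionEnvironment
> // Checkpoint every 60 s; both the Kafka source and the bucketing
> // sink participate in these checkpoints.
> env.enableCheckpointing(60000)
>
> val kafkaProps = new Properties()
> kafkaProps.setProperty("bootstrap.servers", "kafka:9092") // illustrative
> kafkaProps.setProperty("group.id", "some-group")          // illustrative
>
> val source = env.addSource(
>   new FlinkKafkaConsumer09[String]("some_topic", new SimpleStringSchema(), kafkaProps))
>
> source.addSink(fileSink)
> env.execute("kafka-to-hdfs")
> {code}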
> On failure, some files which appear to be complete (not {{in_progress}} files or similar, but under 1 GB and confirmed to have been created at the time of the failure) turn out to have their last line cut off. In our case this is visible because we write line-by-line JSON, so the truncation produces an invalid JSON line. This does not happen on every failure, but I have seen at least 3 incidents like this.
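> A naive offline check makes the truncation easy to spot (hypothetical helper, not part of the pipeline): every complete line is a single JSON object, so a cut-off last line fails a simple start/end test:
> {code:java}
> import scala.io.Source
>
> // Flag lines that are not complete JSON objects. A truncated last
> // line will not end with '}'. Returns (line number, line) pairs.
> def truncatedLines(path: String): List[(Int, String)] = {
>   val src = Source.fromFile(path)
>   try {
>     src.getLines().zipWithIndex.collect {
>       case (line, i) if !(line.startsWith("{") && line.endsWith("}")) =>
>         (i + 1, line)
>     }.toList
>   } finally src.close()
> }
> {code}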
> Also, I am not sure if this is a separate bug, but in this case we also see some data duplication coming from Kafka: after the pipeline is restarted, the Kafka source re-emits a number of messages that had already been saved in the previous file. We can tell the messages are duplicates because they have the same data but a different timestamp, which is added within the Flink pipeline. In theory this should not happen, as the source and sink both have an [exactly-once guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].
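> The duplication can be confirmed with a similar offline check: strip the pipeline-added timestamp from each record and count identical payloads (the field name "timestamp" and its string encoding are illustrative):
> {code:java}
> import scala.io.Source
>
> // Count payloads that occur more than once after dropping the
> // timestamp field our pipeline adds (field name is illustrative).
> def duplicatePayloads(path: String): Map[String, Int] = {
>   val src = Source.fromFile(path)
>   try {
>     src.getLines()
>       .map(_.replaceAll(""""timestamp"\s*:\s*"[^"]*",?""", ""))
>       .toList
>       .groupBy(identity)
>       .collect { case (payload, hits) if hits.size > 1 => (payload, hits.size) }
>   } finally src.close()
> }
> {code}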



--
This message was sent by Atlassian Jira
(v8.20.1#820001)