You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by "Jakub Nowacki (JIRA)" <ji...@apache.org> on 2017/04/06 12:03:41 UTC
[jira] [Created] (FLINK-6272) Rolling file sink saves incomplete
lines on failure
Jakub Nowacki created FLINK-6272:
------------------------------------
Summary: Rolling file sink saves incomplete lines on failure
Key: FLINK-6272
URL: https://issues.apache.org/jira/browse/FLINK-6272
Project: Flink
Issue Type: Bug
Components: filesystem-connector, Streaming Connectors
Affects Versions: 1.2.0
Environment: Flink 1.2.0, Scala 2.11, Debian GNU/Linux 8.7 (jessie), CDH 5.8, YARN
Reporter: Jakub Nowacki
We have simple pipeline with Kafka source (0.9), which transforms data and writes to Rolling File Sink, which runs on YARN. The sink is a plain HDFS sink with StringWriter configured as follows:
{code:java}
val fileSink = new BucketingSink[String]("some_path")
fileSink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd"))
fileSink.setWriter(new StringWriter())
fileSink.setBatchSize(1024 * 1024 * 1024) // this is 1 GB
{code}
Checkpoint is on. Both Kafka source and File sink are in theory with [exactly-once guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].
On failure in some files, which seem to be complete (not {{in_progress}} files ore something, but under 1 GB and confirmed to be created on failure), it comes out that the last line is cut. In our case it shows because we save the data in line-by-line JSON and this creates invalid JSON line. This does not happen always when the but I noticed at least 3 incidents like that at least.
Also, I am not sure if it is a separate bug but we see some data duplication in this case coming from Kafka. I.e.after the pipeline is restarted some number of messages come out from Kafka source, which already have been saved in the previous file. We can check that the messages are duplicated as they have same data but different timestamp, which is added within Flink pipeline. This should not happen in theory as the sink and source have [exactly-once guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)