Posted to user@flink.apache.org by amran dean <ad...@gmail.com> on 2019/10/16 01:54:24 UTC

Verifying correctness of StreamingFileSink (Kafka -> S3)

I am evaluating StreamingFileSink (Kafka 0.10.11) as a production-ready
alternative to a current Kafka -> S3 solution.

Is there any way to verify the integrity of the data written to S3? I'm
confused about how the file names (e.g. part-1-17) map to Kafka partitions, and
further unsure how to ensure that no Kafka records are lost (I know Flink
guarantees exactly-once, but this is more of a sanity check).

Fwd: Verifying correctness of StreamingFileSink (Kafka -> S3)

Posted by Kostas Kloudas <kk...@apache.org>.
Hi Amran,

I am including also the message to the public ML because this may be
interesting to other users as well.

You can always write your input stream to two sinks, as shown below:

// One input stream from Kafka, written to two sinks.
// (FlinkKafkaConsumerVERSION stands for the consumer matching your Kafka
// version, e.g. FlinkKafkaConsumer010.)
DataStream<String> myInputFromKafka =
    env.addSource(new FlinkKafkaConsumerVERSION(...));

final StreamingFileSink<String> sinkA = StreamingFileSink
    .forRowFormat(new Path(outputPathA), new SimpleStringEncoder<>("UTF-8"))
    .withBucketAssigner(assignerA)
    .build();

final StreamingFileSink<String> sinkB = StreamingFileSink
    .forRowFormat(new Path(outputPathB), new SimpleStringEncoder<>("UTF-8"))
    .withBucketAssigner(assignerB)
    .build();

myInputFromKafka.addSink(sinkA);
myInputFromKafka.addSink(sinkB);

Cheers,
Kostas

---------- Forwarded message ---------
From: amran dean <ad...@gmail.com>
Date: Wed, Oct 16, 2019 at 8:43 PM
Subject: Re: Verifying correctness of StreamingFileSink (Kafka -> S3)
To: Kostas Kloudas <kk...@apache.org>


Hi Kostas,
Suppose there is a requirement that the current S3 object format cannot
change (to avoid a client migration).
Would it be possible to achieve a sort of dual write, where Kafka
records are written to two separate S3 prefixes:

/bucket/topic/data/dt=2019-10-16/part-x-xx...
- One bucket holds the raw record data in the format clients already expect

/bucket/topic/metadata/dt=2019-10-16/partition_x/part-x-xx...
- A second bucket holds only offset data, for data integrity
verification via periodic checks (i.e. no "holes")

Does StreamingFileSink permit this?
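
For illustration, here is a rough sketch of the dual write I have in mind
(RecordWithMetadata, RecordWithMetadataSchema, and PartitionBucketAssigner
are made-up names, not existing classes):

// Hypothetical sketch: RecordWithMetadata keeps each value together with its
// Kafka topic/partition/offset; RecordWithMetadataSchema is a custom
// KafkaDeserializationSchema producing it. Use the FlinkKafkaConsumer
// matching your Kafka version.
DataStream<RecordWithMetadata> records = env.addSource(
    new FlinkKafkaConsumer<>("topic", new RecordWithMetadataSchema(), properties));

// Sink 1: raw record values under /bucket/topic/data/dt=.../, format unchanged.
final StreamingFileSink<String> rawSink = StreamingFileSink
    .forRowFormat(new Path("s3://bucket/topic/data"), new SimpleStringEncoder<String>("UTF-8"))
    .withBucketAssigner(new DateTimeBucketAssigner<String>("'dt='yyyy-MM-dd"))
    .build();
records.map(r -> r.value).addSink(rawSink);

// Sink 2: "partition,offset" lines under /bucket/topic/metadata/partition_x/,
// used only for periodic gap checks.
final StreamingFileSink<String> offsetSink = StreamingFileSink
    .forRowFormat(new Path("s3://bucket/topic/metadata"), new SimpleStringEncoder<String>("UTF-8"))
    .withBucketAssigner(new PartitionBucketAssigner())  // hypothetical, buckets by partition
    .build();
records.map(r -> r.partition + "," + r.offset).addSink(offsetSink);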

On Wed, Oct 16, 2019 at 12:34 AM Kostas Kloudas <kk...@apache.org> wrote:
>
> Hi Amran,
>
> If you want to know which partition your input data comes from,
> you can always have a separate bucket for each partition.
> As described in [1], you can extract the offset/partition/topic
> information for an incoming record and, based on this, decide the
> appropriate bucket to put the record in.
>
> Cheers,
> Kostas
>
> [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html
>
> On Wed, Oct 16, 2019 at 4:00 AM amran dean <ad...@gmail.com> wrote:
> >
> > I am evaluating StreamingFileSink (Kafka 0.10.11) as a production-ready alternative to a current Kafka -> S3 solution.
> >
> > Is there any way to verify the integrity of data written in S3? I'm confused how the file names (e.g part-1-17) map to Kafka partitions, and further unsure how to ensure that no Kafka records are lost (I know Flink guarantees exactly-once, but this is more of a sanity check).

Re: Verifying correctness of StreamingFileSink (Kafka -> S3)

Posted by Kostas Kloudas <kk...@apache.org>.
Hi Amran,

If you want to know which partition your input data comes from,
you can always have a separate bucket for each partition.
As described in [1], you can extract the offset/partition/topic
information for an incoming record and, based on this, decide the
appropriate bucket to put the record in.
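
As a rough, untested sketch (RecordWithMetadata is a made-up POJO carrying
value/topic/partition/offset), the deserialization schema and the bucket
assigner could look like this:

// A KafkaDeserializationSchema that keeps the partition/offset/topic next to
// the record value (RecordWithMetadata is a hypothetical POJO).
public class RecordWithMetadataSchema
        implements KafkaDeserializationSchema<RecordWithMetadata> {

    @Override
    public RecordWithMetadata deserialize(ConsumerRecord<byte[], byte[]> record) {
        return new RecordWithMetadata(
            new String(record.value(), StandardCharsets.UTF_8),
            record.topic(), record.partition(), record.offset());
    }

    @Override
    public boolean isEndOfStream(RecordWithMetadata nextElement) {
        return false;
    }

    @Override
    public TypeInformation<RecordWithMetadata> getProducedType() {
        return TypeInformation.of(RecordWithMetadata.class);
    }
}

// A BucketAssigner that puts each record into a bucket named after its
// Kafka partition, e.g. .../partition_3/part-x-xx.
public class PartitionBucketAssigner
        implements BucketAssigner<RecordWithMetadata, String> {

    @Override
    public String getBucketId(RecordWithMetadata element, Context context) {
        return "partition_" + element.partition;
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}

As far as I know, the part-x-y counters in the file names come from the
sink's subtask index and a rolling counter, not from Kafka partitions, which
is why the bucket id is what has to carry the partition information.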

Cheers,
Kostas

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html

On Wed, Oct 16, 2019 at 4:00 AM amran dean <ad...@gmail.com> wrote:
>
> I am evaluating StreamingFileSink (Kafka 0.10.11) as a production-ready alternative to a current Kafka -> S3 solution.
>
> Is there any way to verify the integrity of data written in S3? I'm confused how the file names (e.g part-1-17) map to Kafka partitions, and further unsure how to ensure that no Kafka records are lost (I know Flink guarantees exactly-once, but this is more of a sanity check).