Posted to dev@flink.apache.org by Andra Lungu <lu...@gmail.com> on 2015/10/21 11:40:50 UTC

Using Flink Streaming to write to multiple output files in HDFS

Hey guys,

Long time, no see :). I recently started a new job and it involves
performing real-time data analytics using Apache Kafka, Storm
and Flume.

What happens, on a very high level, is that a set of signals is
collected, stored into a Kafka topic, and then Storm is used to filter
certain fields out or to enrich them with other
meta-information. Finally, Flume writes the output into multiple HDFS
files depending on the date, hour, etc.

Now, I saw that Flink can run a similar pipeline, but without
needing Flume for the writing-to-HDFS part (see
http://data-artisans.com/kafka-flink-a-practical-how-to/). Which
brings me to my question: how does Flink handle writing to multiple
files in a streaming fashion? (Until now, I was playing with batch,
and writeAsCsv just took one file as a parameter.)

Next question: What are the prerequisites to deploy a Flink Streaming
job on a cluster? Yarn, HDFS, anything else?

Final question, more of a request: I'd like to play around with Flink
Streaming to see whether it can substitute for Storm in this use case
and whether it can outrun it :P. To this end, I'll need some starting
points: docs, blog posts, examples to read. Any input would be useful.

I wanted to dig for a newbie task in the streaming area, but I could
not find one... can we think of something easy to get me started?

Thanks! Hope you guys had fun at Flink Forward!
Andra

Re: Using Flink Streaming to write to multiple output files in HDFS

Posted by Nyamath Ulla Khan <ul...@gmail.com>.
Hi Andra,

You can find some very interesting examples for Flink streaming with Kafka
(input/output) at these links (and a small starter sketch below them):

https://flink.apache.org/news/2015/02/09/streaming-example.html
http://dataartisans.github.io/flink-training/exercises/ (contains examples
for most of the different operators)
http://dataartisans.github.io/flink-training/exercises/rideCleansing.html
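To give you a quick taste, reading from Kafka into a DataStream looks
roughly like this. This is only a minimal sketch in the spirit of the
first link above -- the broker address, topic name, and group id are
placeholders you would replace with your own:

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer082;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaReadSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder connection settings -- adjust to your cluster.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("zookeeper.connect", "localhost:2181");
        props.setProperty("group.id", "flink-demo");

        // Consume the Kafka topic as an endless stream of strings.
        DataStream<String> signals = env.addSource(
                new FlinkKafkaConsumer082<>("signals", new SimpleStringSchema(), props));

        signals.print();
        env.execute("Kafka read sketch");
    }
}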

I hope this helps you get started with the Flink Streaming API.

Cheers,
Nyamath Ulla Khan


Re: Using Flink Streaming to write to multiple output files in HDFS

Posted by Robert Metzger <rm...@apache.org>.
Hey Andra,

Were you able to answer your questions with Aljoscha's and Fabian's links?

Flink's streaming file sink is quite unique (compared to Flume) because it
supports exactly-once semantics. Also, the performance compared to Storm is
probably much better, so you can save a lot of resources.
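
In case it helps: the exactly-once behavior of the file sink relies on
Flink's checkpointing, which you enable on the streaming environment.
A minimal sketch -- the 5-second interval is just an example value:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Take a checkpoint every 5 seconds. The file sink participates in
// checkpointing, which is what gives you the exactly-once guarantee.
env.enableCheckpointing(5000);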



Re: Using Flink Streaming to write to multiple output files in HDFS

Posted by Fabian Hueske <fh...@gmail.com>.
There are also training slides and programming exercises (incl. reference
solutions) for the DataStream API at

--> http://dataartisans.github.io/flink-training/

Cheers, Fabian


Re: Using Flink Streaming to write to multiple output files in HDFS

Posted by Aljoscha Krettek <al...@apache.org>.
Hi,
the documentation has a guide about the Streaming API:
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html

This also contains a section about the rolling (HDFS) FileSystem sink:
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#hadoop-filesystem
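
That sink also covers the bucketing into "multiple files depending on the
date, hour etc." that you described. A minimal sketch, assuming a
DataStream<String> called stream -- the base path, time pattern, and the
128 MB roll size are made-up example values:

import org.apache.flink.streaming.connectors.fs.DateTimeBucketer;
import org.apache.flink.streaming.connectors.fs.RollingSink;

// Buckets roll whenever the date/time pattern changes, producing
// directories like hdfs:///base/path/2015-10-21--14/part-0-0
// (one bucket per hour with this pattern).
RollingSink<String> sink = new RollingSink<String>("hdfs:///base/path");
sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HH"));
sink.setBatchSize(1024 * 1024 * 128); // additionally roll part files at 128 MB

stream.addSink(sink);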

For blog entries I would suggest these:
 - http://data-artisans.com/real-time-stream-processing-the-next-step-for-apache-flink/
 - http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
 - http://data-artisans.com/kafka-flink-a-practical-how-to/

I don't think we have any easy starter issues right now on the Streaming API. But some might come up in the future. :D

Cheers,
Aljoscha