Posted to user@flink.apache.org by Piper Piper <pi...@gmail.com> on 2020/01/29 06:06:41 UTC

Flink+YARN HDFS replication factor

Hello,

When running Flink on YARN (with HDFS) as a long-running session-mode
cluster, with a Flink client submitting jobs, HDFS may be configured with a
replication factor greater than 1 (for example, 3).

I would like to know when and how any of the data (such as event data or
batch data) or code (such as JARs) in a Flink job is saved to HDFS and
therefore replicated across the nodes of the YARN cluster.

For example, in a streaming application, does all event data stay in
memory (RAM) until it reaches the DAG's sink, and only then have to be
saved to HDFS?

Thank you,

Piper

Re: Flink+YARN HDFS replication factor

Posted by Till Rohrmann <tr...@apache.org>.
The same applies to Flink. Transient data will only be stored on local
disks.

Cheers,
Till

On Thu, Jan 30, 2020 at 9:10 PM Piper Piper <pi...@gmail.com> wrote:

> Please disregard my previous email. I found the answer online.
>
> I thought writing data to local disk automatically meant the data would be
> persisted to HDFS. However, Spark writes data (in between shuffles) to
> local disk only.
>
> Thanks

Re: Flink+YARN HDFS replication factor

Posted by Piper Piper <pi...@gmail.com>.
Please disregard my previous email. I found the answer online.

I thought writing data to local disk automatically meant the data would be
persisted to HDFS. However, Spark writes data (in between shuffles) to
local disk only.

Thanks

On Thu, Jan 30, 2020, 2:00 PM Piper Piper <pi...@gmail.com> wrote:

> Hi Till,
>
> Thank you for the information!
>
> In the case of wide transformations, Spark writes input data to disk
> between shuffles. I was wondering whether Flink does that as well (even
> for windows of streaming data), and whether that data on disk is persisted
> to HDFS and honors the replication factor.
>
> Best,
>
> Pankaj

Re: Flink+YARN HDFS replication factor

Posted by Piper Piper <pi...@gmail.com>.
Hi Till,

Thank you for the information!

In the case of wide transformations, Spark writes input data to disk
between shuffles. I was wondering whether Flink does that as well (even
for windows of streaming data), and whether that data on disk is persisted
to HDFS and honors the replication factor.

Best,

Pankaj

On Wed, Jan 29, 2020 at 9:56 AM Till Rohrmann <tr...@apache.org> wrote:

> Hi Piper,
>
> In general, Flink does not store transient data such as event data on
> HDFS. Event data (data that is sent between the TaskManagers for
> processing) is kept only in memory and, if it grows too large, is spilled
> to local disk by some operators.
>
> What Flink does store on HDFS (if it is configured to do so) is the state
> data that is part of a job's checkpoints. In addition, Flink stores job
> information such as the JobGraph and the corresponding blobs (JARs and job
> artifacts) on HDFS, if so configured.
>
> Cheers,
> Till

Re: Flink+YARN HDFS replication factor

Posted by Till Rohrmann <tr...@apache.org>.
Hi Piper,

In general, Flink does not store transient data such as event data on HDFS.
Event data (data that is sent between the TaskManagers for processing) is
kept only in memory and, if it grows too large, is spilled to local disk by
some operators.

What Flink does store on HDFS (if it is configured to do so) is the state
data that is part of a job's checkpoints. In addition, Flink stores job
information such as the JobGraph and the corresponding blobs (JARs and job
artifacts) on HDFS, if so configured.

Cheers,
Till
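
The "if configured" part of the answer above can be made concrete with a
flink-conf.yaml fragment. This is only a sketch under assumptions: the key
names are those used by Flink releases from around the time of this thread
(check the documentation for your version), and the HDFS paths and ZooKeeper
hosts are hypothetical placeholders that do not come from the thread.

```yaml
# Sketch only: all paths and host names below are hypothetical placeholders.

# Checkpointed state is written to HDFS, so HDFS replicates it.
state.backend: filesystem
state.checkpoints.dir: hdfs:///flink/checkpoints

# With high availability enabled, job metadata (the JobGraph and blobs such
# as JARs) is kept under this HDFS directory.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-host-1:2181,zk-host-2:2181
high-availability.storageDir: hdfs:///flink/ha

# Spill files for transient data stay on the local disks of each node.
# On YARN this normally defaults to the container's local directories,
# so it is left commented out here.
# io.tmp.dirs: /tmp/flink
```

Files written under the hdfs:// paths would be subject to the HDFS
replication factor. The spill directories are node-local and would not be
replicated.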

On Wed, Jan 29, 2020 at 7:07 AM Piper Piper <pi...@gmail.com> wrote:

> Hello,
>
> When running Flink on YARN (with HDFS) as a long-running session-mode
> cluster, with a Flink client submitting jobs, HDFS may be configured with a
> replication factor greater than 1 (for example, 3).
>
> I would like to know when and how any of the data (such as event data or
> batch data) or code (such as JARs) in a Flink job is saved to HDFS and
> therefore replicated across the nodes of the YARN cluster.
>
> For example, in a streaming application, does all event data stay in
> memory (RAM) until it reaches the DAG's sink, and only then have to be
> saved to HDFS?
>
> Thank you,
>
> Piper
>