You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Chiwan Park <ch...@apache.org> on 2016/06/22 10:22:12 UTC

Re: Code related to spilling data to disk

Hi Tae-Geon,

AFAIK, spilling *data* to disk happens only when managed memory is used. Currently, streaming API (DataStream) doesn’t use managed memory yet. `MutableHashTable` is one of representative usage of managed memory with disk spilling. Note that some special structures such as `CompactingHashTable` doesn’t spill data to disk even though they use the manage memory to achieve high performance.

About spilling *states*, I think that it depends on how state backends is implemented. For example, `FsStateBackend` saves states to file system but `MemoryStateBackend` doesn’t. `RocksDBStateBackend` uses memory first and also can spill states to disk.

Regards,
Chiwan Park

> On Jun 22, 2016, at 3:27 PM, Tae-Geon Um <ta...@gmail.com> wrote:
> 
> I have another question. 
> Is the spilling only executed on batch mode? 
> What happen on streaming mode?  
> 
>> On Jun 22, 2016, at 1:48 PM, Tae-Geon Um <ta...@gmail.com> wrote:
>> 
>> Hi, all
>> 
>> As far as I know, Flink spills data (states?) to disk if the data exceeds memory threshold or there exists memory pressure.
>> i’d like to know the detail of how Flink spills data to disk. 
>> 
>> Could you please let me know which codes do I have to investigate? 
>> 
>> Thanks,
>> Taegeon
>

Re: Code related to spilling data to disk

Posted by Aljoscha Krettek <al...@apache.org>.

Hi,
Chiwan is correct. The reason why we're (not yet) using managed memory in
the streaming API (DataStream) is that it was easier to get things up and
running by just using JVM heap. We're hoping to change this at some point
in the future, though.

Cheers,
Aljoscha

On Wed, 22 Jun 2016 at 14:05 Chiwan Park <ch...@apache.org> wrote:

> Hi,
>
> I’m not sure about the reason to use JVM heap instead of managed memory,
> but It seems that the reason is using JVM heap makes development easier.
> Maybe Stephan can give exact answer to you. I think managed memory still
> has benefit in terms of GC  time and memory utilization.
>
> The Flink community has a plan [1] to move data structures for streaming
> operators to managed memory.
>
> [1]:
> https://docs.google.com/document/d/1ExmtVpeVVT3TIhO1JoBpC5JKXm-778DAD7eqw5GANwE/edit#
>
> Regards,
> Chiwan Park
>
> > On Jun 22, 2016, at 8:39 PM, Tae-Geon Um <ta...@gmail.com> wrote:
> >
> > Thank you for your answer to my question, Chiwan :)
> > Can I ask another question?
> >
> >
> >> On Jun 22, 2016, at 7:22 PM, Chiwan Park <ch...@apache.org> wrote:
> >>
> >> Hi Tae-Geon,
> >>
> >> AFAIK, spilling *data* to disk happens only when managed memory is
> used. Currently, streaming API (DataStream) doesn’t use managed memory yet.
> `MutableHashTable` is one of representative usage of managed memory with
> disk spilling. Note that some special structures such as
> `CompactingHashTable` doesn’t spill data to disk even though they use the
> manage memory to achieve high performance.
> >
> > As far as I understand, spilling data is only performed on batch mode.
> > Do you know why streaming mode does not use managed memory?
> > Is this because the performance gain is negligible?
> >
> >>
> >> About spilling *states*, I think that it depends on how state backends
> is implemented. For example, `FsStateBackend` saves states to file system
> but `MemoryStateBackend` doesn’t. `RocksDBStateBackend` uses memory first
> and also can spill states to disk.
> >
> > I’ve found a nice document on the state backend [1]. I will take a look
> at this doc to know the detail.
> > Thanks!
> >
> > Taegeon
> >
> > [1]:
> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/state_backends.html#state-backends
> >
> >>
> >> Regards,
> >> Chiwan Park
> >>
> >>> On Jun 22, 2016, at 3:27 PM, Tae-Geon Um <ta...@gmail.com> wrote:
> >>>
> >>> I have another question.
> >>> Is the spilling only executed on batch mode?
> >>> What happen on streaming mode?
> >>>
> >>>> On Jun 22, 2016, at 1:48 PM, Tae-Geon Um <ta...@gmail.com> wrote:
> >>>>
> >>>> Hi, all
> >>>>
> >>>> As far as I know, Flink spills data (states?) to disk if the data
> exceeds memory threshold or there exists memory pressure.
> >>>> i’d like to know the detail of how Flink spills data to disk.
> >>>>
> >>>> Could you please let me know which codes do I have to investigate?
> >>>>
> >>>> Thanks,
> >>>> Taegeon
> >>>
> >>
>
>

Re: Code related to spilling data to disk

Posted by Chiwan Park <ch...@apache.org>.

Hi,

I’m not sure about the reason to use JVM heap instead of managed memory, but It seems that the reason is using JVM heap makes development easier. Maybe Stephan can give exact answer to you. I think managed memory still has benefit in terms of GC  time and memory utilization.

The Flink community has a plan [1] to move data structures for streaming operators to managed memory.

[1]: https://docs.google.com/document/d/1ExmtVpeVVT3TIhO1JoBpC5JKXm-778DAD7eqw5GANwE/edit#

Regards,
Chiwan Park

> On Jun 22, 2016, at 8:39 PM, Tae-Geon Um <ta...@gmail.com> wrote:
> 
> Thank you for your answer to my question, Chiwan :)  
> Can I ask another question?  
> 
> 
>> On Jun 22, 2016, at 7:22 PM, Chiwan Park <ch...@apache.org> wrote:
>> 
>> Hi Tae-Geon,
>> 
>> AFAIK, spilling *data* to disk happens only when managed memory is used. Currently, streaming API (DataStream) doesn’t use managed memory yet. `MutableHashTable` is one of representative usage of managed memory with disk spilling. Note that some special structures such as `CompactingHashTable` doesn’t spill data to disk even though they use the manage memory to achieve high performance.
> 
> As far as I understand, spilling data is only performed on batch mode. 
> Do you know why streaming mode does not use managed memory? 
> Is this because the performance gain is negligible?
> 
>> 
>> About spilling *states*, I think that it depends on how state backends is implemented. For example, `FsStateBackend` saves states to file system but `MemoryStateBackend` doesn’t. `RocksDBStateBackend` uses memory first and also can spill states to disk.
> 
> I’ve found a nice document on the state backend [1]. I will take a look at this doc to know the detail. 
> Thanks! 
> 
> Taegeon
> 
> [1]: https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/state_backends.html#state-backends
> 
>> 
>> Regards,
>> Chiwan Park
>> 
>>> On Jun 22, 2016, at 3:27 PM, Tae-Geon Um <ta...@gmail.com> wrote:
>>> 
>>> I have another question. 
>>> Is the spilling only executed on batch mode? 
>>> What happen on streaming mode?  
>>> 
>>>> On Jun 22, 2016, at 1:48 PM, Tae-Geon Um <ta...@gmail.com> wrote:
>>>> 
>>>> Hi, all
>>>> 
>>>> As far as I know, Flink spills data (states?) to disk if the data exceeds memory threshold or there exists memory pressure.
>>>> i’d like to know the detail of how Flink spills data to disk. 
>>>> 
>>>> Could you please let me know which codes do I have to investigate? 
>>>> 
>>>> Thanks,
>>>> Taegeon
>>> 
>>

Re: Code related to spilling data to disk

Posted by Tae-Geon Um <ta...@gmail.com>.

Thank you for your answer to my question, Chiwan :)  
Can I ask another question?  


> On Jun 22, 2016, at 7:22 PM, Chiwan Park <ch...@apache.org> wrote:
> 
> Hi Tae-Geon,
> 
> AFAIK, spilling *data* to disk happens only when managed memory is used. Currently, streaming API (DataStream) doesn’t use managed memory yet. `MutableHashTable` is one of representative usage of managed memory with disk spilling. Note that some special structures such as `CompactingHashTable` doesn’t spill data to disk even though they use the manage memory to achieve high performance.

As far as I understand, spilling data is only performed on batch mode. 
Do you know why streaming mode does not use managed memory? 
Is this because the performance gain is negligible?

> 
> About spilling *states*, I think that it depends on how state backends is implemented. For example, `FsStateBackend` saves states to file system but `MemoryStateBackend` doesn’t. `RocksDBStateBackend` uses memory first and also can spill states to disk.

I’ve found a nice document on the state backend [1]. I will take a look at this doc to know the detail. 
Thanks! 

Taegeon

[1]: https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/state_backends.html#state-backends <https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/state_backends.html#state-backends>

> 
> Regards,
> Chiwan Park
> 
>> On Jun 22, 2016, at 3:27 PM, Tae-Geon Um <ta...@gmail.com> wrote:
>> 
>> I have another question. 
>> Is the spilling only executed on batch mode? 
>> What happen on streaming mode?  
>> 
>>> On Jun 22, 2016, at 1:48 PM, Tae-Geon Um <ta...@gmail.com> wrote:
>>> 
>>> Hi, all
>>> 
>>> As far as I know, Flink spills data (states?) to disk if the data exceeds memory threshold or there exists memory pressure.
>>> i’d like to know the detail of how Flink spills data to disk. 
>>> 
>>> Could you please let me know which codes do I have to investigate? 
>>> 
>>> Thanks,
>>> Taegeon
>> 
>