Posted to dev@beam.apache.org by Rakesh Kumar <ra...@lyft.com> on 2018/11/15 21:08:18 UTC

Need help regarding memory leak issue

I am using the *Beam Python SDK* to run my app in production. The app is
running machine learning models. I am noticing a memory leak which
eventually kills the application, and I am not sure of its source.
Currently, I am using objgraph
<https://mg.pov.lt/objgraph/#memory-leak-example> to dump the memory stats,
and I hope I will get some useful information out of this. I have also
looked into the Guppy library <https://pypi.org/project/guppy/>, and the
results are almost the same.
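
For context, this is roughly the kind of objgraph instrumentation I am
running inside the DoFn (a simplified sketch; PredictDoFn and the
per-1,000-element logging interval stand in for our actual transform):

import gc
import logging

import apache_beam as beam
import objgraph


class PredictDoFn(beam.DoFn):
    """Simplified stand-in for our model DoFn, instrumented with objgraph."""

    def __init__(self):
        self._seen = 0

    def process(self, element):
        result = element  # placeholder for the real model.predict(element)

        self._seen += 1
        if self._seen % 1000 == 0:
            gc.collect()
            # Log the most common live object types, plus which types have
            # grown since the previous snapshot.
            logging.info("objgraph most common types: %s",
                         objgraph.most_common_types(limit=20))
            objgraph.show_growth(limit=10)
        yield result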

Do you have any recommendations for debugging this issue? Is there any
tooling in the SDK that can help debug it?
Please feel free to share your experience if you have debugged similar
issues in the past.

Thank you,
Rakesh

Re: Need help regarding memory leak issue

Posted by Rakesh Kumar <ra...@lyft.com>.
On Fri, Nov 16, 2018 at 3:08 PM Ruoyun Huang <ru...@google.com> wrote:

> Even though the algorithm works on your batch system, did you verify
> anything that can rule out the possibility that it is the underlying ML
> package causing the memory leak?
>
It is possible that the ML packages can cause a memory leak, but when we
tried calling the model function in a for loop for 1,000 iterations, we
didn't notice any memory increase. Since the memory leak is small and grows
over several hours, I am thinking of running the model 10,000 times and
observing the reference count and memory profile after every thousand runs.
I hope this will give some hints.
We have also tried explicitly deleting the model object, input object, and
output object once their usage is done in the operator method. After
deleting these objects we also called `gc.collect()`. But we still notice
that the memory usage is increasing over time. We also tried to log the
most common object counts, but the number of references is almost constant
for each iteration. We are going to log the memory size of these objects
to get more information.
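
Roughly, the isolation test I have in mind looks like this (a simplified
sketch; model_factory, predict, and the 10,000-run count are placeholders
for our actual code):

import gc
import logging

import objgraph


def run_prediction_loop(model_factory, inputs, iterations=10000):
    """Run the model many times outside Beam and watch object growth."""
    for i, raw in enumerate(inputs[:iterations]):
        model = model_factory()      # placeholder: load or reuse the model
        output = model.predict(raw)  # placeholder for the real call

        # Explicitly drop references once we are done with them and force a
        # collection, mirroring what we tried inside the operator method.
        del model, output
        gc.collect()

        if (i + 1) % 1000 == 0:
            # So far the counts have been nearly constant per iteration; the
            # next step is to also log the sizes of these objects.
            logging.info("after %d runs: %s", i + 1,
                         objgraph.most_common_types(limit=15))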


> If not, maybe replace your prediction with a dummy function that does not
> load any model at all and always gives the same prediction. Then do the
> same plotting and let us see what it looks like. As a follow-up, version
> two: still a dummy prediction, but with the model loaded. Given we don't
> have much of a clue at this stage, this can at least give us more
> confidence about whether the issue comes from the underlying ML package or
> from the Beam SDK. Just my 2 cents.
>
>
We have tried one version that allocates a large amount of memory in an
operator method. We observed that the memory usage oscillates but doesn't
increase over time. We will try your suggested ideas and report back here.
When the Beam and model methods are run in isolation they don't show any
increase in memory consumption, so we feel that it is the interaction
between them that is causing the issue.
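
For reference, the allocation-only control we tried looks roughly like this
(a sketch; the DoFn name and the ~100 MB payload size are illustrative):

import apache_beam as beam


class AllocateOnlyDoFn(beam.DoFn):
    """Control experiment: allocate a large buffer per element instead of
    running the model. With this the memory usage oscillates but does not
    grow over time."""

    def process(self, element):
        scratch = bytearray(100 * 1024 * 1024)  # ~100 MB, released on return
        yield len(scratch)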



Re: Need help regarding memory leak issue

Posted by Rakesh Kumar <ra...@lyft.com>.
On Fri, Nov 16, 2018 at 3:36 PM Udi Meiri <eh...@google.com> wrote:

> If you're working with Dataflow, it supports this flag:
> https://github.com/apache/beam/blob/75e9f645c7bec940b87b93f416823b020e4c5f69/sdks/python/apache_beam/options/pipeline_options.py#L602
> which uses guppy for heap profiling.
>

This is a really useful flag. Unfortunately, we are using Beam + Flink. It
would be really useful to have a similar flag for the other streaming
engines.


Re: Need help regarding memory leak issue

Posted by Udi Meiri <eh...@google.com>.
If you're working with Dataflow, it supports this flag:
https://github.com/apache/beam/blob/75e9f645c7bec940b87b93f416823b020e4c5f69/sdks/python/apache_beam/options/pipeline_options.py#L602
which uses guppy for heap profiling.
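
For example, roughly like this when constructing the pipeline (option names
are from the ProfilingOptions class linked above, so double-check them
against your SDK version; the bucket path is a placeholder and the other
required Dataflow options are omitted):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch only: --profile_memory enables periodic heap profiling (via guppy)
# and --profile_location controls where the profiles are written. Project,
# region, temp_location, etc. are omitted here.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--profile_memory',
    '--profile_location=gs://my-bucket/profiles/',  # hypothetical path
])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create([1, 2, 3])
     | beam.Map(lambda x: x * x))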

On Fri, Nov 16, 2018 at 3:08 PM Ruoyun Huang <ru...@google.com> wrote:

> Even tough the algorithm works on your batch system, did you verify
> anything that can rule out the possibility where it is the underlying ML
> package causing the memory leak?
>
> If not, maybe replace your prediction with a dummy function which does not
> load any model at all, and always just give the same prediction. Then do
> the same plotting, let us see what it looks like. And a plus with version
> two: still a dummy prediction, but with model loaded.    Given we don't
> have much clue at this stage, at least this probably can give us more
> confidence in whether it is the underlying ML package causing the issue, or
> from beam sdk. just my 2 cents.
>
>
> On Thu, Nov 15, 2018 at 4:54 PM Rakesh Kumar <ra...@lyft.com> wrote:
>
>> Thanks for responding Ruoyun,
>>
>> We are not sure yet who is causing the leak, but once we run out of the
>> memory then sdk worker crashes and pipeline is forced to restart. Check the
>> memory usage patterns in the attached image. Each line in that graph is
>> representing one task manager host.
>>  You are right we are running the models for predictions.
>>
>> Here are few observations:
>>
>> 1. All the tasks manager memory usage climb over time but some of the
>> task managers' memory climb really fast because they are running the ML
>> models. These models are definitely using memory intensive data structure
>> (pandas data frame etc) hence their memory usage climb really fast.
>> 2. We had almost the same code running in different infrastructure
>> (non-streaming) that doesn't cause any memory issue.
>> 3. Even when the pipeline has restarted, the memory is not released. It
>> is still hogged by something. You can notice in the attached image that
>> pipeline restarted around 13:30. At that time it is definitely released
>> some portion of the memory but didn't completely released all memory.
>> Notice that, when the pipeline was originally started, it started with 30%
>> of the memory but when got restarted by the job manager it started with 60%
>> of the memory.
>>
>>
>>
>> On Thu, Nov 15, 2018 at 3:31 PM Ruoyun Huang <ru...@google.com> wrote:
>>
>>> trying to understand the situation you are having.
>>>
>>> By saying 'kills the appllication', is that a leak in the application
>>> itself, or the workers being the root cause?  Also are you running ML
>>> models inside Python SDK DoFn's?  Then I suppose it is running some
>>> predictions rather than model training?
>>>
>>> On Thu, Nov 15, 2018 at 1:08 PM Rakesh Kumar <ra...@lyft.com>
>>> wrote:
>>>
>>>> I am using *Beam Python SDK *to run my app in production. The app is
>>>> running machine learning models. I am noticing some memory leak which
>>>> eventually kills the application. I am not sure the source of memory leak.
>>>> Currently, I am using object graph
>>>> <https://mg.pov.lt/objgraph/#memory-leak-example> to dump the memory
>>>> stats. I hope I will get some useful information out of this. I have also
>>>> looked into Guppy library <https://pypi.org/project/guppy/> and they
>>>> are almost the same.
>>>>
>>>> Do you guys have any recommendation for debugging this issue? Do we
>>>> have any tooling in the SDK that can help to debug it?
>>>> Please feel free to share your experience if you have debugged similar
>>>> issues in past.
>>>>
>>>> Thank you,
>>>> Rakesh
>>>>
>>>
>>>
>>> --
>>> ================
>>> Ruoyun  Huang
>>>
>>>
>
> --
> ================
> Ruoyun  Huang
>
>

Re: Need help regarding memory leak issue

Posted by Ruoyun Huang <ru...@google.com>.
Even though the algorithm works on your batch system, did you verify
anything that can rule out the possibility that it is the underlying ML
package causing the memory leak?

If not, maybe replace your prediction with a dummy function that does not
load any model at all and always gives the same prediction. Then do the
same plotting and let us see what it looks like. As a follow-up, version
two: still a dummy prediction, but with the model loaded. Given we don't
have much of a clue at this stage, this can at least give us more
confidence about whether the issue comes from the underlying ML package or
from the Beam SDK. Just my 2 cents.
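
Something along these lines, just to sketch what I mean (DummyPredictDoFn,
the constant 0.5 output, and the model-loading comment are all
placeholders):

import apache_beam as beam


class DummyPredictDoFn(beam.DoFn):
    """Version one: no model loaded at all, always the same prediction."""

    def process(self, element):
        yield 0.5  # constant prediction


class LoadedButUnusedModelDoFn(beam.DoFn):
    """Version two: load the real model, but still emit the constant."""

    def start_bundle(self):
        # Placeholder: load your real model here, e.g. with joblib or pickle.
        self._model = None

    def process(self, element):
        yield 0.5  # constant prediction, model deliberately unused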


-- 
================
Ruoyun  Huang

Re: Need help regarding memory leak issue

Posted by Rakesh Kumar <ra...@lyft.com>.
Thanks for responding, Ruoyun.

We are not sure yet what is causing the leak, but once we run out of
memory the SDK worker crashes and the pipeline is forced to restart. Check
the memory usage patterns in the attached image. Each line in that graph
represents one task manager host.
You are right, we are running the models for predictions.

Here are a few observations:

1. All of the task managers' memory usage climbs over time, but some of the
task managers' memory climbs really fast because they are running the ML
models. These models definitely use memory-intensive data structures
(pandas data frames, etc.), hence their memory usage climbs really fast.
2. We have almost the same code running on different (non-streaming)
infrastructure and it doesn't cause any memory issue.
3. Even after the pipeline has restarted, the memory is not released. It is
still held by something. You can see in the attached image that the
pipeline restarted around 13:30. At that point some portion of the memory
was definitely released, but not all of it. Notice that when the pipeline
originally started, it started at 30% memory usage, but when it got
restarted by the job manager it started at 60%.




Re: Need help regarding memory leak issue

Posted by Ruoyun Huang <ru...@google.com>.
Trying to understand the situation you are having.

By saying it 'kills the application', do you mean there is a leak in the
application itself, or are the workers the root cause? Also, are you
running ML models inside Python SDK DoFns? Then I suppose it is running
predictions rather than model training?



-- 
================
Ruoyun  Huang