Posted to dev@reef.apache.org by Andrew Chung <af...@gmail.com> on 2015/09/23 01:15:25 UTC

Memory leak in REEF .NET Driver on HDInsight

Hi,

There seems to be a memory leak somewhere in the YARN .NET Driver code
stack. Note that the leak could be in the Java, C++, or C# layer, or even
in Hadoop itself. I have written a small sample REEF application that
immediately fails an Evaluator and requests a new one in its place. The
Driver logic itself uses O(1) space, yet over a very long run (~10,000
Evaluators) the memory used grows from around 300 MB to 500+ MB. This was
first observed in ASA, where faulty logic in our Evaluator recovery code
caused an Evaluator to crash continuously. That is admittedly an extreme
scenario, but I think it's best to scan through our code base to check
that our Java and .NET collections are bounded in size and properly
cleaned up, and that every call to "new" in our C++ code is paired with a
"delete". I've also observed that the same application built against the
0.12 NuGet package grows in memory usage much faster than one built
against the current version, so we have probably already done something
right along the way :).
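
To make the "bounded collections" point concrete, here is the kind of
pattern I think we should grep for. This is a made-up Java illustration
(every name in it is invented), not code from our repository:

import java.util.HashMap;
import java.util.Map;

// Hypothetical driver-side registry illustrating the leak pattern.
final class EvaluatorRegistry {

  // Gains one entry per Evaluator ever seen; after ~10,000 failed
  // Evaluators, a map like this alone can explain steady memory growth.
  private final Map<String, String> stateByEvaluatorId = new HashMap<>();

  void onAllocated(final String evaluatorId) {
    stateByEvaluatorId.put(evaluatorId, "ALLOCATED");
  }

  void onFailed(final String evaluatorId) {
    // Leak: the failure is recorded, but the entry is never evicted.
    stateByEvaluatorId.put(evaluatorId, "FAILED");
    // Fix: stateByEvaluatorId.remove(evaluatorId) once nothing will
    // ever look this Evaluator up again.
  }
}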

Thanks,
Andrew

Re: Memory leak in REEF .NET Driver on HDInsight

Posted by Markus Weimer <ma...@weimo.de>.
Interesting finding. Can you add the code of your test to the code
base? It would be good to have a place to accumulate these
pathological, yet immensely useful REEF applications such that we have
them ready for the next debug session :-)

Maybe a good next step would be to create a Java-only version of the
same test case and see whether it has the same problem. That way, we
could halve the search space for the issue.
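
In case it helps, here is an untested sketch of such a driver against
the Java driver API. "FailingTask" is a placeholder for a task wired to
bring its own Evaluator down (e.g. by calling System.exit()); it does
not exist in our code base, and the API details are from memory:

import javax.inject.Inject;

import org.apache.reef.driver.evaluator.AllocatedEvaluator;
import org.apache.reef.driver.evaluator.EvaluatorRequest;
import org.apache.reef.driver.evaluator.EvaluatorRequestor;
import org.apache.reef.driver.evaluator.FailedEvaluator;
import org.apache.reef.tang.annotations.Unit;
import org.apache.reef.wake.EventHandler;
import org.apache.reef.wake.time.event.StartTime;

@Unit
public final class LeakProbeDriver {

  private final EvaluatorRequestor requestor;

  @Inject
  private LeakProbeDriver(final EvaluatorRequestor requestor) {
    this.requestor = requestor;
  }

  // O(1) driver state: no per-Evaluator bookkeeping, just a fresh
  // one-Evaluator request each time.
  private void requestEvaluator() {
    requestor.submit(EvaluatorRequest.newBuilder()
        .setNumber(1)
        .setMemory(64)
        .build());
  }

  public final class StartHandler implements EventHandler<StartTime> {
    @Override
    public void onNext(final StartTime startTime) {
      requestEvaluator();
    }
  }

  public final class AllocatedHandler implements EventHandler<AllocatedEvaluator> {
    @Override
    public void onNext(final AllocatedEvaluator evaluator) {
      // The placeholder task kills its own process so that the
      // Evaluator surfaces as a FailedEvaluator.
      evaluator.submitTask(FailingTask.getConfiguration());
    }
  }

  public final class FailedHandler implements EventHandler<FailedEvaluator> {
    @Override
    public void onNext(final FailedEvaluator failed) {
      // Drop the failed Evaluator on the floor and request a new one.
      // If driver memory still grows without bound, the leak is below us.
      requestEvaluator();
    }
  }
}

The handlers would be bound through the usual DriverConfiguration
(ON_DRIVER_STARTED, ON_EVALUATOR_ALLOCATED, ON_EVALUATOR_FAILED).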

Markus

