Posted to yarn-dev@hadoop.apache.org by Chris Nauroth <cn...@hortonworks.com> on 2016/02/04 19:20:21 UTC

Re: ProcFsBasedProcessTree and clean pages in smaps

Hello Jan,

I am moving this thread from user@hadoop.apache.org to
yarn-dev@hadoop.apache.org, since it's less a question of general usage
and more a question of internal code implementation details and possible
enhancements.

I think the issue is that it's not guaranteed in the general case that
Private_Clean pages are easily evictable from page cache by the kernel.
For example, if the pages have been pinned into RAM by calling mlock [1],
then the kernel cannot evict them.  Since YARN can execute any code
submitted by an application, including possibly code that calls mlock, it
takes a cautious approach and assumes that these pages must be counted
towards the process footprint.  Although your Spark use case won't mlock
the pages (I assume), YARN doesn't have a way to identify this.

Perhaps there is room for improvement here.  If there is a reliable way to
count only mlock'ed pages, then perhaps that behavior could be added as
another option in ProcfsBasedProcessTree.  Off the top of my head, I can't
think of a reliable way to do this, and I can't research it further
immediately.  Do others on the thread have ideas?

--Chris Nauroth

[1] http://linux.die.net/man/2/mlock




On 2/4/16, 5:11 AM, "Jan Lukavský" <ja...@firma.seznam.cz> wrote:

>Hello,
>
>I have a question about the way LinuxResourceCalculatorPlugin calculates
>the memory consumed by a process tree (it is calculated via the
>ProcfsBasedProcessTree class). When we enable (disk) caching in Apache
>Spark jobs running on a YARN cluster, the NodeManager starts to kill the
>containers while they read the cached data, because of "Container is
>running beyond memory limits ...". The reason is that even if we enable
>parsing of the smaps file
>(yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled),
>ProcfsBasedProcessTree counts mmapped read-only pages as consumed
>by the process tree, while Spark uses FileChannel.map(MapMode.READ_ONLY)
>to read the cached data. The JVM then consumes *a lot* more memory than
>the configured heap size (and this cannot really be controlled), but in
>my opinion this memory is not really consumed by the process; the kernel
>can reclaim these pages if needed. My question is - is there any explicit
>reason why "Private_Clean" pages are counted as consumed by the process
>tree? I patched ProcfsBasedProcessTree not to count them, but I don't
>know whether this is the "correct" solution.
>
>Thanks for opinions,
>  cheers,
>  Jan
>
>
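As an illustration of the access pattern described above, the following minimal Java sketch maps a file read-only and reads it through the mapping. It is not Spark's code; the path /tmp/file.bin is just a stand-in for a large cached data file (assumed smaller than 2 GB so a single mapping suffices). Every page touched through the buffer is file-backed and never written, so it ends up in the process RSS as Private_Clean while staying entirely outside the configured Java heap.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedReadExample {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("/tmp/file.bin"),
                                               StandardOpenOption.READ)) {
            // Read-only, file-backed mapping, the same mode used to read cached blocks.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get();   // touching the pages faults them into RSS
            }
            // The mapping now shows up in /proc/<pid>/smaps with a large
            // Private_Clean value, even though heap usage is unchanged.
            System.out.println("bytes mapped: " + buf.capacity() + ", checksum: " + sum);
        }
    }
}

Watching /proc/<pid>/smaps while this runs shows the mapping's Rss and Private_Clean grow as the buffer is read, which is exactly what the smaps-based monitor charges against the container limit.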


Re: ProcFsBasedProcessTree and clean pages in smaps

Posted by Chris Nauroth <cn...@hortonworks.com>.
Thank you for the follow-up, Jan.  I'll join the discussion on YARN-4681.

--Chris Nauroth




On 2/9/16, 3:22 AM, "Jan Lukavský" <ja...@firma.seznam.cz> wrote:

>Hi Chris and Varun,
>
>thanks for your suggestions. I played around with cgroups, and although
>they kind of resolve the memory issues, I think they don't fit our
>needs because of the other restrictions enforced on the container
>(mainly the CPU restrictions). I created
>https://issues.apache.org/jira/browse/YARN-4681 and submitted a very
>simplistic version of a patch.
>
>Thanks for comments,
>  Jan
>


Re: ProcFsBasedProcessTree and clean pages in smaps

Posted by Jan Lukavský <ja...@firma.seznam.cz>.
Hi Chris and Varun,

thanks for your suggestions. I played around with cgroups, and although
they kind of resolve the memory issues, I think they don't fit our
needs because of the other restrictions enforced on the container
(mainly the CPU restrictions). I created
https://issues.apache.org/jira/browse/YARN-4681 and submitted a very
simplistic version of a patch.

Thanks for comments,
  Jan

On 02/05/2016 06:10 PM, Chris Nauroth wrote:
> Interesting, I didn't know about "Locked" in smaps.  Thanks for pointing
> that out.
>
> At this point, if Varun's suggestion to check out YARN-1856 doesn't solve
> the problem, then I suggest opening a JIRA to track further design
> discussion.
>
> --Chris Nauroth
>
>
>
>


-- 

Jan Lukavský
Development Team Lead
Seznam.cz, a.s.
Radlická 3294/10
15000, Praha 5

jan.lukavsky@firma.seznam.cz
http://www.seznam.cz


Re: ProcFsBasedProcessTree and clean pages in smaps

Posted by Chris Nauroth <cn...@hortonworks.com>.
Interesting, I didn't know about "Locked" in smaps.  Thanks for pointing
that out.

At this point, if Varun's suggestion to check out YARN-1856 doesn't solve
the problem, then I suggest opening a JIRA to track further design
discussion.

--Chris Nauroth




On 2/5/16, 6:10 AM, "Varun Vasudev" <vv...@apache.org> wrote:

>Hi Jan,
>
>YARN-1856 was recently committed, which allows admins to use cgroups
>instead of the ProcfsBasedProcessTree monitoring. Would that solve your
>problem? However, that requires usage of the LinuxContainerExecutor.
>
>-Varun
>
>
>


Re: ProcFsBasedProcessTree and clean pages in smaps

Posted by Varun Vasudev <vv...@apache.org>.
Hi Jan,

YARN-1856 was recently committed, which allows admins to use cgroups instead of the ProcfsBasedProcessTree monitoring. Would that solve your problem? However, that requires usage of the LinuxContainerExecutor.

-Varun
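
For anyone who wants to try the cgroups route, it is driven from yarn-site.xml. The snippet below is only a sketch of the long-standing LinuxContainerExecutor/cgroups properties (the group name "hadoop" and the hierarchy path are placeholders for your site's values); the exact knobs that YARN-1856 adds for cgroups-based monitoring should be checked against the release it ships in.

<!-- Sketch only: verify property names and values against your Hadoop release. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <!-- Placeholder: the group configured in container-executor.cfg. -->
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.mount</name>
  <value>false</value>
</property>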



On 2/5/16, 6:45 PM, "Jan Lukavský" <ja...@firma.seznam.cz> wrote:

>Hi Chris,
>
>thanks for your reply. As far as I can see, newer Linux kernels show
>the locked memory in the "Locked" field.
>
>If I mmap a file and mlock it, I see the following in the 'smaps' file:
>
>7efd20aeb000-7efd2172b000 r--p 00000000 103:04 1870                      
>/tmp/file.bin
>Size:              12544 kB
>Rss:               12544 kB
>Pss:               12544 kB
>Shared_Clean:          0 kB
>Shared_Dirty:          0 kB
>Private_Clean:     12544 kB
>Private_Dirty:         0 kB
>Referenced:        12544 kB
>Anonymous:             0 kB
>AnonHugePages:         0 kB
>Swap:                  0 kB
>KernelPageSize:        4 kB
>MMUPageSize:           4 kB
>Locked:            12544 kB
>
>...
># uname -a
>Linux XXXXXX 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux
>
>If I do this on an older kernel (2.6.x), the Locked field is missing.
>
>I can make a patch for ProcfsBasedProcessTree that counts the "Locked"
>pages instead of the "Private_Clean" pages (based on a configuration
>option). The question is - should even more changes be made in the way
>the memory footprint is calculated? For instance, I believe the kernel
>can also write dirty pages to disk (if they are backed by a file),
>making them clean so that it can later free them.
>Should I open a JIRA to have some discussion on this topic?
>
>Regards,
>  Jan
>
>


Re: ProcFsBasedProcessTree and clean pages in smaps

Posted by Jan Lukavský <ja...@firma.seznam.cz>.
Hi Chris,

thanks for your reply. As far as I can see, newer Linux kernels show
the locked memory in the "Locked" field.

If I mmap a file and mlock it, I see the following in the 'smaps' file:

7efd20aeb000-7efd2172b000 r--p 00000000 103:04 1870                      
/tmp/file.bin
Size:              12544 kB
Rss:               12544 kB
Pss:               12544 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:     12544 kB
Private_Dirty:         0 kB
Referenced:        12544 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:            12544 kB

...
# uname -a
Linux XXXXXX 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux

If I do this on an older kernel (2.6.x), the Locked field is missing.

I can make a patch for ProcfsBasedProcessTree that counts the "Locked"
pages instead of the "Private_Clean" pages (based on a configuration
option). The question is - should even more changes be made in the way
the memory footprint is calculated? For instance, I believe the kernel
can also write dirty pages to disk (if they are backed by a file),
making them clean so that it can later free them.
Should I open a JIRA to have some discussion on this topic?

Regards,
  Jan
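
To make the proposed accounting concrete, here is a standalone Java sketch (not the YARN-4681 patch and not ProcfsBasedProcessTree's actual parsing code) that walks /proc/<pid>/smaps and totals Private_Clean versus Locked across all mappings - the difference between what is charged today and what a Locked-based option would charge.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SmapsLockedTotals {
    public static void main(String[] args) throws IOException {
        String pid = args.length > 0 ? args[0] : "self";
        long privateCleanKb = 0;
        long lockedKb = 0;
        // Every mapping in smaps is followed by "Key:   <value> kB" lines;
        // sum the two fields of interest across all mappings.
        for (String line : Files.readAllLines(Paths.get("/proc/" + pid + "/smaps"))) {
            if (line.startsWith("Private_Clean:")) {
                privateCleanKb += kbValue(line);
            } else if (line.startsWith("Locked:")) {
                // Absent on older (2.6.x) kernels, so this total stays at zero there.
                lockedKb += kbValue(line);
            }
        }
        System.out.println("Private_Clean total: " + privateCleanKb + " kB");
        System.out.println("Locked total:        " + lockedKb + " kB");
    }

    // "Private_Clean:     12544 kB" -> 12544
    private static long kbValue(String line) {
        String[] parts = line.trim().split("\\s+");
        return Long.parseLong(parts[1]);
    }
}

On a kernel without the Locked field the second total simply stays at zero, which matches the observation above and is one reason the change would have to sit behind a configuration option.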


On 02/04/2016 07:20 PM, Chris Nauroth wrote:
> Hello Jan,
>
> I am moving this thread from user@hadoop.apache.org to
> yarn-dev@hadoop.apache.org, since it's less a question of general usage
> and more a question of internal code implementation details and possible
> enhancements.
>
> I think the issue is that it's not guaranteed in the general case that
> Private_Clean pages are easily evictable from page cache by the kernel.
> For example, if the pages have been pinned into RAM by calling mlock [1],
> then the kernel cannot evict them.  Since YARN can execute any code
> submitted by an application, including possibly code that calls mlock, it
> takes a cautious approach and assumes that these pages must be counted
> towards the process footprint.  Although your Spark use case won't mlock
> the pages (I assume), YARN doesn't have a way to identify this.
>
> Perhaps there is room for improvement here.  If there is a reliable way to
> count only mlock'ed pages, then perhaps that behavior could be added as
> another option in ProcfsBasedProcessTree.  Off the top of my head, I can't
> think of a reliable way to do this, and I can't research it further
> immediately.  Do others on the thread have ideas?
>
> --Chris Nauroth
>
> [1] http://linux.die.net/man/2/mlock
>