Posted to common-user@hadoop.apache.org by Jan Lukavský <ja...@firma.seznam.cz> on 2016/02/04 14:11:34 UTC

ProcFsBasedProcessTree and clean pages in smaps

Hello,

I have a question about the way LinuxResourceCalculatorPlugin calculates 
the memory consumed by a process tree (it is calculated via the 
ProcfsBasedProcessTree class). When we enable disk caching in Apache 
Spark jobs running on a YARN cluster, the node manager starts to kill 
the containers while they read the cached data, because of "Container is 
running beyond memory limits ...". The reason is that even if we enable 
parsing of the smaps file 
(yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled), 
ProcfsBasedProcessTree counts mmapped read-only pages as consumed by the 
process tree, while Spark uses FileChannel.map(MapMode.READ_ONLY) to 
read the cached data. The JVM then consumes *a lot* more memory than the 
configured heap size (and this cannot really be controlled), but this 
memory is IMO not really consumed by the process; the kernel can reclaim 
these pages if needed. My question is: is there any explicit reason why 
"Private_Clean" pages are counted as consumed by the process tree? I 
patched ProcfsBasedProcessTree not to count them, but I don't know 
whether this is the "correct" solution.
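
For reference, the mapping pattern I mean is roughly the following (a 
minimal standalone sketch, not Spark's actual code; the class name and 
default path are made up). Reading the mapped buffer faults the 
file-backed pages in, which is what inflates the JVM's RSS beyond the 
heap size:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ReadOnlyMmapExample {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "/tmp/file.bin");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map the file read-only; the data lives outside the Java heap.
            long size = Math.min(channel.size(), Integer.MAX_VALUE);
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long sum = 0;
            // Touching the buffer pages the file into the page cache; the JVM's
            // RSS grows by the mapped size even though the kernel could reclaim
            // these clean, file-backed pages under memory pressure.
            for (int i = 0; i < buf.limit(); i++) {
                sum += buf.get(i);
            }
            System.out.println("checksum: " + sum);
        }
    }
}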

Thanks for opinions,
  cheers,
  Jan


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
For additional commands, e-mail: user-help@hadoop.apache.org


Re: ProcFsBasedProcessTree and clean pages in smaps

Posted by Chris Nauroth <cn...@hortonworks.com>.
Thank you for the follow-up, Jan.  I'll join the discussion on YARN-4681.

--Chris Nauroth




On 2/9/16, 3:22 AM, "Jan Lukavský" <ja...@firma.seznam.cz> wrote:

>Hi Chris and Varun,
>
>thanks for you suggestions. I played around with the cgroups, and I
>think that although it kind of resolves memory issues, I think it
>doesn't fit our needs, because of other restrictions enforced on the
>container (mainly the CPU restrictions). I created
>https://issues.apache.org/jira/browse/YARN-4681 and submitted a very
>simplistic version of the patch.
>
>Thanks for comments,
>  Jan
>
>On 02/05/2016 06:10 PM, Chris Nauroth wrote:
>> Interesting, I didn't know about "Locked" in smaps.  Thanks for pointing
>> that out.
>>
>> At this point, if Varun's suggestion to check out YARN-1856 doesn't
>>solve
>> the problem, then I suggest opening a JIRA to track further design
>> discussion.
>>
>> --Chris Nauroth
>>
>>
>>
>>
>> On 2/5/16, 6:10 AM, "Varun Vasudev" <vv...@apache.org> wrote:
>>
>>> Hi Jan,
>>>
>>> YARN-1856 was recently committed which allows admins to use cgroups
>>> instead the ProcFsBasedProcessTree monitory. Would that solve your
>>> problem? However, that requires usage of the LinuxContainerExecutor.
>>>
>>> -Varun
>>>
>>>
>>>
>>> On 2/5/16, 6:45 PM, "Jan Lukavský" <ja...@firma.seznam.cz>
>>>wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> thanks for your reply. As far as I can see right, new linux kernels
>>>>show
>>>> the locked memory in "Locked" field.
>>>>
>>>> If mmap file a mlock it, I see the following in 'smaps' file:
>>>>
>>>> 7efd20aeb000-7efd2172b000 r--p 00000000 103:04 1870
>>>> /tmp/file.bin
>>>> Size:              12544 kB
>>>> Rss:               12544 kB
>>>> Pss:               12544 kB
>>>> Shared_Clean:          0 kB
>>>> Shared_Dirty:          0 kB
>>>> Private_Clean:     12544 kB
>>>> Private_Dirty:         0 kB
>>>> Referenced:        12544 kB
>>>> Anonymous:             0 kB
>>>> AnonHugePages:         0 kB
>>>> Swap:                  0 kB
>>>> KernelPageSize:        4 kB
>>>> MMUPageSize:           4 kB
>>>> Locked:            12544 kB
>>>>
>>>> ...
>>>> # uname -a
>>>> Linux XXXXXX 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64
>>>>GNU/Linux
>>>>
>>>> If I do this on an older kernel (2.6.x), the Locked field is missing.
>>>>
>>>> I can make a patch for the ProcfsBasedProcessTree that will calculate
>>>> the "Locked" pages instead of the "Private_Clean" (based on
>>>> configuration option). The question is - should there be made even
>>>>more
>>>> changes in the way the memory footprint is calculated? For instance, I
>>>> believe the kernel can write to disk even all dirty pages (if they are
>>>> backed by a file), making them clean and therefore can later free
>>>>them.
>>>> Should I open a JIRA for this to have some discussion on this topic?
>>>>
>>>> Regards,
>>>>   Jan
>>>>
>>>>
>>>> On 02/04/2016 07:20 PM, Chris Nauroth wrote:
>>>>> Hello Jan,
>>>>>
>>>>> I am moving this thread from user@hadoop.apache.org to
>>>>> yarn-dev@hadoop.apache.org, since it's less a question of general
>>>>>usage
>>>>> and more a question of internal code implementation details and
>>>>> possible
>>>>> enhancements.
>>>>>
>>>>> I think the issue is that it's not guaranteed in the general case
>>>>>that
>>>>> Private_Clean pages are easily evictable from page cache by the
>>>>>kernel.
>>>>> For example, if the pages have been pinned into RAM by calling mlock
>>>>> [1],
>>>>> then the kernel cannot evict them.  Since YARN can execute any code
>>>>> submitted by an application, including possibly code that calls
>>>>>mlock,
>>>>> it
>>>>> takes a cautious approach and assumes that these pages must be
>>>>>counted
>>>>> towards the process footprint.  Although your Spark use case won't
>>>>> mlock
>>>>> the pages (I assume), YARN doesn't have a way to identify this.
>>>>>
>>>>> Perhaps there is room for improvement here.  If there is a reliable
>>>>> way to
>>>>> count only mlock'ed pages, then perhaps that behavior could be added
>>>>>as
>>>>> another option in ProcfsBasedProcessTree.  Off the top of my head, I
>>>>> can't
>>>>> think of a reliable way to do this, and I can't research it further
>>>>> immediately.  Do others on the thread have ideas?
>>>>>
>>>>> --Chris Nauroth
>>>>>
>>>>> [1] http://linux.die.net/man/2/mlock
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2/4/16, 5:11 AM, "Jan Lukavský" <ja...@firma.seznam.cz>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a question about the way LinuxResourceCalculatorPlugin
>>>>>> calculates
>>>>>> memory consumed by process tree (it is calculated via
>>>>>> ProcfsBasedProcessTree class). When we enable caching (disk) in
>>>>>>apache
>>>>>> spark jobs run on YARN cluster, the node manager starts to kill the
>>>>>> containers while reading the cached data, because of "Container is
>>>>>> running beyond memory limits ...". The reason is that even if we
>>>>>> enable
>>>>>> parsing of the smaps file
>>>>>> (yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled)
>>>>>> the ProcfsBasedProcessTree calculates mmaped read-only pages as
>>>>>> consumed
>>>>>> by the process tree, while spark uses
>>>>>> FileChannel.map(MapMode.READ_ONLY)
>>>>>> to read the cached data. The JVM then consumes *a lot* more memory
>>>>>> than
>>>>>> the configured heap size (and it cannot be really controlled), but
>>>>>> this
>>>>>> memory is IMO not really consumed by the process, the kernel can
>>>>>> reclaim
>>>>>> these pages, if needed. My question is - is there any explicit
>>>>>>reason
>>>>>> why "Private_Clean" pages are calculated as consumed by process
>>>>>>tree?
>>>>>> I
>>>>>> patched the ProcfsBasedProcessTree not to calculate them, but I
>>>>>>don't
>>>>>> know if this is the "correct" solution.
>>>>>>
>>>>>> Thanks for opinions,
>>>>>>    cheers,
>>>>>>    Jan
>>>>>>
>>>>>>
>>>>>> 
>>>>>>---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
>>>>>> For additional commands, e-mail: user-help@hadoop.apache.org
>>>>>>
>>>>>>
>>>
>
>
>-- 
>
>Jan Lukavský
>Development Team Lead
>Seznam.cz, a.s.
>Radlická 3294/10
>15000, Praha 5
>
>jan.lukavsky@firma.seznam.cz
>http://www.seznam.cz
>
>


Re: ProcFsBasedProcessTree and clean pages in smaps

Posted by Jan Lukavský <ja...@firma.seznam.cz>.
Hi Chris and Varun,

thanks for your suggestions. I played around with cgroups, and although 
they kind of resolve the memory issue, I think they don't fit our needs 
because of the other restrictions enforced on the container (mainly the 
CPU restrictions). I created 
https://issues.apache.org/jira/browse/YARN-4681 and submitted a very 
simplistic version of the patch.

Thanks for comments,
  Jan

On 02/05/2016 06:10 PM, Chris Nauroth wrote:
> Interesting, I didn't know about "Locked" in smaps.  Thanks for pointing
> that out.
>
> At this point, if Varun's suggestion to check out YARN-1856 doesn't solve
> the problem, then I suggest opening a JIRA to track further design
> discussion.
>
> --Chris Nauroth
>
>
>
>
> On 2/5/16, 6:10 AM, "Varun Vasudev" <vv...@apache.org> wrote:
>
>> Hi Jan,
>>
>> YARN-1856 was recently committed which allows admins to use cgroups
>> instead the ProcFsBasedProcessTree monitory. Would that solve your
>> problem? However, that requires usage of the LinuxContainerExecutor.
>>
>> -Varun
>>
>>
>>
>> On 2/5/16, 6:45 PM, "Jan Lukavský" <ja...@firma.seznam.cz> wrote:
>>
>>> Hi Chris,
>>>
>>> thanks for your reply. As far as I can see right, new linux kernels show
>>> the locked memory in "Locked" field.
>>>
>>> If mmap file a mlock it, I see the following in 'smaps' file:
>>>
>>> 7efd20aeb000-7efd2172b000 r--p 00000000 103:04 1870
>>> /tmp/file.bin
>>> Size:              12544 kB
>>> Rss:               12544 kB
>>> Pss:               12544 kB
>>> Shared_Clean:          0 kB
>>> Shared_Dirty:          0 kB
>>> Private_Clean:     12544 kB
>>> Private_Dirty:         0 kB
>>> Referenced:        12544 kB
>>> Anonymous:             0 kB
>>> AnonHugePages:         0 kB
>>> Swap:                  0 kB
>>> KernelPageSize:        4 kB
>>> MMUPageSize:           4 kB
>>> Locked:            12544 kB
>>>
>>> ...
>>> # uname -a
>>> Linux XXXXXX 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux
>>>
>>> If I do this on an older kernel (2.6.x), the Locked field is missing.
>>>
>>> I can make a patch for the ProcfsBasedProcessTree that will calculate
>>> the "Locked" pages instead of the "Private_Clean" (based on
>>> configuration option). The question is - should there be made even more
>>> changes in the way the memory footprint is calculated? For instance, I
>>> believe the kernel can write to disk even all dirty pages (if they are
>>> backed by a file), making them clean and therefore can later free them.
>>> Should I open a JIRA for this to have some discussion on this topic?
>>>
>>> Regards,
>>>   Jan
>>>
>>>
>>> On 02/04/2016 07:20 PM, Chris Nauroth wrote:
>>>> Hello Jan,
>>>>
>>>> I am moving this thread from user@hadoop.apache.org to
>>>> yarn-dev@hadoop.apache.org, since it's less a question of general usage
>>>> and more a question of internal code implementation details and
>>>> possible
>>>> enhancements.
>>>>
>>>> I think the issue is that it's not guaranteed in the general case that
>>>> Private_Clean pages are easily evictable from page cache by the kernel.
>>>> For example, if the pages have been pinned into RAM by calling mlock
>>>> [1],
>>>> then the kernel cannot evict them.  Since YARN can execute any code
>>>> submitted by an application, including possibly code that calls mlock,
>>>> it
>>>> takes a cautious approach and assumes that these pages must be counted
>>>> towards the process footprint.  Although your Spark use case won't
>>>> mlock
>>>> the pages (I assume), YARN doesn't have a way to identify this.
>>>>
>>>> Perhaps there is room for improvement here.  If there is a reliable
>>>> way to
>>>> count only mlock'ed pages, then perhaps that behavior could be added as
>>>> another option in ProcfsBasedProcessTree.  Off the top of my head, I
>>>> can't
>>>> think of a reliable way to do this, and I can't research it further
>>>> immediately.  Do others on the thread have ideas?
>>>>
>>>> --Chris Nauroth
>>>>
>>>> [1] http://linux.die.net/man/2/mlock
>>>>
>>>>
>>>>
>>>>
>>>> On 2/4/16, 5:11 AM, "Jan Lukavský" <ja...@firma.seznam.cz>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a question about the way LinuxResourceCalculatorPlugin
>>>>> calculates
>>>>> memory consumed by process tree (it is calculated via
>>>>> ProcfsBasedProcessTree class). When we enable caching (disk) in apache
>>>>> spark jobs run on YARN cluster, the node manager starts to kill the
>>>>> containers while reading the cached data, because of "Container is
>>>>> running beyond memory limits ...". The reason is that even if we
>>>>> enable
>>>>> parsing of the smaps file
>>>>> (yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled)
>>>>> the ProcfsBasedProcessTree calculates mmaped read-only pages as
>>>>> consumed
>>>>> by the process tree, while spark uses
>>>>> FileChannel.map(MapMode.READ_ONLY)
>>>>> to read the cached data. The JVM then consumes *a lot* more memory
>>>>> than
>>>>> the configured heap size (and it cannot be really controlled), but
>>>>> this
>>>>> memory is IMO not really consumed by the process, the kernel can
>>>>> reclaim
>>>>> these pages, if needed. My question is - is there any explicit reason
>>>>> why "Private_Clean" pages are calculated as consumed by process tree?
>>>>> I
>>>>> patched the ProcfsBasedProcessTree not to calculate them, but I don't
>>>>> know if this is the "correct" solution.
>>>>>
>>>>> Thanks for opinions,
>>>>>    cheers,
>>>>>    Jan
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
>>>>> For additional commands, e-mail: user-help@hadoop.apache.org
>>>>>
>>>>>
>>


-- 

Jan Lukavský
Development Team Lead
Seznam.cz, a.s.
Radlická 3294/10
15000, Praha 5

jan.lukavsky@firma.seznam.cz
http://www.seznam.cz


Re: ProcFsBasedProcessTree and clean pages in smaps

Posted by Chris Nauroth <cn...@hortonworks.com>.
Interesting, I didn't know about "Locked" in smaps.  Thanks for pointing
that out.

At this point, if Varun's suggestion to check out YARN-1856 doesn't solve
the problem, then I suggest opening a JIRA to track further design
discussion.

--Chris Nauroth




On 2/5/16, 6:10 AM, "Varun Vasudev" <vv...@apache.org> wrote:

>Hi Jan,
>
>YARN-1856 was recently committed which allows admins to use cgroups
>instead the ProcFsBasedProcessTree monitory. Would that solve your
>problem? However, that requires usage of the LinuxContainerExecutor.
>
>-Varun
>
>
>
>On 2/5/16, 6:45 PM, "Jan Lukavský" <ja...@firma.seznam.cz> wrote:
>
>>Hi Chris,
>>
>>thanks for your reply. As far as I can see right, new linux kernels show
>>the locked memory in "Locked" field.
>>
>>If mmap file a mlock it, I see the following in 'smaps' file:
>>
>>7efd20aeb000-7efd2172b000 r--p 00000000 103:04 1870
>>/tmp/file.bin
>>Size:              12544 kB
>>Rss:               12544 kB
>>Pss:               12544 kB
>>Shared_Clean:          0 kB
>>Shared_Dirty:          0 kB
>>Private_Clean:     12544 kB
>>Private_Dirty:         0 kB
>>Referenced:        12544 kB
>>Anonymous:             0 kB
>>AnonHugePages:         0 kB
>>Swap:                  0 kB
>>KernelPageSize:        4 kB
>>MMUPageSize:           4 kB
>>Locked:            12544 kB
>>
>>...
>># uname -a
>>Linux XXXXXX 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux
>>
>>If I do this on an older kernel (2.6.x), the Locked field is missing.
>>
>>I can make a patch for the ProcfsBasedProcessTree that will calculate
>>the "Locked" pages instead of the "Private_Clean" (based on
>>configuration option). The question is - should there be made even more
>>changes in the way the memory footprint is calculated? For instance, I
>>believe the kernel can write to disk even all dirty pages (if they are
>>backed by a file), making them clean and therefore can later free them.
>>Should I open a JIRA for this to have some discussion on this topic?
>>
>>Regards,
>>  Jan
>>
>>
>>On 02/04/2016 07:20 PM, Chris Nauroth wrote:
>>> Hello Jan,
>>>
>>> I am moving this thread from user@hadoop.apache.org to
>>> yarn-dev@hadoop.apache.org, since it's less a question of general usage
>>> and more a question of internal code implementation details and
>>>possible
>>> enhancements.
>>>
>>> I think the issue is that it's not guaranteed in the general case that
>>> Private_Clean pages are easily evictable from page cache by the kernel.
>>> For example, if the pages have been pinned into RAM by calling mlock
>>>[1],
>>> then the kernel cannot evict them.  Since YARN can execute any code
>>> submitted by an application, including possibly code that calls mlock,
>>>it
>>> takes a cautious approach and assumes that these pages must be counted
>>> towards the process footprint.  Although your Spark use case won't
>>>mlock
>>> the pages (I assume), YARN doesn't have a way to identify this.
>>>
>>> Perhaps there is room for improvement here.  If there is a reliable
>>>way to
>>> count only mlock'ed pages, then perhaps that behavior could be added as
>>> another option in ProcfsBasedProcessTree.  Off the top of my head, I
>>>can't
>>> think of a reliable way to do this, and I can't research it further
>>> immediately.  Do others on the thread have ideas?
>>>
>>> --Chris Nauroth
>>>
>>> [1] http://linux.die.net/man/2/mlock
>>>
>>>
>>>
>>>
>>> On 2/4/16, 5:11 AM, "Jan Lukavský" <ja...@firma.seznam.cz>
>>>wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a question about the way LinuxResourceCalculatorPlugin
>>>>calculates
>>>> memory consumed by process tree (it is calculated via
>>>> ProcfsBasedProcessTree class). When we enable caching (disk) in apache
>>>> spark jobs run on YARN cluster, the node manager starts to kill the
>>>> containers while reading the cached data, because of "Container is
>>>> running beyond memory limits ...". The reason is that even if we
>>>>enable
>>>> parsing of the smaps file
>>>> (yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled)
>>>> the ProcfsBasedProcessTree calculates mmaped read-only pages as
>>>>consumed
>>>> by the process tree, while spark uses
>>>>FileChannel.map(MapMode.READ_ONLY)
>>>> to read the cached data. The JVM then consumes *a lot* more memory
>>>>than
>>>> the configured heap size (and it cannot be really controlled), but
>>>>this
>>>> memory is IMO not really consumed by the process, the kernel can
>>>>reclaim
>>>> these pages, if needed. My question is - is there any explicit reason
>>>> why "Private_Clean" pages are calculated as consumed by process tree?
>>>>I
>>>> patched the ProcfsBasedProcessTree not to calculate them, but I don't
>>>> know if this is the "correct" solution.
>>>>
>>>> Thanks for opinions,
>>>>   cheers,
>>>>   Jan
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
>>>> For additional commands, e-mail: user-help@hadoop.apache.org
>>>>
>>>>
>>
>
>


Re: ProcFsBasedProcessTree and clean pages in smaps

Posted by Varun Vasudev <vv...@apache.org>.
Hi Jan,

YARN-1856 was recently committed, which allows admins to use cgroups instead of the ProcfsBasedProcessTree monitoring. Would that solve your problem? Note, however, that it requires use of the LinuxContainerExecutor.

-Varun



On 2/5/16, 6:45 PM, "Jan Lukavský" <ja...@firma.seznam.cz> wrote:

>Hi Chris,
>
>thanks for your reply. As far as I can see right, new linux kernels show 
>the locked memory in "Locked" field.
>
>If mmap file a mlock it, I see the following in 'smaps' file:
>
>7efd20aeb000-7efd2172b000 r--p 00000000 103:04 1870                      
>/tmp/file.bin
>Size:              12544 kB
>Rss:               12544 kB
>Pss:               12544 kB
>Shared_Clean:          0 kB
>Shared_Dirty:          0 kB
>Private_Clean:     12544 kB
>Private_Dirty:         0 kB
>Referenced:        12544 kB
>Anonymous:             0 kB
>AnonHugePages:         0 kB
>Swap:                  0 kB
>KernelPageSize:        4 kB
>MMUPageSize:           4 kB
>Locked:            12544 kB
>
>...
># uname -a
>Linux XXXXXX 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux
>
>If I do this on an older kernel (2.6.x), the Locked field is missing.
>
>I can make a patch for the ProcfsBasedProcessTree that will calculate 
>the "Locked" pages instead of the "Private_Clean" (based on 
>configuration option). The question is - should there be made even more 
>changes in the way the memory footprint is calculated? For instance, I 
>believe the kernel can write to disk even all dirty pages (if they are 
>backed by a file), making them clean and therefore can later free them. 
>Should I open a JIRA for this to have some discussion on this topic?
>
>Regards,
>  Jan
>
>
>On 02/04/2016 07:20 PM, Chris Nauroth wrote:
>> Hello Jan,
>>
>> I am moving this thread from user@hadoop.apache.org to
>> yarn-dev@hadoop.apache.org, since it's less a question of general usage
>> and more a question of internal code implementation details and possible
>> enhancements.
>>
>> I think the issue is that it's not guaranteed in the general case that
>> Private_Clean pages are easily evictable from page cache by the kernel.
>> For example, if the pages have been pinned into RAM by calling mlock [1],
>> then the kernel cannot evict them.  Since YARN can execute any code
>> submitted by an application, including possibly code that calls mlock, it
>> takes a cautious approach and assumes that these pages must be counted
>> towards the process footprint.  Although your Spark use case won't mlock
>> the pages (I assume), YARN doesn't have a way to identify this.
>>
>> Perhaps there is room for improvement here.  If there is a reliable way to
>> count only mlock'ed pages, then perhaps that behavior could be added as
>> another option in ProcfsBasedProcessTree.  Off the top of my head, I can't
>> think of a reliable way to do this, and I can't research it further
>> immediately.  Do others on the thread have ideas?
>>
>> --Chris Nauroth
>>
>> [1] http://linux.die.net/man/2/mlock
>>
>>
>>
>>
>> On 2/4/16, 5:11 AM, "Jan Lukavský" <ja...@firma.seznam.cz> wrote:
>>
>>> Hello,
>>>
>>> I have a question about the way LinuxResourceCalculatorPlugin calculates
>>> memory consumed by process tree (it is calculated via
>>> ProcfsBasedProcessTree class). When we enable caching (disk) in apache
>>> spark jobs run on YARN cluster, the node manager starts to kill the
>>> containers while reading the cached data, because of "Container is
>>> running beyond memory limits ...". The reason is that even if we enable
>>> parsing of the smaps file
>>> (yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled)
>>> the ProcfsBasedProcessTree calculates mmaped read-only pages as consumed
>>> by the process tree, while spark uses FileChannel.map(MapMode.READ_ONLY)
>>> to read the cached data. The JVM then consumes *a lot* more memory than
>>> the configured heap size (and it cannot be really controlled), but this
>>> memory is IMO not really consumed by the process, the kernel can reclaim
>>> these pages, if needed. My question is - is there any explicit reason
>>> why "Private_Clean" pages are calculated as consumed by process tree? I
>>> patched the ProcfsBasedProcessTree not to calculate them, but I don't
>>> know if this is the "correct" solution.
>>>
>>> Thanks for opinions,
>>>   cheers,
>>>   Jan
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
>>> For additional commands, e-mail: user-help@hadoop.apache.org
>>>
>>>
>


Re: ProcFsBasedProcessTree and clean pages in smaps

Posted by Jan Lukavský <ja...@firma.seznam.cz>.
Hi Chris,

thanks for your reply. As far as I can see, new Linux kernels show 
the locked memory in the "Locked" field.

If I mmap a file and mlock it, I see the following in the 'smaps' file:

7efd20aeb000-7efd2172b000 r--p 00000000 103:04 1870                      
/tmp/file.bin
Size:              12544 kB
Rss:               12544 kB
Pss:               12544 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:     12544 kB
Private_Dirty:         0 kB
Referenced:        12544 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:            12544 kB

...
# uname -a
Linux XXXXXX 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux

If I do this on an older kernel (2.6.x), the Locked field is missing.

I can make a patch for ProcfsBasedProcessTree that counts the 
"Locked" pages instead of the "Private_Clean" pages (based on a 
configuration option). The question is: should even more changes be 
made to the way the memory footprint is calculated? For instance, I 
believe the kernel can also write dirty pages back to disk (if they are 
backed by a file), making them clean so that it can later free them. 
Should I open a JIRA to have some discussion on this topic?
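
For illustration, here is a minimal standalone sketch (not the actual 
ProcfsBasedProcessTree code; the class and method names are made up) of 
what summing "Locked" from /proc/<pid>/smaps could look like:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SmapsLockedSum {

    // Sums all "Locked:" lines (reported in kB) of /proc/<pid>/smaps.
    // On older kernels (e.g. 2.6.x) the field is missing and the result is 0,
    // so a real patch would need a configurable fallback to the current behavior.
    public static long lockedKb(String pid) throws IOException {
        long totalKb = 0;
        for (String line : Files.readAllLines(Paths.get("/proc", pid, "smaps"))) {
            if (line.startsWith("Locked:")) {
                // Line format: "Locked:            12544 kB"
                String[] parts = line.trim().split("\\s+");
                totalKb += Long.parseLong(parts[1]);
            }
        }
        return totalKb;
    }

    public static void main(String[] args) throws IOException {
        String pid = args.length > 0 ? args[0] : "self";
        System.out.println("Locked: " + lockedKb(pid) + " kB");
    }
}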

Regards,
  Jan


On 02/04/2016 07:20 PM, Chris Nauroth wrote:
> Hello Jan,
>
> I am moving this thread from user@hadoop.apache.org to
> yarn-dev@hadoop.apache.org, since it's less a question of general usage
> and more a question of internal code implementation details and possible
> enhancements.
>
> I think the issue is that it's not guaranteed in the general case that
> Private_Clean pages are easily evictable from page cache by the kernel.
> For example, if the pages have been pinned into RAM by calling mlock [1],
> then the kernel cannot evict them.  Since YARN can execute any code
> submitted by an application, including possibly code that calls mlock, it
> takes a cautious approach and assumes that these pages must be counted
> towards the process footprint.  Although your Spark use case won't mlock
> the pages (I assume), YARN doesn't have a way to identify this.
>
> Perhaps there is room for improvement here.  If there is a reliable way to
> count only mlock'ed pages, then perhaps that behavior could be added as
> another option in ProcfsBasedProcessTree.  Off the top of my head, I can't
> think of a reliable way to do this, and I can't research it further
> immediately.  Do others on the thread have ideas?
>
> --Chris Nauroth
>
> [1] http://linux.die.net/man/2/mlock
>
>
>
>
> On 2/4/16, 5:11 AM, "Jan Lukavský" <ja...@firma.seznam.cz> wrote:
>
>> Hello,
>>
>> I have a question about the way LinuxResourceCalculatorPlugin calculates
>> memory consumed by process tree (it is calculated via
>> ProcfsBasedProcessTree class). When we enable caching (disk) in apache
>> spark jobs run on YARN cluster, the node manager starts to kill the
>> containers while reading the cached data, because of "Container is
>> running beyond memory limits ...". The reason is that even if we enable
>> parsing of the smaps file
>> (yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled)
>> the ProcfsBasedProcessTree calculates mmaped read-only pages as consumed
>> by the process tree, while spark uses FileChannel.map(MapMode.READ_ONLY)
>> to read the cached data. The JVM then consumes *a lot* more memory than
>> the configured heap size (and it cannot be really controlled), but this
>> memory is IMO not really consumed by the process, the kernel can reclaim
>> these pages, if needed. My question is - is there any explicit reason
>> why "Private_Clean" pages are calculated as consumed by process tree? I
>> patched the ProcfsBasedProcessTree not to calculate them, but I don't
>> know if this is the "correct" solution.
>>
>> Thanks for opinions,
>>   cheers,
>>   Jan
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
>> For additional commands, e-mail: user-help@hadoop.apache.org
>>
>>


Re: ProcFsBasedProcessTree and clean pages in smaps

Posted by Chris Nauroth <cn...@hortonworks.com>.
Hello Jan,

I am moving this thread from user@hadoop.apache.org to
yarn-dev@hadoop.apache.org, since it's less a question of general usage
and more a question of internal code implementation details and possible
enhancements.

I think the issue is that it's not guaranteed in the general case that
Private_Clean pages are easily evictable from page cache by the kernel.
For example, if the pages have been pinned into RAM by calling mlock [1],
then the kernel cannot evict them.  Since YARN can execute any code
submitted by an application, including possibly code that calls mlock, it
takes a cautious approach and assumes that these pages must be counted
towards the process footprint.  Although your Spark use case won't mlock
the pages (I assume), YARN doesn't have a way to identify this.

Perhaps there is room for improvement here.  If there is a reliable way to
count only mlock'ed pages, then perhaps that behavior could be added as
another option in ProcfsBasedProcessTree.  Off the top of my head, I can't
think of a reliable way to do this, and I can't research it further
immediately.  Do others on the thread have ideas?

--Chris Nauroth

[1] http://linux.die.net/man/2/mlock




On 2/4/16, 5:11 AM, "Jan Lukavský" <ja...@firma.seznam.cz> wrote:

>Hello,
>
>I have a question about the way LinuxResourceCalculatorPlugin calculates
>memory consumed by process tree (it is calculated via
>ProcfsBasedProcessTree class). When we enable caching (disk) in apache
>spark jobs run on YARN cluster, the node manager starts to kill the
>containers while reading the cached data, because of "Container is
>running beyond memory limits ...". The reason is that even if we enable
>parsing of the smaps file
>(yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled)
>the ProcfsBasedProcessTree calculates mmaped read-only pages as consumed
>by the process tree, while spark uses FileChannel.map(MapMode.READ_ONLY)
>to read the cached data. The JVM then consumes *a lot* more memory than
>the configured heap size (and it cannot be really controlled), but this
>memory is IMO not really consumed by the process, the kernel can reclaim
>these pages, if needed. My question is - is there any explicit reason
>why "Private_Clean" pages are calculated as consumed by process tree? I
>patched the ProcfsBasedProcessTree not to calculate them, but I don't
>know if this is the "correct" solution.
>
>Thanks for opinions,
>  cheers,
>  Jan
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
>For additional commands, e-mail: user-help@hadoop.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
For additional commands, e-mail: user-help@hadoop.apache.org

