Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2018/06/08 00:03:00 UTC

[jira] [Commented] (KUDU-2466) Fault tolerant scanners can over-allocate memory and crash a cluster

    [ https://issues.apache.org/jira/browse/KUDU-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505520#comment-16505520 ] 

Todd Lipcon commented on KUDU-2466:
-----------------------------------

On a semi-related note, do you think we should introduce some parameter whereby we forcibly exit a Kudu process if its memory consumption exceeds 2x the configured memory limit, or some other conservative threshold? Perhaps only if we also detect swapping on the system? In this case we were >6x the limit.
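
To make the idea concrete, here is a rough sketch of what such a check might look like. It is only an illustration: the function name, the `hard_limit_ratio` and `require_swapping` parameters, and the way swapping is reported are all hypothetical, not existing Kudu flags or code.

```python
GIB = 1024 ** 3


def should_force_exit(consumed_bytes, limit_bytes, swapping,
                      hard_limit_ratio=2.0, require_swapping=True):
    """Decide whether the process should exit (hypothetical policy sketch).

    consumed_bytes:   current memory consumption of the process
    limit_bytes:      the configured memory limit
    swapping:         True if swapping was detected on the system
    hard_limit_ratio: multiple of the limit treated as the hard threshold
    require_swapping: if True, only exit when swapping is also detected
    """
    if limit_bytes <= 0:
        # No meaningful limit configured, so never force an exit.
        return False
    over_threshold = consumed_bytes > hard_limit_ratio * limit_bytes
    if require_swapping:
        return over_threshold and swapping
    return over_threshold


# Numbers from this issue: 400 GB consumed against a 60 GB limit.
print(should_force_exit(400 * GIB, 60 * GIB, swapping=True))
```

With the default 2x ratio, the 400 GB vs. 60 GB case from this issue is well past the threshold, while a process sitting modestly above its limit would be left alone.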

> Fault tolerant scanners can over-allocate memory and crash a cluster
> --------------------------------------------------------------------
>
>                 Key: KUDU-2466
>                 URL: https://issues.apache.org/jira/browse/KUDU-2466
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Grant Henke
>            Priority: Critical
>
> When testing a Spark job with fault tolerant scanners enabled, reading a large table (~1.5 TB replicated) with many columns used up all of the memory on the tablet servers. 400 GB of total memory was being consumed even though the memory limit was configured at 60 GB. This impacted all services on the machines, making the cluster effectively unusable. Killing the job running the scans did not free the memory. However, restarting the tablet servers resulted in a healthy cluster.
>  
> Based on a chat with [~tlipcon], [~jdcryans], and [~mpercy], it looks like we are not lazy in MergeIterator initialization, and we could fix this by initializing the merger lazily based on rowset bounds. This would limit the number of concurrently open scanners to O(rowset height).
>  
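
For illustration, the lazy-merge idea from the description could be sketched roughly as follows. This is not Kudu's MergeIterator: `RowSet` and `lazy_merge` are hypothetical stand-ins, keys are plain comparable values, and each rowset is assumed to be non-empty with a known lower bound. A rowset is opened only once its lower bound could precede the next key to be emitted, so the number of rowsets open at the same time is bounded by how many overlap at a given key.

```python
import heapq

_EXHAUSTED = object()  # sentinel marking an exhausted rowset iterator


class RowSet:
    """Hypothetical stand-in for a rowset: sorted keys with known bounds."""

    def __init__(self, keys):
        self.keys = sorted(keys)       # assumed non-empty
        self.min_key = self.keys[0]
        self.max_key = self.keys[-1]


def lazy_merge(rowsets, stats=None):
    """Yield all keys in sorted order, opening each rowset only when needed.

    If a dict is passed as `stats`, stats["max_open"] records the largest
    number of rowsets that were open at the same time.
    """
    pending = sorted(rowsets, key=lambda rs: rs.min_key)
    next_pending = 0
    heap = []  # entries: (current_key, rowset_index, iterator)
    while heap or next_pending < len(pending):
        # Open every unopened rowset whose lower bound is not greater than
        # the smallest key currently available (or any, if nothing is open).
        while next_pending < len(pending) and (
            not heap or pending[next_pending].min_key <= heap[0][0]
        ):
            it = iter(pending[next_pending].keys)
            heapq.heappush(heap, (next(it), next_pending, it))
            next_pending += 1
            if stats is not None:
                stats["max_open"] = max(stats.get("max_open", 0), len(heap))
        key, idx, it = heapq.heappop(heap)
        yield key
        following = next(it, _EXHAUSTED)
        if following is not _EXHAUSTED:
            heapq.heappush(heap, (following, idx, it))


rowsets = [RowSet([1, 2, 3]), RowSet([10, 11, 12]), RowSet([2, 4]), RowSet([20, 21])]
stats = {}
merged = list(lazy_merge(rowsets, stats))
print(merged, stats)
```

In this small example only two of the four rowsets are ever open at once, because at most two of them overlap at any key. An eager merger would open all four up front, which is the behavior the description identifies as the source of the memory blow-up.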


