Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:13:42 UTC

[jira] [Resolved] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame

     [ https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-12519.
----------------------------------
    Resolution: Incomplete

> "Managed memory leak detected" when using distinct on PySpark DataFrame
> -----------------------------------------------------------------------
>
>                 Key: SPARK-12519
>                 URL: https://issues.apache.org/jira/browse/SPARK-12519
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.5.2
>         Environment: OS X 10.9.5, Java 1.8.0_66
>            Reporter: Paul Shearer
>            Priority: Major
>              Labels: bulk-closed
>
> After running the distinct() method to transform a DataFrame, subsequent actions like count() and show() may report a managed memory leak. Here is a minimal example that reproduces the bug on my machine:
> h1. Script
> {noformat}
> logger = sc._jvm.org.apache.log4j
> logger.LogManager.getLogger("org").setLevel(logger.Level.WARN)
> logger.LogManager.getLogger("akka").setLevel(logger.Level.WARN)
> import string
> import random
> def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
>     return ''.join(random.choice(chars) for _ in range(size))
> nrow   = 80000
> ncol   = 20
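> ndrow  = 40000 # number of distinct rows
> # generate ndrow*ncol random cells, then build nrow rows that cycle through the ndrow distinct rows
> tmp = [id_generator() for i in xrange(ndrow*ncol)]
> tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in xrange(nrow)]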
> dat = sc.parallelize(tmp,1000).toDF()
> dat = dat.distinct() # if this line is commented out, no memory leak will be reported
> # dat = dat.rdd.distinct().toDF() # if this line is used instead of the above, no leak
> ct = dat.count()
> print ct  
> # memory leak warning prints at this point in the code
> dat.show()  
> {noformat}
> h1. Output
> When this script is run in PySpark (with IPython kernel), I get this error:
> {noformat}
> $ pyspark --executor-memory 12G --driver-memory 12G
> Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) 
> Type "copyright", "credits" or "license" for more information.
> IPython 4.0.0 -- An enhanced Interactive Python.
> ?         -> Introduction and overview of IPython's features.
> %quickref -> Quick reference.
> help      -> Python's own help system.
> object?   -> Details about 'object', use 'object??' for extra details.
> <<<... usual loading info...>>>
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
>       /_/
> Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
> SparkContext available as sc, SQLContext available as sqlContext.
> In [1]: execfile('bugtest.py')
> 40000
> 15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 bytes, TID = 2202
> +------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
> |    _1|    _2|    _3|    _4|    _5|    _6|    _7|    _8|    _9|   _10|   _11|   _12|   _13|   _14|   _15|   _16|   _17|   _18|   _19|   _20|
> +------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
> |83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
> |BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
> |IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
> |S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR|
> |U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV|
> |3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L|
> |XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL|
> |4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL|
> |USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH|
> |0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B9T|EHVBJF|R07A6C|6LS1FL|1NWXKJ|X7TIWZ|MPVWCO|RSO4F9|J5DQG6|AGMXTS|MFFRMX|PEDHGU|
> |1LQTDY|JV1HJY|7OH9HL|0AUWC7|LQFF5G|DUK4GW|HU6VLJ|PHY36G|BLMOYU|PY7E64|Y6XHYS|3IA38F|RF4LQ5|PIXEM2|0H5GIW|6V3M9C|0VBIUC|U4ZWRH|68M496|6UUVWZ|
> |B7A7TT|9K5MRI|8CJWX2|YUZ8SY|JLB0MX|3JNIN6|PJP0S5|9W7N5C|LIJSXB|488P8Y|PHWN5N|E6TF76|FGYZQ2|MGDN65|YNLUJE|5D6455|JI4J2K|C3J8K8|BTJ131|D5C7CD|
> |G9AKQ5|UPEQDN|JAWFI2|I0EKX2|YG8TN4|8NNJBO|X3GMYR|RXG2RX|CRS9US|53VX2Q|S72E08|H5PR14|JRDDMT|Q8G6PR|KOJA0W|1U4AX8|844N9D|SKN5F7|H0C29Z|7U7GHH|
> |A10ZUQ|HEI32J|VP99PD|44UP47|4W5BPO|X0QE8Q|H3UQVM|47VU9U|3AUPR8|TCGT7L|65WLUU|6PX6IW|5NCTC7|ES2S38|T86EI9|G20RFI|SX2V3V|5XT724|HV8HVS|T3JYJD|
> |3USH3X|NHXB4D|QPL3QC|8CN92J|MJF9JZ|DFA2IV|XT7C4S|CUB4IJ|4BD3OR|T3EK2S|V81146|LWXTMJ|PCVJ5N|R8H3H6|0W5DLU|GVAO4D|I7SNKJ|6TLMAV|E57PMA|OGCVQM|
> |FZEDSN|WO4JEN|000HBA|HA2GAN|5ROPXM|5K6NUG|2HWCJ0|OPX5AT|6PT5ZV|HGB74S|FCQT9S|NNODZP|G0ZMSJ|SHIFDQ|MYSHAT|KZDNA4|M25MPR|4XD9J9|JBFZZ0|XLIE31|
> |6UPQWB|BEV7YK|Q0JHJG|4C91TM|7HBK81|RIKZ9D|ZK96YJ|O4KZ48|GUYUHO|GYYO8P|4O1QUM|74I38Z|CAXQDE|URVY7R|PQ4WM4|4QOQ81|4PPV8B|SWKFCD|S8TC2W|QTJIS6|
> |L2DC9N|N0NQSH|N3SMU0|VRSSPM|7TGRZ9|1FP90A|Z9KB0U|CWOH6I|O2WNSY|IJEUNA|MTJQXG|CAT0VD|5SL8A0|R6SX6H|9ZSVL1|HWPTBR|4SBQPN|4GPD0Z|ZQ72K5|EIVYSE|
> |X6MH6R|VM5M86|ZV1H22|Z5V1FX|XRZSGC|L39Q1R|1OT5XB|84NY6I|IXKYXQ|KY2U4G|F13S00|CZRR3E|ZIAVU0|DU2BAB|27KBZ8|XWBB7G|09V69R|LTXJ4U|8GP3EM|P3WVAX|
> |1IPOKL|9EIG2Q|UQJV00|RXJGCK|X20VBH|CZB7SQ|THZ95A|V90YSH|9QTKCW|0RLYJO|WSTNYK|UXZYST|WT8OHL|KE31OO|C0ZKRE|9VSDJF|6Z3JAR|RR0KMB|R3J61U|EPNRZL|
> +------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
> only showing top 20 rows
> {noformat}
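
The row-building step in the script above can be checked without Spark. The following is a minimal sketch, not part of the original report: it uses Python 3 syntax (the report ran Python 2.7.10), much smaller sizes, and a hypothetical helper named build_rows with a fixed seed added only for repeatability. It shows that the list handed to sc.parallelize holds nrow rows that cycle through at most ndrow distinct rows, which is why count() after distinct() printed 40000.

```python
import random
import string


def id_generator(size=6, chars=string.ascii_uppercase + string.digits, rng=random):
    """Return a random string of uppercase letters and digits, as in the report."""
    return ''.join(rng.choice(chars) for _ in range(size))


def build_rows(nrow, ncol, ndrow, seed=0):
    """Hypothetical helper: build nrow tuples of ncol cells cycling through ndrow base rows."""
    rng = random.Random(seed)  # fixed seed is an assumption, used only for repeatability
    cells = [id_generator(rng=rng) for _ in range(ndrow * ncol)]
    return [
        tuple(cells[ncol * (i % ndrow): ncol * (i % ndrow) + ncol])
        for i in range(nrow)
    ]


# Sizes scaled down from the report (80000 rows, 20 columns, 40000 distinct rows).
rows = build_rows(nrow=800, ncol=20, ndrow=400)
print(len(rows), len(set(rows)))  # total rows, then number of distinct rows
```

Because row i reuses the cells of row i % ndrow, the second half of the list repeats the first half, so the distinct count cannot exceed ndrow.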


