Posted to issues@spark.apache.org by "Parth Gandhi (JIRA)" <ji...@apache.org> on 2019/03/05 19:48:00 UTC

[jira] [Resolved] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

     [ https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Parth Gandhi resolved SPARK-26947.
----------------------------------
    Resolution: Invalid

> Pyspark KMeans Clustering job fails on large values of k
> --------------------------------------------------------
>
>                 Key: SPARK-26947
>                 URL: https://issues.apache.org/jira/browse/SPARK-26947
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, PySpark
>    Affects Versions: 2.4.0
>            Reporter: Parth Gandhi
>            Priority: Minor
>         Attachments: clustering_app.py
>
>
> We recently had a case where a user's PySpark job running KMeans clustering was failing for large values of k. I was able to reproduce the same issue with a dummy dataset, and have attached the code as well as the data to the JIRA.
>  
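> As a hedged sketch (the attached clustering_app.py is the actual reproducer; the input format, feature column, and iteration count below are assumptions), the failing pattern boils down to fitting a KMeans model with a large k and then collecting its centers:
> {code:python}
> # Minimal sketch of the failing pattern; clustering_app.py (attached) is the
> # actual reproducer. Input format, column name, and maxIter are illustrative.
> from pyspark.sql import SparkSession
> from pyspark.ml.clustering import KMeans
>
> spark = SparkSession.builder.appName("kmeans-large-k").getOrCreate()
>
> # ~90 MB of feature vectors (the real job reads hdfs:///user/username/part-v001x)
> documents = spark.read.parquet("hdfs:///user/username/part-v001x")
>
> km_model = KMeans(k=10000, maxIter=20, featuresCol="features").fit(documents)
>
> # Training itself succeeds; the job dies here, when all 10000 dense centers
> # are serialized from the JVM back to Python in a single py4j response.
> cluster_Centers = km_model.clusterCenters()
> {code}
>
> The Java stack trace from the failing run: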
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
> 	at java.util.Arrays.copyOf(Arrays.java:3332)
> 	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
> 	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
> 	at java.lang.StringBuilder.append(StringBuilder.java:202)
> 	at py4j.Protocol.getOutputCommand(Protocol.java:328)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:81)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:238)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> The corresponding Python traceback:
> {code:python}
> Traceback (most recent call last):
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
>     raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
>     response = connection.send_command(command)
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
>     "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in <module>
>     main(args)
>   File "clustering_app.py", line 145, in main
>     run_clustering(sc, args.input_path, args.output_path, args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
>     clustersTable, cluster_Centers = clustering(sc, documents, output_path, k, max_iter)
>   File "clustering_app.py", line 68, in clustering
>     cluster_Centers = km_model.clusterCenters()
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py", line 337, in clusterCenters
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py", line 55, in _call_java
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py", line 109, in _java2py
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:bash}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf spark.executor.memory=20g --conf spark.driver.memory=20g --conf spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g ~/clustering_app.py --input_path hdfs:///user/username/part-v001x --output_path hdfs:///user/username --num_clusters_list 10000
> {code}
> The input dataset is approximately 90 MB in size, and the heap memory assigned to both the driver and the executors is 20 GB (plus a 4 GB overhead each, per the command above). The job only fails for large values of k.
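> The Java trace points at py4j's Protocol.getOutputCommand: the OOM strikes while the JVM builds the single response string that carries all k pickled cluster centers back to Python. A back-of-the-envelope estimate shows why this grows with k rather than with the input size; the feature dimension d below is a hypothetical value, since the dataset's real dimensionality is not stated here:
> {code:python}
> # Rough estimate of the driver-side footprint of clusterCenters().
> # d is an assumed feature dimension, for illustration only.
> k = 10000      # from --num_clusters_list
> d = 100000     # hypothetical dimensionality (e.g. a large text vocabulary)
>
> raw_bytes = k * d * 8  # KMeans centers are dense double vectors: ~8 GB here
>
> # py4j's text protocol turns the pickled result into one Java String built up
> # in a StringBuilder, so the transient heap usage is a multiple of the raw
> # payload -- enough to exhaust even a 20 GB driver heap.
> print("raw centers: %.1f GB" % (raw_bytes / 1e9))
> {code}
> Whatever the real dimensionality, the k dense centers materialized on the driver grow linearly with k, which matches the job failing only once k is large.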



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org