Posted to user@spark.apache.org by Riccardo Ferrari <fe...@gmail.com> on 2019/02/08 01:26:40 UTC
PySpark OOM when running PCA
Hi list,
I am having trouble running a PCA with pyspark. I am trying to reduce the
dimensionality of a matrix, since my feature vector after one-hot encoding (OHE) ends up ~40k columns wide.
Spark 2.2.0 standalone (Oracle JVM)
pyspark 2.2.0 from a Docker container (OpenJDK)
I'm starting the Spark session from the notebook; however, I make sure to set:
- PYSPARK_SUBMIT_ARGS: "--packages ... --driver-memory 20G pyspark-shell"
- sparkConf.set("spark.executor.memory", "24G")
- sparkConf.set("spark.driver.memory", "20G")
My executors get 24 GB per node, and my driver process starts with:
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp
/usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx20G
org.apache.spark.deploy.SparkSubmit --conf spark.executor.memory=24G ...
pyspark-shell
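Put together, the session setup looks roughly like this (a sketch; the `--packages` list is elided as above):

```python
import os

# Must be set before the JVM starts; "..." stands for my actual package list
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages ... --driver-memory 20G pyspark-shell"
)

from pyspark import SparkConf
from pyspark.sql import SparkSession

sparkConf = SparkConf()
sparkConf.set("spark.executor.memory", "24G")
sparkConf.set("spark.driver.memory", "20G")

spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
```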
So I should have plenty of memory to play with; however, when running
PCA.fit I see in the Spark driver logs:
19/02/08 01:02:43 WARN TaskSetManager: Stage 29 contains a task of very
large size (142 KB). The maximum recommended task size is 100 KB.
19/02/08 01:02:43 WARN RowMatrix: 34771 columns will require at least 9672
megabytes of memory!
19/02/08 01:02:46 WARN RowMatrix: 34771 columns will require at least 9672
megabytes of memory!
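The 9672 MB figure is consistent with the size of a dense Gramian: computeGramianMatrix materializes an n x n matrix of doubles, so for n = 34771 columns that is roughly n² × 8 bytes. A quick sanity check (plain Python, no Spark needed):

```python
# Sanity check on the RowMatrix warning: the Gramian (A^T A) is a dense
# n x n matrix of 8-byte doubles.
n = 34771                        # columns after OHE, from the warning
gramian_mb = n * n * 8 // 10**6  # bytes -> MB
print(gramian_mb)                # 9672, matching the log line
```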
Eventually it fails:
Py4JJavaError: An error occurred while calling o287.fit.
: java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    ...
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:122)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:344)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponentsAndExplainedVariance(RowMatrix.scala:387)
    at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:48)
    ...
What am I missing?
Any hints much appreciated,