Posted to user@spark.apache.org by Riccardo Ferrari <fe...@gmail.com> on 2019/02/08 01:26:40 UTC

PySpark OOM when running PCA

Hi list,

I am having trouble running a PCA with PySpark. I am trying to reduce the
dimensionality of a matrix, since after one-hot encoding (OHE) my features
grow to roughly 40k columns.

Spark 2.2.0 standalone (Oracle JVM)
pyspark 2.2.0 from a Docker image (OpenJDK)

I'm starting the Spark session from the notebook; however, I make sure to set:

   - PYSPARK_SUBMIT_ARGS: "--packages ... --driver-memory 20G pyspark-shell"
   - sparkConf.set("spark.executor.memory", "24G")
   - sparkConf.set("spark.driver.memory", "20G")

My executors get 24 GB each per node, and my driver process starts with:
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp
/usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx20G
org.apache.spark.deploy.SparkSubmit --conf spark.executor.memory=24G ...
pyspark-shell

So I should have plenty of memory to play with; however, when running
PCA.fit I see the following in the Spark driver logs:
19/02/08 01:02:43 WARN TaskSetManager: Stage 29 contains a task of very
large size (142 KB). The maximum recommended task size is 100 KB.
19/02/08 01:02:43 WARN RowMatrix: 34771 columns will require at least 9672
megabytes of memory!
19/02/08 01:02:46 WARN RowMatrix: 34771 columns will require at least 9672
megabytes of memory!
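
As a sanity check, here is my back-of-the-envelope math (assuming the
warning is sizing the n x n Gramian that RowMatrix materializes as dense
doubles):

n = 34771
gramian_bytes = n * n * 8      # dense double-precision n x n Gramian
print(gramian_bytes / 1e6)     # ~9672 MB, matching the warning above

So roughly 9.7 GB, which I would expect to fit comfortably in a 20 GB
driver heap.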

It eventually fails with:
Py4JJavaError: An error occurred while calling o287.fit.
: java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
...
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:122)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:344)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponentsAndExplainedVariance(RowMatrix.scala:387)
    at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:48)
...
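
For completeness, the fit is essentially just the standard ml PCA estimator
(the column names and k below are placeholders, not my real values):

from pyspark.ml.feature import PCA

# "features" is the ~40k-wide vector column produced by the OHE stage
pca = PCA(k=100, inputCol="features", outputCol="pca_features")
model = pca.fit(df)  # this is the o287.fit call that eventually OOMs
reduced = model.transform(df)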

What am I missing?
Any hints would be much appreciated.