You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Asher Krim <ak...@hubspot.com> on 2017/01/22 17:35:11 UTC

Spark 1.6.3 Driver OOM on createDataFrame

Hi All,

There seems to be a bug in Spark 1.6.3 which causes the driver to OOM when
creating a dataframe using a lot of data in memory on the driver. Examining
a heap dump, it looks like the driver is filled with multiple copies of the
data. The following java code reproduces the bug:

  public void run() {
    try (JavaSparkContext sc = getSparkContext()) {
      SQLContext sqlContext = new SQLContext(sc);

      DataFrame df = sqlContext.createDataFrame(generateData().stream()
              .map(floats -> RowFactory.create(floats))
              .collect(Collectors.toList()),
          DataTypes.createStructType(new StructField[] { VECTOR_FIELD }));

      LOG.info("successfully parallelized {} rows", df.count());
    }

  }

private List<List<Float>> generateData() { List<List<Float>> data = new
ArrayList<>(3_000_000); for (int i = 0; i < 3_000_000; i++) { List<Float>
row = new ArrayList<>(300); for (int j = 0; j < 300; j++) { row.add(random.
nextFloat()); } data.add(row); }


Increasing the driver memory to insane values (28g) doesn't help. I tested
in Spark 2 and the problem seems to have been solved, however I'm not sure
which issue is responsible for solving it. I assume it's one of these:
https://issues.apache.org/jira/browse/SPARK-12511?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%202.0.0%20AND%20text%20~%20%22OOM%22

The reason this is an issue is because some machine learning models are
represented as large-ish local data structures on the driver, so this bug
is encountered while attempting to save them. Unfortunately, using mllib
instead of ml is not an option since some mllib algorithms also rely on the
dataframe API for persisting the model (such as word2vec and LDA), even
though mllib is supposed to be based on RDDs. This makes these algorithms
unusable for anything larger than toy examples in < Spark 2.

If anyone is familiar with this bug, I would really appreciate it if they
could point me in the direction of the pr that fixed it.

Is a 1.6.4 release planned?
Would be possible to backport the dataframe bugfix?

Thanks,
Asher Krim
Senior Software Engineer