Posted to issues@systemml.apache.org by "Nakul Jindal (JIRA)" <ji...@apache.org> on 2016/09/14 02:35:20 UTC

[jira] [Created] (SYSTEMML-918) Severe performance drop when using csv matrix -> spark df -> systemml binary block

Nakul Jindal created SYSTEMML-918:
-------------------------------------

             Summary: Severe performance drop when using csv matrix -> spark df -> systemml binary block
                 Key: SYSTEMML-918
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-918
             Project: SystemML
          Issue Type: Improvement
          Components: Runtime
            Reporter: Nakul Jindal


Imran is trying to read a CSV file that encodes a matrix.
The dimensions of the matrix are 10000 x 784.
Here is what the metadata (.mtd) file looks like:
{
    "data_type": "matrix",
    "value_type": "double",
    "rows": 10000,
    "cols": 784,
    "format": "csv",
    "header": false,
    "sep": ",",
    "description": { "author": "SystemML" }
}

There are two ways to read this matrix into the [tSNE|https://github.com/iyounus/incubator-systemml/blob/SYSTEMML-831_Implement_tSNE/scripts/staging/tSNE.dml#L25] algorithm as the variable X.

1. The CSV file is read into a DataFrame, and the MLContext API is then used to convert it to SystemML's binary block format. Here is the code used to create the DataFrame and run the script:
{code}
import java.util.Calendar
import org.apache.sysml.api.mlcontext.ScriptFactory.dml

val X0 = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "false")     // the file has no header row
    .option("inferSchema", "true") // automatically infer column types
    .load("hdfs://rr-dense1/user/iyounus/data/mnist_test.csv")
// code for tsne ..........
val X = X0.drop(X0.col("C784")) // drop the last column
val tsneScript = dml(tsne).in("X", X).out("Y", "C")
println(Calendar.getInstance().get(Calendar.MINUTE)) // coarse timing marker
val res = ml.execute(tsneScript)
{code}
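
One thing that may be worth trying while this is investigated (a sketch, not a verified fix): the MLContext Script.in overload that accepts a MatrixMetadata lets the known dimensions be passed along with the DataFrame, which should spare the converter an extra pass over the data to determine them:
{code}
import org.apache.sysml.api.mlcontext.MatrixMetadata

// Dimensions are already known from the .mtd file; passing them explicitly
// means the DataFrame-to-binary-block converter does not have to compute them.
val meta = new MatrixMetadata(10000, 784)
val tsneScript = dml(tsne).in("X", X, meta).out("Y", "C")
val res = ml.execute(tsneScript)
{code}
Even if this helps, it would only reduce the cost of the conversion, not explain it, so the converter itself still needs profiling.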

2. A DML read statement reads the CSV file directly (sketched below).
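
For reference, the direct read is a one-liner in DML (a minimal sketch; with the .mtd file above sitting next to the CSV, the format arguments could also be omitted):
{code}
# read the CSV directly; format/sep/header mirror the .mtd file above
X = read("hdfs://rr-dense1/user/iyounus/data/mnist_test.csv",
         format="csv", sep=",", header=FALSE)
{code}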

Option 2 completes almost instantaneously. Option 1 takes about 8 minutes on a beefy 7-node cluster with 40 GB allocated to the driver and 60 GB to the executors.

Looking at the Spark web UI and following the code path leads to this function:
https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L874-L879

This needs some investigation.
[~iyounus], [~fschueler]



