Posted to issues@systemml.apache.org by "Mike Dusenberry (JIRA)" <ji...@apache.org> on 2016/07/07 00:01:24 UTC

[jira] [Commented] (SYSTEMML-775) Distribute Data for spark

    [ https://issues.apache.org/jira/browse/SYSTEMML-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365361#comment-15365361 ] 

Mike Dusenberry commented on SYSTEMML-775:
------------------------------------------

Hi [~johannes.tud]!

Thanks for reaching out.  For parallel computation, SystemML will intentionally use only the driver node when that is beneficial performance-wise, even if a cluster is present.  One such case, which is likely what is happening in this scenario, is when the input data is small and the subsequent intermediate results of operations are also small.  Even when the computation is carried out on the single driver node, SystemML still operates in parallel on that node using threads.  Regardless, in the example given, the Spark DataFrame ({{dff}}) that you start with will be parallelized across the cluster as necessary by Spark, so no more work is needed on that part.  I would simply condense your example to the following:

{code}
ml.reset()
// Pass in DataFrame directly
ml.registerInput("X", dff)
ml.execute..... // execute script
{code}

This will simply use the parallelized DataFrame and convert it internally to the SystemML format to be used in the script as a matrix.
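For completeness, a minimal end-to-end sketch of that flow might look like the following.  Note that the script path, the output variable name {{"Y"}}, and the {{registerOutput}} call are illustrative assumptions for this sketch, not details from your setup:

{code}
import org.apache.sysml.api.MLContext

// Create an MLContext against the existing SparkContext
val ml = new MLContext(sc)

ml.reset()
// Register the Spark DataFrame directly; SystemML converts it
// internally to its binary-block matrix format.
ml.registerInput("X", dff)
// Register any output the DML script produces ("Y" is a hypothetical name).
ml.registerOutput("Y")
// Execute a DML script ("script.dml" is a placeholder path).
val out = ml.execute("script.dml")
{code}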

---

Note: *If* you use a large DataFrame and *intend to use it for multiple SystemML invocations*, it could be helpful to explicitly convert it to the SystemML format first:

{code}
// Convert DataFrame to SystemML format for performance across several uses
val sysMlMatrix = RDDConverterUtils.dataFrameToBinaryBlock(sc, dff, mc, false)

// Register and execute
ml.reset()
ml.registerInput("X", sysMlMatrix, numRows, numCols)
ml.execute....

// Register and execute another time using the same converted matrix
ml.reset()
ml.registerInput("X", sysMlMatrix, numRows, numCols)
ml.execute....  // another script

// Register and execute again using the same converted matrix
ml.reset()
ml.registerInput("X", sysMlMatrix, numRows, numCols)
ml.execute....  // another script
...
{code}
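Additionally, if the converted matrix is reused across several invocations as above, it can help to cache the RDD so Spark keeps the conversion result in memory rather than recomputing it each time (standard Spark persistence, sketched here):

{code}
// Keep the converted binary-block RDD in memory across invocations
sysMlMatrix.cache()
{code}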

> Distribute Data for spark
> -------------------------
>
>                 Key: SYSTEMML-775
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-775
>             Project: SystemML
>          Issue Type: Question
>          Components: Algorithms
>    Affects Versions: SystemML 0.10
>            Reporter: Johannes Wilke
>            Priority: Minor
>
> Hi!
> I have to calculate in parallel on data on a Spark cluster with SystemML.
> The program works fine on the cluster, but not in parallel, because I don't know how to distribute my data through this cluster to use it with SystemML.
> In Scala I have tried the following:
>  val sysMlMatrix = RDDConverterUtils.dataFrameToBinaryBlock(sc, dff, mc, false)
>  sysMlMatrix.saveAsObjectFile("/home/hduser/test.obj")
>  val sysMlMatrix2 = sc.sequenceFile[MatrixIndexes, MatrixBlock]("/home/hduser/test.obj",1000);
>  val sysMlMatrix3 = JavaPairRDD.fromRDD(sysMlMatrix2)
>     ml.reset()
>     ml.registerInput("X", sysMlMatrix3, numRows, numCols)
> But I get a ClassCastException, when I try to load the object File.
> My Matrix has 1000 rows and I want to work in parallel on these rows.
> How can I reach this? I hope you can help me!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)