Posted to dev@systemml.apache.org by Sourav Mazumder <so...@gmail.com> on 2015/12/08 04:29:36 UTC

Using GLM with Spark

Hi,

Trying to use GLM with Spark.

I went through the documentation at
http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
and I see that inputs like X and Y have to be supplied via a file, and the
file has to be in HDFS.

Is this understanding correct? Can't X and Y be supplied using a DataFrame
from a Spark context (as in the LinearRegression example in
http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html#train-using-systemml-linear-regression-algorithm)?

Regards,
Sourav

Re: Using GLM with Spark

Posted by Niketan Pansare <np...@us.ibm.com>.
Hi Sourav,

I guess you found the answer to question (a) based on the recent email
threads.
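
For readers of this archive, here is a minimal sketch of that approach
(assuming the registerOutput/getDF methods of this MLContext API; the DML
variable name "beta_out" is an assumption, so check GLM.dml for the variable
it actually writes):

val ml = new MLContext(sc)
ml.registerInput("X", xDF)
ml.registerInput("Y", yDF)
// Keep the coefficient matrix in memory instead of only writing it to the $B file.
ml.registerOutput("beta_out")
val out = ml.execute("GLM.dml", cmdLineParams)
// Pull the coefficients back as a DataFrame and feed them to GLM-predict.
val betaDF = out.getDF(sqlContext, "beta_out")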

>> b) What is the use of the parameter cmdLineParams? If I am supplying X and
y, the mandatory parameters, anyway, why do I need to pass this parameter
again?
Good question. While implementing MLContext, the key requirement was that the
DML script should remain the same irrespective of the invocation mechanism or
backend (i.e. MLContext, command line using Spark/Hadoop, standalone mode,
spark-shell, Jupyter, pyspark, etc.). This meant that we had to provide at
least two (or three if you count JMLC) mechanisms for supplying input matrices:
1. File name
2. RDD/DataFrame

Consider the following piece of DML code:
foo = read($bar)
or
fileFoo = $bar
foo = read(fileFoo)

Here, you can either call registerInput("foo", RDD) or registerInput("bar",
RDD). We decided to go with the former approach (I will skip the reasons for
now). To remain consistent with the semantics of dollar parameters, we
ought to throw an error if no value is provided for $bar; hence such values
need to be provided. I understand that in the above case we could avoid it,
because we know which variables are registered. But I think special-casing
such situations is a bad idea, as it can break the language semantics in
corner cases:
fileFoo = $bar + ".bak"
foo = read(fileFoo)
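
To make the consequence concrete, here is a minimal sketch of invoking the
first snippet above through MLContext (the script name "foo.dml" and the
DataFrame fooDF are made up for illustration): even though the registered
DataFrame is what is actually read, $bar still has to be given some value.

val ml = new MLContext(sc)
// Bind the DML variable "foo" to an in-memory DataFrame.
ml.registerInput("foo", fooDF)
// A placeholder value (here a single space) satisfies the dollar-parameter check.
ml.execute("foo.dml", Map("bar" -> " "))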

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar



From:	Sourav Mazumder <so...@gmail.com>
To:	dev@systemml.incubator.apache.org
Date:	12/08/2015 07:11 AM
Subject:	Re: Using GLM with Spark



Hi Niketan,

Thanks a lot again for detailed clarification and example.

I do suggest mentioning explicitly in the documentation that X and y can be
passed as a DataFrame/RDD in the case of Spark. It is not very clear from the
documentation. Right now the documentation sort of gives the idea that a
Hadoop cluster is needed to execute this, whereas I'm looking for an end-to-end
execution of SystemML using only Spark (without using Hadoop at all).

The next questions I have are:

a) How do I get back B after I execute GLM in Spark (ml.execute())?
I need to use it as an input to GLM-predict for scoring with the model, and
I don't want to incur additional I/O. Can I use something like ml.get(),
which would return B in matrix form?

b) What is the use of the parameter cmdLineParams? If I am supplying X and
y, the mandatory parameters, anyway, why do I need to pass this parameter
again?

Regards,
Sourav


On Mon, Dec 7, 2015 at 11:11 PM, Niketan Pansare <np...@us.ibm.com> wrote:

> Hi Sourav,
>
> Your understanding is correct, X and Y can be supplied either as a file or
> as a RDD/DataFrame. Each of these two mechanisms has its own benefits. The
> former mechanism (i.e. passing as file) pushes the reading/reblocking into
> the optimizer, while the latter mechanism allows for preprocessing of data
> (for example: using Spark SQL).
>
> Two use-cases when X and Y are supplied as files on HDFS:
> 1. Command-line invocation: $SPARK_HOME/bin/spark-submit --master ....
> SystemML.jar -f GLM.dml -nvargs dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.001
> tol=0.00000001 disp=1.0 moi=100 mii=10 X=INPUT_DIR/X Y=INPUT_DIR/Y
> B=OUTPUT_DIR/betas fmt=csv O=OUTPUT_DIR/stats Log=OUTPUT_DIR/log
>
> 2. Using MLContext but without registering X and Y as input. Instead we
> pass filenames as command-line parameters:
> > val ml = new MLContext(sc)
> > val cmdLineParams = Map("X" -> "INPUT_DIR/X", "Y" -> "INPUT_DIR/Y",
> >   "dfam" -> "2", "link" -> "2", ...)
> > ml.execute("GLM.dml", cmdLineParams)
>
> As mentioned earlier, X and Y can be provided as RDD/DataFrame as well.
> > val ml = new MLContext(sc)
> > ml.registerInput("X", xDF)
> > ml.registerInput("Y", yDF)
> > val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ",
> >   "dfam" -> "2", "link" -> "2", ...)
> > ml.execute("GLM.dml", cmdLineParams)
>
> One important thing that I must point out is the concept of "ifdef". It is
> explained in the section
> http://apache.github.io/incubator-systemml/dml-language-reference.html#command-line-arguments.
>
> Here is a snippet from the DML script for GLM
> (https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml):
> fileX = $X;
> fileY = $Y;
> fileO = ifdef ($O, " ");
> fmtB = ifdef ($fmt, "text");
> distribution_type = ifdef ($dfam, 1);
>
> The above DML code essentially says that $X and $Y are required parameters
> (a design decision that the GLM script writer made), whereas $fmt and $dfam
> are optional, as they are assigned default values when not explicitly
> provided. Both constructs are important tools in the arsenal of the DML
> script writer. By not guarding a dollar parameter with ifdef, the DML script
> writer ensures that the user has to provide its value (in this case, the
> file names for X and Y). This is why you will notice that I have provided a
> space for X, Y and B in the second MLContext snippet.
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
>
> From: Sourav Mazumder <so...@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/07/2015 07:30 PM
> Subject: Using GLM with Spark
> ------------------------------
>
>
>
> Hi,
>
> Trying to use GLM with Spark.
>
> I went through the documentation at
> http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
> and I see that inputs like X and Y have to be supplied via a file, and the
> file has to be in HDFS.
>
> Is this understanding correct? Can't X and Y be supplied using a DataFrame
> from a Spark context (as in the LinearRegression example in
> http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html#train-using-systemml-linear-regression-algorithm)?
>
> Regards,
> Sourav
>
>
>


Re: Using GLM with Spark

Posted by Sourav Mazumder <so...@gmail.com>.
Hi Niketan,

Thanks a lot again for detailed clarification and example.

I do suggest mentioning explicitly in the documentation that X and y can be
passed as a DataFrame/RDD in the case of Spark. It is not very clear from the
documentation. Right now the documentation sort of gives the idea that a
Hadoop cluster is needed to execute this, whereas I'm looking for an end-to-end
execution of SystemML using only Spark (without using Hadoop at all).

The next questions I have are:

a) How do I get back B after I execute GLM in Spark (ml.execute())?
I need to use it as an input to GLM-predict for scoring with the model, and
I don't want to incur additional I/O. Can I use something like ml.get(),
which would return B in matrix form?

b) What is the use of the parameter cmdLineParams? If I am supplying X and
y, the mandatory parameters, anyway, why do I need to pass this parameter
again?

Regards,
Sourav


On Mon, Dec 7, 2015 at 11:11 PM, Niketan Pansare <np...@us.ibm.com> wrote:

> Hi Sourav,
>
> Your understanding is correct, X and Y can be supplied either as a file or
> as a RDD/DataFrame. Each of these two mechanisms has its own benefits. The
> former mechanism (i.e. passing as file) pushes the reading/reblocking into
> the optimizer, while the latter mechanism allows for preprocessing of data
> (for example: using Spark SQL).
>
> Two use-cases when X and Y are supplied as files on HDFS:
> 1. Command-line invocation: $SPARK_HOME/bin/spark-submit --master ....
> SystemML.jar -f GLM.dml -nvargs dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.001
> tol=0.00000001 disp=1.0 moi=100 mii=10 X=INPUT_DIR/X Y=INPUT_DIR/Y
> B=OUTPUT_DIR/betas fmt=csv O=OUTPUT_DIR/stats Log=OUTPUT_DIR/log
>
> 2. Using MLContext but without registering X and Y as input. Instead we
> pass filenames as command-line parameters:
> > val ml = new MLContext(sc)
> > val cmdLineParams = Map("X" -> "INPUT_DIR/X", "Y" -> "INPUT_DIR/Y",
> >   "dfam" -> "2", "link" -> "2", ...)
> > ml.execute("GLM.dml", cmdLineParams)
>
> As mentioned earlier, X and Y can be provided as RDD/DataFrame as well.
> > val ml = new MLContext(sc)
> > ml.registerInput("X", xDF)
> > ml.registerInput("Y", yDF)
> > val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ",
> >   "dfam" -> "2", "link" -> "2", ...)
> > ml.execute("GLM.dml", cmdLineParams)
>
> One important thing that I must point out is the concept of "ifdef". It is
> explained in the section
> http://apache.github.io/incubator-systemml/dml-language-reference.html#command-line-arguments.
>
> Here is a snippet from the DML script for GLM:
> https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> fileX = $X;
> fileY = $Y;
> fileO = ifdef ($O, " ");
> fmtB = ifdef ($fmt, "text");
> distribution_type = ifdef ($dfam, 1);
>
> The above DML code essentially says that $X and $Y are required parameters
> (a design decision that the GLM script writer made), whereas $fmt and $dfam
> are optional, as they are assigned default values when not explicitly
> provided. Both constructs are important tools in the arsenal of the DML
> script writer. By not guarding a dollar parameter with ifdef, the DML script
> writer ensures that the user has to provide its value (in this case, the
> file names for X and Y). This is why you will notice that I have provided a
> space for X, Y and B in the second MLContext snippet.
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
>
> From: Sourav Mazumder <so...@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/07/2015 07:30 PM
> Subject: Using GLM with Spark
> ------------------------------
>
>
>
> Hi,
>
> Trying to use GLM with Spark.
>
> I went through the documentation at
> http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
> and I see that inputs like X and Y have to be supplied via a file, and the
> file has to be in HDFS.
>
> Is this understanding correct? Can't X and Y be supplied using a DataFrame
> from a Spark context (as in the LinearRegression example in
> http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html#train-using-systemml-linear-regression-algorithm)?
>
> Regards,
> Sourav
>
>
>

Re: Using GLM with Spark

Posted by Sourav Mazumder <so...@gmail.com>.
Hi Sirish,

This is cool.

I typically achieve the same thing using Spark MLlib/ML utilities such as
VectorAssembler and the DataFrame API.

So this brings up a question: when should one use a DML script for this type
of data preparation, and when should one use the libraries already available
in the existing platform? Any suggestions?
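
For illustration, a minimal Spark-side sketch of that kind of preparation (the
DataFrame rawDF and its column names are made up for this example); the
resulting xDF/yDF could then be registered with MLContext as in the snippets
quoted below:

import org.apache.spark.ml.feature.VectorAssembler

// Assemble the raw numeric columns into a single feature-vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income"))   // hypothetical column names
  .setOutputCol("features")
val assembled = assembler.transform(rawDF)

// Keep features and label separate, mirroring the X/Y split that GLM expects.
val xDF = assembled.select("features")
val yDF = assembled.select("label")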

Regards,
Sourav

On Tue, Dec 8, 2015 at 12:29 AM, Shirish Tatikonda <shirish.tatikonda@gmail.com> wrote:

> Hi Saurav,
>
> Just to add to Niketan's response, you can find a utility DML script to
> split a data set into X and Y at [1]. This obviously is useful only if you
> have one unified data set with both X and Y.
>
> [1]
>
> https://github.com/apache/incubator-systemml/blob/master/scripts/utils/splitXY.dml
>
> Shirish
> On Dec 7, 2015 11:11 PM, "Niketan Pansare" <np...@us.ibm.com> wrote:
>
> > Hi Sourav,
> >
> > Your understanding is correct, X and Y can be supplied either as a file or
> > as a RDD/DataFrame. Each of these two mechanisms has its own benefits. The
> > former mechanism (i.e. passing as file) pushes the reading/reblocking into
> > the optimizer, while the latter mechanism allows for preprocessing of data
> > (for example: using Spark SQL).
> >
> > Two use-cases when X and Y are supplied as files on HDFS:
> > 1. Command-line invocation: $SPARK_HOME/bin/spark-submit --master ....
> > SystemML.jar -f GLM.dml -nvargs dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.001
> > tol=0.00000001 disp=1.0 moi=100 mii=10 X=INPUT_DIR/X Y=INPUT_DIR/Y
> > B=OUTPUT_DIR/betas fmt=csv O=OUTPUT_DIR/stats Log=OUTPUT_DIR/log
> >
> > 2. Using MLContext but without registering X and Y as input. Instead we
> > pass filenames as command-line parameters:
> > > val ml = new MLContext(sc)
> > > val cmdLineParams = Map("X" -> "INPUT_DIR/X", "Y" -> "INPUT_DIR/Y",
> > >   "dfam" -> "2", "link" -> "2", ...)
> > > ml.execute("GLM.dml", cmdLineParams)
> >
> > As mentioned earlier, X and Y can be provided as RDD/DataFrame as well.
> > > val ml = new MLContext(sc)
> > > ml.registerInput("X", xDF)
> > > ml.registerInput("Y", yDF)
> > > val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ",
> > >   "dfam" -> "2", "link" -> "2", ...)
> > > ml.execute("GLM.dml", cmdLineParams)
> >
> > One important thing that I must point out is the concept of "ifdef". It is
> > explained in the section
> > http://apache.github.io/incubator-systemml/dml-language-reference.html#command-line-arguments.
> >
> > Here is a snippet from the DML script for GLM
> > (https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml):
> > fileX = $X;
> > fileY = $Y;
> > fileO = ifdef ($O, " ");
> > fmtB = ifdef ($fmt, "text");
> > distribution_type = ifdef ($dfam, 1);
> >
> > The above DML code essentially says that $X and $Y are required parameters
> > (a design decision that the GLM script writer made), whereas $fmt and $dfam
> > are optional, as they are assigned default values when not explicitly
> > provided. Both constructs are important tools in the arsenal of the DML
> > script writer. By not guarding a dollar parameter with ifdef, the DML
> > script writer ensures that the user has to provide its value (in this
> > case, the file names for X and Y). This is why you will notice that I have
> > provided a space for X, Y and B in the second MLContext snippet.
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> >
> > From: Sourav Mazumder <so...@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 12/07/2015 07:30 PM
> > Subject: Using GLM with Spark
> > ------------------------------
> >
> >
> >
> > Hi,
> >
> > Trying to use GLM with Spark.
> >
> > I went through the documentation at
> > http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
> > and I see that inputs like X and Y have to be supplied via a file, and the
> > file has to be in HDFS.
> >
> > Is this understanding correct? Can't X and Y be supplied using a DataFrame
> > from a Spark context (as in the LinearRegression example in
> > http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html#train-using-systemml-linear-regression-algorithm)?
> >
> > Regards,
> > Sourav
> >
> >
> >
>

Re: Using GLM with Spark

Posted by Shirish Tatikonda <sh...@gmail.com>.
Hi Saurav,

Just to add to Niketan's response, you can find a utility DML script to
split a data set into X and Y at [1]. This obviously is useful only if you
have one unified data set with both X and Y.

[1]
https://github.com/apache/incubator-systemml/blob/master/scripts/utils/splitXY.dml
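
A hedged sketch of invoking such a utility from MLContext (the parameter names
"X", "OX", "OY" and "y" below are illustrative placeholders; consult
splitXY.dml for the dollar parameters it actually declares):

val ml = new MLContext(sc)
// Parameter names here are assumptions for illustration only.
val splitParams = Map("X" -> "INPUT_DIR/data", "OX" -> "OUTPUT_DIR/X",
                      "OY" -> "OUTPUT_DIR/Y", "y" -> "51")
ml.execute("scripts/utils/splitXY.dml", splitParams)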

Shirish
On Dec 7, 2015 11:11 PM, "Niketan Pansare" <np...@us.ibm.com> wrote:

> Hi Sourav,
>
> Your understanding is correct, X and Y can be supplied either as a file or
> as a RDD/DataFrame. Each of these two mechanisms has its own benefits. The
> former mechanism (i.e. passing as file) pushes the reading/reblocking into
> the optimizer, while the latter mechanism allows for preprocessing of data
> (for example: using Spark SQL).
>
> Two use-cases when X and Y are supplied as files on HDFS:
> 1. Command-line invocation: $SPARK_HOME/bin/spark-submit --master ....
> SystemML.jar -f GLM.dml -nvargs dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.001
> tol=0.00000001 disp=1.0 moi=100 mii=10 X=INPUT_DIR/X Y=INPUT_DIR/Y
> B=OUTPUT_DIR/betas fmt=csv O=OUTPUT_DIR/stats Log=OUTPUT_DIR/log
>
> 2. Using MLContext but without registering X and Y as input. Instead we
> pass filenames as command-line parameters:
> > val ml = new MLContext(sc)
> > val cmdLineParams = Map("X" -> "INPUT_DIR/X", "Y" -> "INPUT_DIR/Y",
> >   "dfam" -> "2", "link" -> "2", ...)
> > ml.execute("GLM.dml", cmdLineParams)
>
> As mentioned earlier, X and Y can be provided as RDD/DataFrame as well.
> > val ml = new MLContext(sc)
> > ml.registerInput("X", xDF)
> > ml.registerInput("Y", yDF)
> > val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ",
> >   "dfam" -> "2", "link" -> "2", ...)
> > ml.execute("GLM.dml", cmdLineParams)
>
> One important thing that I must point out is the concept of "ifdef". It is
> explained in the section
> http://apache.github.io/incubator-systemml/dml-language-reference.html#command-line-arguments.
>
> Here is a snippet from the DML script for GLM:
> https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> fileX = $X;
> fileY = $Y;
> fileO = ifdef ($O, " ");
> fmtB = ifdef ($fmt, "text");
> distribution_type = ifdef ($dfam, 1);
>
> The above DML code essentially says that $X and $Y are required parameters
> (a design decision that the GLM script writer made), whereas $fmt and $dfam
> are optional, as they are assigned default values when not explicitly
> provided. Both constructs are important tools in the arsenal of the DML
> script writer. By not guarding a dollar parameter with ifdef, the DML script
> writer ensures that the user has to provide its value (in this case, the
> file names for X and Y). This is why you will notice that I have provided a
> space for X, Y and B in the second MLContext snippet.
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
>
> From: Sourav Mazumder <so...@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/07/2015 07:30 PM
> Subject: Using GLM with Spark
> ------------------------------
>
>
>
> Hi,
>
> Trying to use GLM with Spark.
>
> I went through the documentation at
> http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
> and I see that inputs like X and Y have to be supplied via a file, and the
> file has to be in HDFS.
>
> Is this understanding correct? Can't X and Y be supplied using a DataFrame
> from a Spark context (as in the LinearRegression example in
> http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html#train-using-systemml-linear-regression-algorithm)?
>
> Regards,
> Sourav
>
>
>

Re: Using GLM with Spark

Posted by Niketan Pansare <np...@us.ibm.com>.
Hi Sourav,

Your understanding is correct: X and Y can be supplied either as a file or
as an RDD/DataFrame. Each of these two mechanisms has its own benefits. The
former mechanism (i.e. passing a file) pushes the reading/reblocking into
the optimizer, while the latter mechanism allows for preprocessing of data
(for example, using Spark SQL).

Two use-cases when X and Y are supplied as files on HDFS:
1. Command-line invocation: $SPARK_HOME/bin/spark-submit --master ....
SystemML.jar -f GLM.dml -nvargs dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.001
tol=0.00000001 disp=1.0 moi=100 mii=10 X=INPUT_DIR/X Y=INPUT_DIR/Y
B=OUTPUT_DIR/betas fmt=csv O=OUTPUT_DIR/stats Log=OUTPUT_DIR/log

2. Using MLContext but without registering X and Y as input. Instead we
pass filenames as command-line parameters:
> val ml = new MLContext(sc)
> val cmdLineParams = Map("X" -> "INPUT_DIR/X", "Y" -> "INPUT_DIR/Y",
>   "dfam" -> "2", "link" -> "2", ...)
> ml.execute("GLM.dml", cmdLineParams)

As mentioned earlier, X and Y can be provided as RDD/DataFrame as well.
> val ml = new MLContext(sc)
> ml.registerInput("X", xDF)
> ml.registerInput("Y", yDF)
> val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ",
>   "dfam" -> "2", "link" -> "2", ...)
> ml.execute("GLM.dml", cmdLineParams)

One important thing that I must point out is the concept of "ifdef". It is
explained in the section
http://apache.github.io/incubator-systemml/dml-language-reference.html#command-line-arguments.

Here is a snippet from the DML script for GLM
(https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml):
fileX = $X;
fileY = $Y;
fileO = ifdef ($O, " ");
fmtB = ifdef ($fmt, "text");
distribution_type = ifdef ($dfam, 1);

The above DML code essentially says that $X and $Y are required parameters (a
design decision that the GLM script writer made), whereas $fmt and $dfam are
optional, as they are assigned default values when not explicitly provided.
Both constructs are important tools in the arsenal of the DML script
writer. By not guarding a dollar parameter with ifdef, the DML script
writer ensures that the user has to provide its value (in this case, the file
names for X and Y). This is why you will notice that I have provided a
space for X, Y and B in the second MLContext snippet.

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar



From:	Sourav Mazumder <so...@gmail.com>
To:	dev@systemml.incubator.apache.org
Date:	12/07/2015 07:30 PM
Subject:	Using GLM with Spark



Hi,

Trying to use GLM with Spark.

I went through the documentation at
http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
and I see that inputs like X and Y have to be supplied via a file, and the
file has to be in HDFS.

Is this understanding correct? Can't X and Y be supplied using a DataFrame
from a Spark context (as in the LinearRegression example in
http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html#train-using-systemml-linear-regression-algorithm)?

Regards,
Sourav