Posted to dev@systemml.apache.org by Sourav Mazumder <so...@gmail.com> on 2015/12/08 21:23:34 UTC

Using GLM-predict

Hi,

I have used GLM.dml to create a model from some sample data. It returns
the matrix of betas, B.

Now I want to apply this matrix of betas to a new set of data points and
generate the predicted value of the dependent variable/observation.

When I checked GLM-predict, I could see that one can pass the feature
vectors for the new data set along with the matrix of betas.

But I could not see any way to get the predicted value of the dependent
variable/observation. The output parameter only supports a matrix of
predicted means/probabilities.

Is there a way to get the predicted value of the dependent
variable/observation from GLM-predict?

Regards,
Sourav

Re: Using GLM-predict

Posted by Sourav Mazumder <so...@gmail.com>.

Hi Shirish,

Thanks for your clarification.

I'm not sure exactly how to modify GLM-predict.dml to get some
prediction to start with. Ideally I would like the probability threshold to
be parameterized.

At lines 135-136 of the script I can see the code that returns
means and vars -
[means, vars] = glm_means_and_vars (linear_terms, dist_type, var_power,
link_type, link_power);

Can you give me some idea of how, from here, I can calculate the predicted
value of the label using a given probability threshold?

Regards,
Sourav

On Tue, Dec 8, 2015 at 12:49 PM, Shirish Tatikonda <
shirish.tatikonda@gmail.com> wrote:

> Hi Sourav,
>
> Yes, GLM-predict.dml gives out only the probabilities. You can put a
> threshold on the resulting probabilities to get the actual class labels --
> for example, prob > 0.5 is positive and prob <= 0.5 is negative.
>
> The exact value of the threshold typically depends on the data and the
> application. Different thresholds yield different classifiers with
> different performance (precision, recall, etc.). You can find the best
> threshold for a given data set by finding a value that gives the desired
> classifier performance (for example, a threshold that gives roughly equal
> precision and recall). Such an optimization is typically done during the
> training phase using a held-out set.
>
> If you wish, you can also modify the DML script to perform this entire
> process.
>
> Shirish
>
>
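The thresholding idea in Shirish's reply can be sketched outside SystemML. The following is a minimal, hypothetical Python illustration, not code from GLM-predict.dml: the function names, the sample probabilities, the held-out labels, and the candidate thresholds are all made up for the example. It applies a parameterized threshold (prob > threshold is positive, otherwise negative) and then scans a few candidate thresholds on held-out data for the one where precision and recall are closest.

```python
def apply_threshold(probs, threshold=0.5):
    """Label 1 (positive) when prob > threshold, else 0 (negative)."""
    return [1 if p > threshold else 0 for p in probs]


def precision_recall(labels, preds):
    """Compute (precision, recall) for binary 0/1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def balanced_threshold(probs, labels, candidates):
    """Return the candidate whose precision and recall differ the least."""
    def gap(t):
        precision, recall = precision_recall(labels, apply_threshold(probs, t))
        return abs(precision - recall)
    return min(candidates, key=gap)


# Hypothetical held-out data: predicted probabilities and true labels.
held_out_probs = [0.9, 0.8, 0.55, 0.4, 0.3, 0.1]
held_out_labels = [1, 1, 0, 1, 0, 0]

best = balanced_threshold(held_out_probs, held_out_labels, [0.2, 0.35, 0.5, 0.7])
predicted = apply_threshold(held_out_probs, best)
```

Equal precision and recall is only one possible selection criterion; any other performance target could replace the `gap` function.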

Re: Using GLM-predict

Posted by Sourav Mazumder <so...@gmail.com>.

Thanks a lot Niketan.

It worked. Finally I could create something end to end.

A couple of suggestions -

1. If the ID handling could somehow be made transparent to end users/data
scientists (handled by SystemML internally), it would be very helpful
for them.
2. API documentation for MLContext is needed soon so that people can
understand the nuances of passing parameters to, and getting output
from, a DML script.

Regards,
Sourav



On Thu, Dec 10, 2015 at 10:47 AM, Niketan Pansare <np...@us.ibm.com>
wrote:

> Hi Sourav,
>
> >> The first thing I noticed is that in the target folder there are no
> >> .tar files for the distribution (like
> >> system-ml-0.9.0-SNAPSHOT-distrib.tar.gz). This was created previously
> >> when I downloaded the previous version from GitHub.
> We added Maven profiles in this commit:
> https://github.com/apache/incubator-systemml/commit/3cfb0fb0ada7e6556a74500b33b53508c0309751
> Please see the email thread regarding this change:
> https://www.mail-archive.com/dev@systemml.incubator.apache.org/msg00059.html
>
> >> But with that I started getting problems with the package name. I
> >> could finally run things after changing the package structure to
> >> org.apache.sysml. Please update the documentation accordingly.
> The package renaming was done in this commit:
> https://github.com/apache/incubator-systemml/commit/276d9257c08e667bc70ce49024c6450deb473b43
> This was discussed in the email thread:
> https://www.mail-archive.com/dev%40systemml.incubator.apache.org/msg00049.html
> The documentation was updated in this commit:
> https://github.com/apache/incubator-systemml/commit/7cd7dc2be83ea73c700b2bebe50e4f37bd275974
> If we have missed anything, please let us know.
>
> Please feel free to reply back to the above email threads with
> suggestions/criticism.
>
> >> However, when I tried running GLM-predict after adding a new column as
> >> ID, GLM-predict started failing.
> One possible reason for the error is that you added "ID" to the
> DataFrame but did not inform SystemML that ID was inserted. To do that,
> please replace ml.registerInput("X", predDfIn) with
> ml.registerInput("X", predDfIn, true).
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
>
> From: Sourav Mazumder <so...@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/10/2015 08:00 AM
> Subject: Re: Using GLM-predict
> ------------------------------
>
>
>
> Hi Niketan,
>
> Thanks for the explanation.
>
> While trying out the new build from GitHub I'm facing an issue.
>
> I downloaded the zip from GitHub and rebuilt the package using 'mvn clean
> package'.
>
> The first thing I noticed is that in the target folder there are no .tar
> files for the distribution (like system-ml-0.9.0-SNAPSHOT-distrib.tar.gz).
> This was created previously when I downloaded the previous version from
> GitHub. However, I tried system-ml-0.9.0-SNAPSHOT.jar. But with that I
> started getting problems with the package name. I could finally run things
> after changing the package structure to org.apache.sysml. Please update the
> documentation accordingly.
>
> However, when I tried running GLM-predict after adding a new column as ID,
> GLM-predict started failing.
>
> Here is the code I'm executing -
>
> val beta = outputs.getBinaryBlockedRDD("beta_out")
> val betaMC = outputs.getMatrixCharacteristics("beta_out")
>
> val Xin = sqlContext.sql("select Res_Area, Bldg_Area, Lot_Area, Bldg_Age from modeldf")
>
> val predDfIn = RDDConverterUtils.addIDToDataFrame(Xin, sqlContext, "ID")
>
> val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ")
> ml.registerInput("X", predDfIn)
> ml.registerInput("B_full", beta, betaMC)
> ml.registerOutput("means")
>
> val outputsPredict =
> ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> cmdLineParamsPredict)
>
> The error is -
>
> org.apache.sysml.runtime.DMLRuntimeException:
> org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in
> program block generated from statement block between lines 122 and 123 --
> Error evaluating instruction:
> CP°rangeReIndex°B_full·MATRIX·DOUBLE°1·SCALAR·INT·true°5·SCALAR·INT·true°1·SCALAR·INT·true°1·SCALAR·INT·true°_mVar10563·MATRIX·DOUBLE
>   at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:153)
>   at org.apache.sysml.api.MLContext.executeUsingSimplifiedCompilationChain(MLContext.java:1337)
>   at org.apache.sysml.api.MLContext.compileAndExecuteScript(MLContext.java:1203)
>   at org.apache.sysml.api.MLContext.compileAndExecuteScript(MLContext.java:1149)
>   at org.apache.sysml.api.MLContext.execute(MLContext.java:631)
>   at org.apache.sysml.api.MLContext.execute(MLContext.java:666)
>   at org.apache.sysml.api.MLContext.execute(MLContext.java:679)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:50)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:52)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:54)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:56)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:58)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:60)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:62)
>   at $iwC$$iwC$$iwC.<init>(<console>:64)
>   at $iwC$$iwC.<init>(<console>:66)
>   at $iwC.<init>(<console>:68)
>   at <init>(<console>:70)
>   at .<init>(<console>:74)
>   at .<clinit>(<console>)
>   at .<init>(<console>:7)
>   at .<clinit>(<console>)
>   at $print(<console>)
>
> Regards,
> Sourav
>
> On Wed, Dec 9, 2015 at 9:56 PM, Niketan Pansare <np...@us.ibm.com>
> wrote:
>
> > Hi Sourav,
> >
> > There are two possible options here:
> > 1. If "unique_id" is a one-based integer column: in this case, please
> > rename the "unique_id" column to ID and use the registerInput("X", DF1,
> > true) method.
> >
> > 2. If "unique_id" is anything else (for example, a String), then there is
> > no trivial way for SystemML to correlate a string-based unique id with the
> > row index (which is required to interpret a DataFrame as a matrix). This
> > means you have to explicitly add the column ID to DF1:
> > val dataset = RDDConverterUtilsExt.addIDToDataFrame(DF1, sqlContext, "ID")
> >
> > When you get DF5 from GLM-predict.dml, you can use the following two lines
> > of code, which guarantee the correct mapping:
> > // Note: DF5 already contains a column ID that specifies the row index.
> > val DF5 = outNew.getDF(sqlContext, "outPred").withColumnRenamed("C1", "prediction")
> > val output = dataset.join(DF5, dataset.col("ID").equalTo(DF5.col("ID")))
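The ID-based join described above can be illustrated without Spark. This is a minimal, hypothetical Python sketch in which plain lists of dicts stand in for DataFrames; the helper names and the row values are made up for the example. Rows receive a one-based ID in their original order, the predictions come back in a different order but carry the same ID, and the join restores the pairing.

```python
def add_id(rows):
    """Attach a one-based ID to each row, mirroring the role of the ID column."""
    return [{"ID": i, **row} for i, row in enumerate(rows, start=1)]


def join_on_id(left, right):
    """Inner join of two lists of dicts on the shared "ID" key."""
    right_by_id = {r["ID"]: r for r in right}
    return [{**l, **right_by_id[l["ID"]]} for l in left if l["ID"] in right_by_id]


# Hypothetical input rows keyed by a string unique_id.
df1 = [{"unique_id": "a17"}, {"unique_id": "b42"}, {"unique_id": "c03"}]
dataset = add_id(df1)

# Predictions may arrive in any order; only the ID ties them to the input rows.
df5 = [
    {"ID": 3, "prediction": 30.5},
    {"ID": 1, "prediction": 10.5},
    {"ID": 2, "prediction": 20.5},
]

output = join_on_id(dataset, df5)
by_unique_id = {row["unique_id"]: row["prediction"] for row in output}
```

Because the pairing depends only on the ID value, it does not matter how the rows of either side are partitioned or ordered.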
> >
> > Note: once a DataFrame is passed to SystemML via registerInput, SystemML
> > first converts the DataFrame into binary block format (i.e.
> > JavaPairRDD<MatrixIndexes, MatrixBlock>) and executes GLM-predict.dml
> > using the binary blocks. After execution, the output is present in MLOutput
> > (https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/MLOutput.java#L89)
> > in binary block format. If the user chooses to, he/she may call
> > getDF(...), which does the binary block to DataFrame conversion.
> >
> > For DataFrame to binary block conversion, see
> > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L277
> > ... ordering specified by zipWithIndex (which is also used by
> > RDDConverterUtilsExt.addIDToDataFrame)
> > For binary block to DataFrame conversion, see
> > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L364
> > ... ordering specified by the internal binary block format, and hence we
> > append an extra column ID to specify this ordering.
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> >
> > From: Sourav Mazumder <so...@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 12/09/2015 06:20 PM
> > Subject: Re: Using GLM-predict
> > ------------------------------
> >
> >
> >
> > Hi Niketan,
> >
> > Thanks again for such a detailed explanation. I see your last point and
> > am in agreement with it. I also got your point on the use of "means" for
> > Gaussian vs. other distributions.
> >
> > However, I'm still not convinced about the approach you mentioned for
> > correlating the unique id. I've already tried code similar to what you
> > sent, where I used the VectorAssembler utility of Spark MLlib.
> >
> > Let me try to explain the problem in more detail -
> >
> > 1. Say my original data frame DF1 is distributed across 3 slave nodes in a
> > Spark cluster. Each has, say, 20 rows, for a total of 60 rows. DF1 also
> > has a unique identifier column, say unique_id.
> > 2. Now I use your code to create the feature vector from DF1 and pass it
> > to GLM-predict. GLM-predict in turn returns another data frame (say DF5)
> > of "means" (in this case, say, predictions). However, the rows of DF5 may
> > be distributed across 4 slave nodes, each having, say, 15 rows, for a
> > total of 60 rows.
> > 3. Now if I just add this new data frame (DF5) as two additional columns
> > to DF1, where is the guarantee that for a specific unique_id of DF1 I'm
> > getting the right mean/predicted value corresponding to that unique_id?
> >
> > Regards,
> > Sourav
> >
> >
> >
> > On Wed, Dec 9, 2015 at 4:14 PM, Niketan Pansare <np...@us.ibm.com>
> > wrote:
> >
> > > Hi Sourav,
> > >
> > > Please see my comments below:
> > >
> > > >> I was basically hoping for some sort of API where one can pass the
> > > >> original data frame and, from that data frame, specify the columns to
> > > >> be used as features and the column to be used as the label. This
> > > >> model can work well for both creating the model and getting the
> > > >> prediction.
> > > Please use the most recent jar from git. To extract X and Y from your
> > > DataFrame without IDs, use the following code:
> > > import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> > > val features = Array("lat", "height", "precipitation", "pressure")
> > > // SystemML will set the dimensions for you if they are unknown
> > > val Xmc = new MatrixCharacteristics()
> > > val Ymc = new MatrixCharacteristics()
> > > val X = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Xmc, features)
> > > val Y = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Ymc, Array("temperature"))
> > >
> > > If you want to add a specific ordering to your DataFrame rows (let's say
> > > for prediction ... in most cases it is not required), use the following
> > > method:
> > > import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> > > df = RDDConverterUtilsExt.addIDToDataFrame(df, sqlContext, "ID")
> > >
> > > >> 1. Yes, dependent variables are nothing but labels
> > > >> 2. The values of the dependent variable are not 1 to
> > > >> totalNumOfClasses. The values can be any double number. For example,
> > > >> say in a weather data set you have fields like lat, long, height
> > > >> (from sea level), precipitation, pressure, temperature. Now one way
> > > >> you can create a model is with temperature as the dependent variable
> > > >> and the others as features (the hypothesis is that temperature is
> > > >> some function of pressure, precipitation, height, latitude and
> > > >> longitude).
> > > Sorry, in this case, please ignore my earlier suggestion of
> > > "Prediction = rowIndexMax(Prob)", as it applies only to classification.
> > > In your case, the returned values are the "means" of the distribution
> > > family that was used (see
> > > http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models).
> > > If the Gaussian distribution was used (dfam=1, vpow=0.0), the problem
> > > was linear, and you expected a pointy-hat distribution (i.e. positive
> > > kurtosis), then you can simply return the mean as the predicted label.
> > > This is because, in the case of the Gaussian distribution, the mean is
> > > also the mode. In other cases, this might not necessarily be true.
> > >
> > > You may ask why we are making it so complicated, and why not just return
> > > the predicted labels instead of probabilities?
> > > Well, the problem of labelling is not as simple as it appears, and it
> > > depends heavily on the problem setting. Let's consider the problem of
> > > multi-class classification and my earlier suggestion "Prediction =
> > > rowIndexMax(Prob)". Let the labels be {cancer, sore throat, birth
> > > defect, fever, normal}. Suppose that for a given test example
> > > GLM-predict.dml outputs the following probabilities: {cancer: 0.2, sore
> > > throat: 0.15, birth defect: 0.15, fever: 0.2, normal: 0.3}. Then,
> > > according to "Prediction = rowIndexMax(Prob)", we should output the
> > > label "normal" and send the patient home ... right? No. In this case, a
> > > 20% probability of cancer is just way too high for a doctor to send the
> > > patient home. In this setting, the doctor might say to the data
> > > scientist: based on the prevalence of cancer in the general public and
> > > on my domain knowledge, I suggest that any cancer probability over
> > > "threshold" should always be flagged as cancer. Otherwise, output the
> > > label with the highest probability. Using this suggestion, the data
> > > scientist modifies the DML as follows:
> > > zeroOneMat = ppred(prob[,cancerColID], threshold, ">")
> > > prediction = zeroOneMat*cancerColID + (1-zeroOneMat)*rowIndexMax(prob)
> > >
> > > This also shows the usefulness of "Declarative Machine Learning" :)
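The override rule in the DML above can be traced by hand with a small sketch. The following is a hypothetical Python rendering for a single row, not SystemML code: the function names and the two threshold values (0.1 and 0.25) are assumptions chosen for illustration, while the labels and probabilities are the ones from the example in the message.

```python
def row_index_max(prob_row):
    """One-based index of the largest entry (first one wins on ties)."""
    return max(range(len(prob_row)), key=lambda j: prob_row[j]) + 1


def predict_with_override(prob_row, override_col, threshold):
    """Return override_col if its probability exceeds threshold, else the argmax."""
    zero_one = 1 if prob_row[override_col - 1] > threshold else 0
    return zero_one * override_col + (1 - zero_one) * row_index_max(prob_row)


labels = ["cancer", "sore throat", "birth defect", "fever", "normal"]
prob = [0.2, 0.15, 0.15, 0.2, 0.3]
cancer_col_id = 1  # one-based column of the "cancer" label

# Plain argmax picks the most probable label.
plain = labels[row_index_max(prob) - 1]

# With a (hypothetical) threshold of 0.1, the 0.2 cancer probability is flagged.
flagged = labels[predict_with_override(prob, cancer_col_id, 0.1) - 1]

# With a (hypothetical) threshold of 0.25, the rule falls back to the argmax.
unflagged = labels[predict_with_override(prob, cancer_col_id, 0.25) - 1]
```

The arithmetic form `zero_one * override_col + (1 - zero_one) * argmax` mirrors the DML, where the same expression is applied to whole matrices at once.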
> > >
> > > Thanks,
> > >
> > > Niketan Pansare
> > > IBM Almaden Research Center
> > > E-mail: npansar At us.ibm.com
> > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > >
> > >
> > > From: Sourav Mazumder <so...@gmail.com>
> > > To: dev@systemml.incubator.apache.org
> > > Date: 12/09/2015 01:15 PM
> > > Subject: Re: Using GLM-predict
> > > ------------------------------
> > >
> > >
> > >
> > > Hi Niketan,
> > >
> > > Firstly, to answer your Qs -
> > >
> > > 1. Yes, dependent variables are nothing but labels
> > > 2. The values of the dependent variable are not 1 to totalNumOfClasses.
> > > The values can be any double number. For example, say in a weather data
> > > set you have fields like lat, long, height (from sea level),
> > > precipitation, pressure, temperature. Now one way you can create a model
> > > is with temperature as the dependent variable and the others as features
> > > (the hypothesis is that temperature is some function of pressure,
> > > precipitation, height, latitude and longitude).
> > >
> > > I'm not sure about the correlation between step 2 and step 3 in your
> > > mail. In step 3, does one have to pass the 'ID' column (created in step
> > > 2) as varName while calling registerInput(String varName, DataFrame df,
> > > containsID)?
> > >
> > > However, the unique id in a typical case can be a string. Can't that be
> > > used as is instead? Otherwise one has to first convert the original
> > > unique id to an integer to create an additional unique id column, and
> > > later that integer unique id has to be mapped back.
> > >
> > > I was basically hoping for some sort of API where one can pass the
> > > original data frame and, from that data frame, specify the columns to be
> > > used as features and the column to be used as the label. This model can
> > > work well for both creating the model and getting the prediction.
> > >
> > > Regards,
> > > Sourav
> > >
> > > On Wed, Dec 9, 2015 at 12:53 PM, Niketan Pansare <np...@us.ibm.com>
> > > wrote:
> > >
> > > > Hi Sourav,
> > > >
> > > > A couple of questions to make sure we are on the same page: does the
> > > > "dependent variable (double)" represent the class labels? Are the
> > > > values of the class labels from 1 to numClasses (i.e. one-based)?
> > > >
> > > > Here are a few comments regarding correlating IDs:
> > > >
> > > > To represent an unordered collection (i.e. a DataFrame) as an ordered
> > > > collection (a "Matrix"), we add a special column "ID" which represents
> > > > the one-based row index. Please perform the following steps:
> > > > 1. Accept recent changes from
> > > > https://github.com/apache/incubator-systemml and use the generated jar.
> > > >
> > > > 2. Map the unique id in DF1 to an int (1 to number of rows) and call
> > > > that column 'ID'.
> > > >
> > > > 3. Use this variant of registerInput for both X (both for training and
> > > > predicting) and Y:
> > > > registerInput(String varName, DataFrame df, boolean containsID)
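Step 2 in the list above (mapping an arbitrary unique id to a one-based integer ID), together with the reverse lookup needed afterwards, can be sketched as follows. This is a hypothetical Python illustration with made-up ids and prediction values; it is not a SystemML or Spark API, and the function name is an assumption.

```python
def build_id_mapping(unique_ids):
    """Assign one-based integer IDs to unique ids in their given order."""
    to_int = {uid: i for i, uid in enumerate(unique_ids, start=1)}
    to_uid = {i: uid for uid, i in to_int.items()}
    return to_int, to_uid


# Hypothetical string identifiers from the original data frame.
unique_ids = ["station-x", "station-y", "station-z"]
to_int, to_uid = build_id_mapping(unique_ids)

# Pretend the prediction step returned (ID, value) pairs in arbitrary order.
predictions = [(2, 18.0), (3, 21.5), (1, 15.25)]

# Map each integer ID back to the original string identifier.
predicted_by_uid = {to_uid[i]: value for i, value in predictions}
```

Keeping both directions of the mapping avoids any dependence on row order when the results come back.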
> > > >
> > > > As a side note: instead of separate double columns, you can represent
> > > > them using VectorUDT and use our converter "JavaPairRDD<MatrixIndexes,
> > > > MatrixBlock> vectorDataFrameToBinaryBlock(JavaSparkContext sc,
> > > > DataFrame inputDF, MatrixCharacteristics mcOut, boolean containsID,
> > > > String vectorColumnName)"
> > > >
> > > > Thanks,
> > > >
> > > > Niketan Pansare
> > > > IBM Almaden Research Center
> > > > E-mail: npansar At us.ibm.com
> > > >
> > > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > >
> > > >
> > > > From: Sourav Mazumder <so...@gmail.com>
> > > > To: dev@systemml.incubator.apache.org
> > > > Date: 12/09/2015 11:15 AM
> > > > Subject: Re: Using GLM-predict
> > > > ------------------------------
> > > >
> > > >
> > > >
> > > > Hi Niketan,
> > > >
> > > > The code you provided works fine. The use of getMatrixCharacteristics
> > > > solves the basic execution problem.
> > > >
> > > > However, question #3 is probably not yet resolved. Let me explain the
> > > > use case scenario I'm trying to build.
> > > >
> > > > 1. Say I have a data frame (DF1) with a unique id (string), a bunch of
> > > > columns (say 4) which are to be used as features (double), and a
> > > > column for the dependent variable (double).
> > > > 2. When I created the model, I created a data frame (DF2) from DF1
> > > > using only the feature columns and passed that as X. The column with
> > > > the dependent value is passed as Y.
> > > > 3. For calling GLM-predict I'm using another data frame (DF3) of the
> > > > same structure but with different unique ids (essentially different
> > > > records/rows). From that data frame I first create another data frame
> > > > (DF4) containing only the columns representing the features. Then I
> > > > send DF4 to GLM-predict.
> > > > 4. The response I get from GLM-predict is the 'means'. Then I use the
> > > > inline predict script, which returns another data frame (DF5) with ID
> > > > and predicted values.
> > > >
> > > > The question is: how do I correlate the ID I'm getting from DF5 with
> > > > the unique id of the data frame DF3?
> > > >
> > > > Regards,
> > > > Sourav
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare <np...@us.ibm.com>
> > > > wrote:
> > > >
> > > > > Hi Sourav,
> > > > >
> > > > > >> 1. In GLM-predict.dml I could see 'means' is the output variable.
> > > > > >> In my understanding it is the same as the probability matrix you
> > > > > >> mentioned in your mail (to be used to compute the prediction). Am
> > > > > >> I right?
> > > > > Yes, that's correct.
> > > > >
> > > > > >> 2. From GLM.dml I get the 'betas' as output using
> > > > > >> outputs.getBinaryBlockedRDD("beta_out"). The same I pass to
> > > > > >> GLM-predict.dml as B.
> > > > >
> > > > > Can you try this?
> > > > > // Get output from GLM
> > > > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > > > // This way you don't have to worry about dimensions.
> > > > > val betaMC = outputs.getMatrixCharacteristics("beta_out")
> > > > > // -----------------------------------------
> > > > > val Xin = ... // DataFrame/RDD of values (or even text/csv file) you want to predict
> > > > > // -----------------------------------------
> > > > > // Execute GLM-predict
> > > > > ml.reset()
> > > > > // Please read
> > > > > // https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> > > > > // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
> > > > > // "dfam" below: which family of distribution?
> > > > > val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...")
> > > > > ml.registerInput("X", Xin)
> > > > > ml.registerInput("B_full", beta, betaMC)
> > > > > ml.registerOutput("means")
> > > > > val outputsPredict = ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml", cmdLineParamsPredict)
> > > > > val prob = outputsPredict.getBinaryBlockedRDD("means")
> > > > > val probMC = outputsPredict.getMatrixCharacteristics("means")
> > > > > // -----------------------------------------
> > > > > // Get predicted label
> > > > > ml.reset()
> > > > > ml.registerInput("Prob", prob, probMC)
> > > > > ml.registerOutput("Prediction")
> > > > > val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
> > > > > + "Prediction = rowIndexMax(Prob); "
> > > > > + "write(Prediction, \"tempOut\", \"csv\")")
> > > > > val pred = outputsLabels.getDF(sqlContext, "Prediction").withColumnRenamed("C1", "prediction")
> > > > > // -----------------------------------------
> > > > >
> > > > >
> > > > > >> 3. Say I get back the prediction matrix as an output (from
> > > > > >> predictions = rowIndexMax(means);). Can I then add it as a column
> > > > > >> to my original data frame (the one from which I created the
> > > > > >> feature vector for the original model)? My concern is whether
> > > > > >> adding it back will preserve the right order, so that the key for
> > > > > >> the feature vector and the predicted value stay aligned. If not,
> > > > > >> how can I achieve that?
> > > > > In the above example, 'pred' is a DataFrame with a column 'ID' which
> > > > > provides the row ID.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Niketan Pansare
> > > > > IBM Almaden Research Center
> > > > > E-mail: npansar At us.ibm.com
> > > > >
> > > > > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > > >
> > > > >
> > > > > From: Sourav Mazumder <so...@gmail.com>
> > > > > > To: dev@systemml.incubator.apache.org, Niketan Pansare/Almaden/IBM@IBMUS
> > > > > Date: 12/08/2015 10:53 PM
> > > > > Subject: Re: Using GLM-predict
> > > > > ------------------------------
> > > > >
> > > > >
> > > > >
> > > > > Hi Niketan,
> > > > >
> > > > > Thanks again for the detailed inputs.
> > > > >
> > > > > > Some more follow-up Qs -
> > > > > >
> > > > > > 1. In GLM-predict.dml I could see 'means' is the output variable.
> > > > > > In my understanding it is the same as the probability matrix you
> > > > > > mentioned in your mail (to be used to compute the prediction). Am I
> > > > > > right?
> > > > > >
> > > > > > 2. From GLM.dml I get the 'betas' as output using
> > > > > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to
> > > > > > GLM-predict.dml as B. For registering B the following statements
> > > > > > are used:
> > > > > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > > > > // I have four features, so I get 4 coefficients
> > > > > > ml.registerInput("B", beta, 1, 4)
> > > > > >
> > > > > > However, when I execute GLM-predict.dml I get the following error.
> > > > > >
> > > > > > val outputs = ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml", cmdLineParams)
> > > > > >
> > > > > > 15/12/09 05:32:47 WARN Expression: Metadata file:  .mtd not provided
> > > > > > 15/12/09 05:32:47 ERROR Expression: ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd
> > > > > > com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd
> > > > > >
> > > > > > In line 117 we have the following statement: X = read (fileX);
> > > > > >
> > > > > > 3. Say I get back the prediction matrix as an output (from
> > > > > > predictions = rowIndexMax(means);). Can I then add it as a column to
> > > > > > my original data frame (the one from which I created the feature
> > > > > > vector for the original model)? My concern is whether adding it back
> > > > > > will preserve the right order, so that the key for the feature
> > > > > > vector and the predicted value stay aligned. If not, how can I
> > > > > > achieve that?
> > > > >
> > > > > Regards,
> > > > > Sourav
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <npansar@us.ibm.com>
> > > > > > wrote:
> > > > >
> > > > > > Hi Sourav,
> > > > > >
> > > > > > > For some reason, I didn't get your email from "Tue, 08 Dec 2015
> > > > > > > 12:56:38 -0800"
> > > > > > > <https://www.mail-archive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208>
> > > > > > > (which I noticed in the archive).
> > > > > >
> > > > > > > >> Not sure how exactly I can modify the GLM-predict.dml to get
> > > > > > > >> some prediction to start with.
> > > > > > > There are two options here:
> > > > > > > 1. Modify GLM-predict.dml as suggested by Shirish (the better
> > > > > > > approach with respect to the SystemML optimizer), or
> > > > > > >
> > > > > > > 2. Run a new script on the output of GLM-predict. Please see:
> > > > > > > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
> > > > > > > If you choose to go with option 2, you might also want to read the
> > > > > > > documentation of the following two built-in functions:
> > > > > > > a. rowIndexMax (see
> > > > > > > http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions)
> > > > > > > b. ppred
> > > > > > >
> > > > > > > >> Can you give me some idea how from here I can calculate the
> > > > > > > >> predicted value of the label using some value of probability
> > > > > > > >> threshold ?
> > > > > > > A very simple way to predict the label given the probability matrix:
> > > > > > > # Predicts the label with the highest probability.
> > > > > > > # This assumes one-based labels.
> > > > > > > Prediction = rowIndexMax(Prob)
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Niketan Pansare
> > > > > > IBM Almaden Research Center
> > > > > > E-mail: npansar At us.ibm.com
> > > > > >
> > > > > > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > > > >
> > > > > > From: Shirish Tatikonda <sh...@gmail.com>
> > > > > > To: dev@systemml.incubator.apache.org
> > > > > > Date: 12/08/2015 12:49 PM
> > > > > > Subject: Re: Using GLM-predict
> > > > > > ------------------------------
> > > > > >
> > > > > >
> > > > > >
> > > > > > Hi Sourav,
> > > > > >
> > > > > > Yes, GLM-predict.dml gives out only the probabilities. You can
> put
> > a
> > > > > > threshold on the resulting probabilities to get the actual class
> > > labels
> > > > > --
> > > > > > for example, prob > 0.5 is positive and <=0.5 as negative.
> > > > > >
> > > > > > The exact value of threshold typically depends on the data and
> the
> > > > > > application. Different thresholds yield different classifiers
> with
> > > > > > different performance (precision, recall, etc.). You can find the
> > > best
> > > > > > threshold for the given data set by finding a value that gives
> the
> > > > > desired
> > > > > > classifier performance (for example, a threshold that gives
> roughly
> > > > equal
> > > > > > precision and recall). Such an optimization is obviously done
> > during
> > > > the
> > > > > > training phase using a held out test set.
> > > > > >
> > > > > > If you wish, you can also modify the DML script to perform this
> > > entire
> > > > > > process.
> > > > > >
> > > > > > Shirish
> > > > > >
> > > > > >
> > > > > > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> > > > > > sourav.mazumder00@gmail.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I have used GLM.dml to create a model using some sample data.
> It
> > > > > returns
> > > > > > to
> > > > > > > me the matrix of Beta, B.
> > > > > > >
> > > > > > > Now I want to use this matrix of Beta on a new set of data
> points
> > > and
> > > > > > > generate predicted value of the dependent variable/observation.
> > > > > > >
> > > > > > > When I checked GLM-predict, I could see that one can pass
> feature
> > > > > vector
> > > > > > > for the new data set and also the matrix of beta.
> > > > > > >
> > > > > > > But I could not see any way to get the predicted value of the
> > > > dependent
> > > > > > > variable/observation. The output parameter only supports matrix
> > of
> > > > > > > predicted means/probabilities.
> > > > > > >
> > > > > > > Is there a way one can get the predicted value of the dependent
> > > > > > > variable/observation from GLM-predict ?
> > > > > > >
> > > > > > > Regards,
> > > > > > > Sourav
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> >
>
>
>

Re: Using GLM-predict

Posted by Niketan Pansare <np...@us.ibm.com>.

Hi Sourav,

>>  The first thing I noticed is that the target folder contains no .tar files
for the distribution (like system-ml-0.9.0-SNAPSHOT-distrib.tar.gz). This
was created previously when I downloaded the previous version from
GitHub.
We added maven profiles in the commit "
https://github.com/apache/incubator-systemml/commit/3cfb0fb0ada7e6556a74500b33b53508c0309751
". Please see the email thread regarding this change:
https://www.mail-archive.com/dev@systemml.incubator.apache.org/msg00059.html

>> But with that I
started getting problems with the package name. I could finally run things
after changing the package structure to org.apache.sysml. Please update the
documentation accordingly.
The package renaming was done in the commit "
https://github.com/apache/incubator-systemml/commit/276d9257c08e667bc70ce49024c6450deb473b43
". This was discussed in the email thread
https://www.mail-archive.com/dev%40systemml.incubator.apache.org/msg00049.html
. The documentation was updated in the commit
https://github.com/apache/incubator-systemml/commit/7cd7dc2be83ea73c700b2bebe50e4f37bd275974
. If we have missed anything, please let us know.

Please feel free to reply back to the above email threads with
suggestions/criticism.

>> However, when I tried running GLM-predict after adding a new column as ID
the GLM-predict has started failing.
One possible reason for the error is that you added the "ID" column to the
DataFrame but did not tell SystemML that it is there. To do that,
please replace "ml.registerInput("X", predDfIn)" with "ml.registerInput("X",
predDfIn, true)".
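
To illustrate why the containsID flag matters, here is a minimal plain-Scala
sketch. It does not use SystemML or Spark; Row, toMatrix and linearTerms are
hypothetical names made up for this illustration. The assumption is that,
without the flag, the ID column is read as one more feature, so X ends up with
five columns while beta has only four coefficients. That would be consistent
with the rangeReIndex on B_full from 1 to 5 in the stack trace, but it is a
guess at the mechanism, not a statement about SystemML internals.

```scala
object IdColumnSketch {
  // One input row: a one-based row ID followed by the feature values.
  final case class Row(id: Long, features: Vector[Double])

  // Hypothetical stand-in for the DataFrame-to-matrix conversion.
  // containsID = true : the ID only fixes the row position and is dropped.
  // containsID = false: the ID is treated as one more numeric column.
  def toMatrix(rows: Seq[Row], containsID: Boolean): Vector[Vector[Double]] =
    if (containsID) rows.sortBy(_.id).map(_.features).toVector
    else rows.map(r => r.id.toDouble +: r.features).toVector

  // Linear terms X %*% beta; fails fast when the widths disagree.
  def linearTerms(x: Vector[Vector[Double]], beta: Vector[Double]): Vector[Double] = {
    require(x.forall(_.length == beta.length),
      s"X has ${x.head.length} columns but beta has ${beta.length} coefficients")
    x.map(row => row.zip(beta).map { case (a, b) => a * b }.sum)
  }
}
```

With four features per row, toMatrix(rows, containsID = true) yields a
four-column matrix that multiplies cleanly with a four-element beta, while
containsID = false yields five columns and linearTerms rejects it.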

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar



From:	Sourav Mazumder <so...@gmail.com>
To:	dev@systemml.incubator.apache.org
Date:	12/10/2015 08:00 AM
Subject:	Re: Using GLM-predict



Hi Niketan,

Thanks for the explanation.

While trying out the new build from GitHub I'm facing an issue.

I downloaded the zip from GitHub and rebuilt the package using 'mvn clean
package'.

The first thing I noticed is that the target folder contains no .tar files
for the distribution (like system-ml-0.9.0-SNAPSHOT-distrib.tar.gz). This
was created previously when I downloaded the previous version from
GitHub. So I tried system-ml-0.9.0-SNAPSHOT.jar instead. But with that I
started getting problems with the package name. I could finally run things
after changing the package structure to org.apache.sysml. Please update the
documentation accordingly.

However, when I tried running GLM-predict after adding a new column as ID
the GLM-predict has started failing.

Here is the code I'm executing -

val beta = outputs.getBinaryBlockedRDD("beta_out")
val betaMC = outputs.getMatrixCharacteristics("beta_out")

val Xin = sqlContext.sql("select Res_Area, Bldg_Area, Lot_Area, Bldg_Age from modeldf")

val predDfIn = RDDConverterUtils.addIDToDataFrame(Xin, sqlContext, "ID")

val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ")
ml.registerInput("X", predDfIn)
ml.registerInput("B_full", beta, betaMC)
ml.registerOutput("means")

val outputsPredict =
ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
cmdLineParamsPredict)

The error is -

org.apache.sysml.runtime.DMLRuntimeException:
org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in
program block generated from statement block between lines 122 and 123 --
Error evaluating instruction:
CP°rangeReIndex°B_full·MATRIX·DOUBLE°1·SCALAR·INT·true°5·SCALAR·INT·true°
1·SCALAR·INT·true°1·SCALAR·INT·true°_mVar10563·MATRIX·DOUBLE
at
org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:153)
at
org.apache.sysml.api.MLContext.executeUsingSimplifiedCompilationChain
(MLContext.java:1337)
at
org.apache.sysml.api.MLContext.compileAndExecuteScript(MLContext.java:1203)
at
org.apache.sysml.api.MLContext.compileAndExecuteScript(MLContext.java:1149)
at org.apache.sysml.api.MLContext.execute(MLContext.java:631) at
org.apache.sysml.api.MLContext.execute(MLContext.java:666) at
org.apache.sysml.api.MLContext.execute(MLContext.java:679) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:50)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:52) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:54) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:56) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:58) at
$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:60) at
$iwC$$iwC$$iwC$$iwC.<init>(<console>:62) at
$iwC$$iwC$$iwC.<init>(<console>:64) at $iwC$$iwC.<init>(<console>:66) at
$iwC.<init>(<console>:68) at <init>(<console>:70) at .<init>(<console>:74)
at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at
$print(<console>)

Regards,
Sourav

On Wed, Dec 9, 2015 at 9:56 PM, Niketan Pansare <np...@us.ibm.com> wrote:

> Hi Sourav,
>
> There are two possible options here:
> 1. If "unique_id" is one-based integer column: In this case, please
> rename "unique_id" column to ID and use registerInput("X", DF1, true)
> method.
>
> 2. If "unique_id" is anything else (for example: String), then there is
> no trivial way for SystemML to correlate "string-based unique id" to row
> index (which is required to interpret a DataFrame into a matrix). This
> means you have to explicitly add the column ID to DF1:
> val dataset = RDDConverterUtilsExt.addIDToDataFrame(DF1, sqlContext,
> "ID")
>
> When you get DF5 from GLM-predict.dml, you can use the following two lines of
> code, which guarantee the correct mapping:
> // Note: there already is a column ID in DF5 which specifies the row index.
> val DF5 = outNew.getDF(sqlContext, "outPred").withColumnRenamed("C1",
> "prediction")
> val output = dataset.join(DF5, dataset.col("ID").equalTo(DF5.col("ID")))
>
> Note: once a DataFrame is passed to SystemML via registerInput, SystemML
> first converts the DataFrame into binary block format (i.e.
> JavaPairRDD<MatrixIndexes, MatrixBlock>) and executes GLM-predict.dml using
> the binary block. After execution, the output is present in MLOutput (
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/MLOutput.java#L89
> ) in binary block format. If the user chooses to, he/she may call getDF(...),
> which does the binary block to DataFrame conversion.
>
> For DataFrame to binary block conversion, see
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L277
> ... ordering specified by zipWithIndex (which is also used by
> RDDConverterUtilsExt.addIDToDataFrame)
> For binary block to DataFrame conversion, see
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L364
> ... ordering specified by internal binary block format and hence we append
> an extra column ID to specify this ordering.
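
The ID-based mapping described above can be sketched with plain Scala
collections. This is only an illustration under assumptions: Record, addId and
attach are hypothetical names, no Spark or SystemML is involved, and the
predictions are deliberately given out of order to show that the join on ID,
not the row position, is what ties each prediction back to its unique_id.

```scala
object JoinByIdSketch {
  // A source record with a string key and its feature values.
  final case class Record(uniqueId: String, features: Vector[Double])

  // Assign one-based row IDs in the current order (the role zipWithIndex plays).
  def addId(records: Seq[Record]): Seq[(Long, Record)] =
    records.zipWithIndex.map { case (r, i) => (i + 1L, r) }

  // Join predictions back on the numeric ID, never on position.
  def attach(withIds: Seq[(Long, Record)],
             preds: Seq[(Long, Double)]): Map[String, Double] = {
    val byId = preds.toMap
    withIds.flatMap { case (id, r) => byId.get(id).map(p => r.uniqueId -> p) }.toMap
  }
}
```

A string unique_id therefore never needs to reach the matrix: it stays in the
original records and is recovered through the numeric ID after prediction.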
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> From: Sourav Mazumder <so...@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/09/2015 06:20 PM
> Subject: Re: Using GLM-predict
> ------------------------------
>
>
>
> Hi Niketan,
>
> Thanks again for such a detailed explanation. I see your last point and
in
> agreement with the same. Also I got your point on the use of "means" for
> gaussian vs other distributions.
>
> However, I'm still not convinced about the approach you mentioned for
> correlating the unique id. I've already tried a code similar to what you
> sent where I've used the vectorAssembler utility of Spark ML LIb.
>
> Let me try to explain the problem with more details -
>
> 1. Say my original data frame DF1 is distributed in 3 slave nodes in a
> Spark cluster. Each has say 20 rows. Total 60 rows. The DF1 also has a
> unique identifier column say unique_id.
> 2. Now I used your code to create the feature vector from DF1 and pass it
> to GLM-predict. And GLM-predict in turn returns me another data frame
(say
> DF5) of "means" (in this case say prediction). However, the rows of DF5
may
> be distributed in 4 slave nodes each having say 15 rows. Total 60 rows.
> 3. Now if I just add this new data frame (DF5) as additional two columns
to
> DF1 where is the guarantee that for a specific unique_id of DF1 I'm
getting
> right mean/predicted value corresponding to unique_id ?
>
> Regards,
> Sourav
>
>
>
> On Wed, Dec 9, 2015 at 4:14 PM, Niketan Pansare <np...@us.ibm.com>
> wrote:
>
> > Hi Sourav,
> >
> > Please see below comments:
> >
> > >> I was basically hoping for some sort of API where one can pass the
> > original
> > data frame and from that dataframe can specify the columns to be used
as
> > feature and the column to be used for label. This model can work well
for
> > both creating the model and getting the prediction.
> > Please use the most recent jar from git. To extract X and Y from your
> > dataframe without IDs, use the following code:
> > import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> > val features = Array("lat", "height", "precipitation", "pressure")
> > // SystemML will set the dimensions for you if they are unknown
> > val Xmc = new MatrixCharacteristics()
> > val Ymc = new MatrixCharacteristics()
> > val X = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Xmc, features)
> > val Y = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Ymc,
> > Array("temperature"))
> >
> > If you want to add a specific ordering to your DataFrame rows (let's say for
> > prediction ... in most cases it is not required), use the following method:
> > import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> > df = RDDConverterUtilsExt.addIDToDataFrame(df, sqlContext, "ID")
> >
> > >> 1. Yes dependent variables are nothing but labels
> > 2. The values of the dependent variable are not 1 to totalNumOfClasses.
> The
> > values can be any double number. For example say in a weather data set
> you
> > have fields like lat, long, height (from sea level), precipitation,
> > pressure, temperature. Now one way you can create a model where
> Temperature
> > is the dependent variable and other are features (the hypothesis is
> > Temperature is some function of pressure, precipitation, height,
latitude
> > and longitude.
> > Sorry, in this case, please ignore my earlier suggestion of "Prediction =
> > rowIndexMax(Prob)" as it applies only to classification.
> > In your case, the returned values are "means" of the distribution family
> > which was used (See
> > http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
> > ).
> > If Gaussian distribution was used (dfam=1, vpow=0.0), and if the problem
> > was linear and if you expected pointy-hat distribution (i.e. positive
> > kurtosis), then you can simply return the mean as predicted label. This is
> > because in case of Gaussian distribution, mean is also the mode. In other
> > cases, it might not necessarily be true.
> >
> > You may ask why we are making it so complicated and why not just return
> > the predicted labels instead of probabilities?
> > Well, the problem of labelling is not as simple as it appears and it
> > highly depends on the problem setting. Let's consider the problem of
> > multi-class classification and my earlier suggestion "Prediction =
> > rowIndexMax(Prob)". Also, let the labels be as follows = {cancer, sore
> > throat, birth defect, fever, normal}. For a given test example, let's
> > say GLM-predict.dml outputs the following probabilities = {cancer: 0.2, sore
> > throat: 0.15, birth defect: 0.15, fever: 0.2, normal: 0.3}. Then according
> > to "Prediction = rowIndexMax(Prob)", we should output the label "normal"
> > and send the patient home ... right? No. In this case, a 20% probability of
> > cancer is just way too high for a doctor to send the patient home. In this
> > setting, the doctor might then say to the data scientist: based on the
> > prevalence of cancer in the general public and on my domain knowledge, I
> > suggest that a cancer probability over "threshold" should always be
> > flagged as cancer. Else output the label with the highest probability. Using
> > this suggestion, the data scientist modifies the DML as follows:
> > zeroOneMat = ppred(prob[, cancerColID], threshold, ">")
> > prediction = zeroOneMat*cancerColID + (1-zeroOneMat)*rowIndexMax(prob)
> >
> > This also shows the usefulness of "Declarative Machine Learning" :)
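
For readers more comfortable with Scala than DML, the same per-row rule from
the two DML lines above can be sketched as follows. This is a plain-Scala
illustration, not SystemML code: ThresholdRuleSketch, predict and flaggedCol
are hypothetical names, rowIndexMax here is a local helper that only mimics the
DML built-in for a single row, and resolving ties to the first maximum is an
assumption of this sketch.

```scala
object ThresholdRuleSketch {
  // One-based index of the largest entry in a row of probabilities.
  // Ties resolve to the first maximum in this sketch.
  def rowIndexMax(probs: Vector[Double]): Int = probs.indexOf(probs.max) + 1

  // If the flagged class exceeds the threshold, predict it;
  // otherwise fall back to the most probable class.
  def predict(probs: Vector[Double], flaggedCol: Int, threshold: Double): Int = {
    val flagged = if (probs(flaggedCol - 1) > threshold) 1 else 0
    flagged * flaggedCol + (1 - flagged) * rowIndexMax(probs)
  }
}
```

With the probabilities from the example (cancer in column 1 at 0.2, normal in
column 5 at 0.3), a threshold of 0.1 gives label 1 (cancer), whereas a
threshold of 0.5 falls back to label 5 (normal).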
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> > From: Sourav Mazumder <so...@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 12/09/2015 01:15 PM
> > Subject: Re: Using GLM-predict
> > ------------------------------
> >
> >
> >
> > Hi Niketan,
> >
> > Firstly to answer your Qs -
> >
> > 1. Yes dependent variables are nothing but labels
> > 2. The values of the dependent variable are not 1 to totalNumOfClasses.
> The
> > values can be any double number. For example say in a weather data set
> you
> > have fields like lat, long, height (from sea level), precipitation,
> > pressure, temperature. Now one way you can create a model where
> Temperature
> > is the dependent variable and other are features (the hypothesis is
> > Temperature is some function of pressure, precipitation, height,
latitude
> > and longitude.
> >
> > Not sure about the correlation between step 2 and step 3 in your mail.
In
> > step 3 does one have to pass 'ID' column (created in step 2) to varName
> > while calling registerInput(String varName, DataFrame df, containsID) ?
> >
> > However the unique Id in typical case can be string. Can't that be used
> as
> > is instead ? This means one has to first convert the original unique id
> to
> > integer to create an additional unique id column and then again later
on
> > that integer unique id has to mapped back.
> >
> > I was basically hoping for some sort of API where one can pass the
> original
> > data frame and from that dataframe can specify the columns to be used
as
> > feature and the column to be used for label. This model can work well
for
> > both creating the model and getting the prediction.
> >
> > Regards,
> > Sourav
> >
> > On Wed, Dec 9, 2015 at 12:53 PM, Niketan Pansare <np...@us.ibm.com>
> > wrote:
> >
> > > Hi Sourav,
> > >
> > > Couple of questions to make sure we are on the same page: does the
> > > "dependent variable (double)" represent the class labels? Are the values of
> > > the class labels from 1 to numClasses (i.e., one-based)?
> > >
> > > Here are few comments regarding correlating IDs:
> > >
> > > To represent an unordered collection (i.e. DataFrame) to an ordered
> > > collection ("Matrix"), we add special column "ID" which represents
> > *one-based
> > > row index*. Please perform following steps:
> > > 1. Accept recent changes from
> > https://github.com/apache/incubator-systemml
> > > and use the generated jar.
> > >
> > > 2. Map the unique id in DF1 to int (*1 to number of rows*) and call
> that
> > > column 'ID'.
> > >
> > > 3. Use the variant of registerInput for both X (both for training and
> > > predicting) and Y:
> > > registerInput(String varName, DataFrame df, boolean containsID)
> > >
> > > As a side note: instead of separate double columns, you can represent them
> > > using VectorUDT and use our converter "JavaPairRDD<MatrixIndexes,
> > > MatrixBlock> vectorDataFrameToBinaryBlock(JavaSparkContext sc, DataFrame
> > > inputDF, MatrixCharacteristics mcOut, boolean containsID, String
> > > vectorColumnName)"
> > >
> > > Thanks,
> > >
> > > Niketan Pansare
> > > IBM Almaden Research Center
> > > E-mail: npansar At us.ibm.com
> > >
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > >
> > > From: Sourav Mazumder <so...@gmail.com>
> > > To: dev@systemml.incubator.apache.org
> > > Date: 12/09/2015 11:15 AM
> > > Subject: Re: Using GLM-predict
> > > ------------------------------
> > >
> > >
> > >
> > > Hi Niketan,
> > >
> > > The code you provided works fine. The use of getMatrixCharacteristics
> > > solves the basic execution problem.
> > >
> > > However, question #3 is probably not yet resolved. Let me explain the
> > > use case scenario I'm trying to build.
> > >
> > > 1. Say I have a data frame (DF1) with a Unique Id (string), a bunch
of
> > > columns (say 4) which are to be used as features (double), and a
column
> > for
> > > the dependent variable (double).
> > > 2. When I created the model I created a data frame (DF2) from DF1
using
> > > only the feature vectors and pass that as X. And the column with
> > dependent
> > > value is passed as Y.
> > > 3. For calling the GLM-predict I'm using another data frame (DF3) of
> same
> > > structure but with different Unique ID (essentially different
> > > records/rows). From that data frame I'm first creating another data
> frame
> > > (DF4) containing the columns representing the features. Then I'm
> sending
> > > DF4 to GLM-predict which has only feature vectors.
> > > 4. The response I get from GLM-predict is the 'means'. Then I'm using the
> > > inline predict script which returns another data frame (DF5) with ID and
> > > Predicted values.
> > >
> > > The question is how do I correlate the ID I'm getting from DF5 with
the
> > > Unique ID of the data frame DF3 ?
> > >
> > > Regards,
> > > Sourav
> > >
> > >
> > >
> > >
> > > On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare <np...@us.ibm.com>
> > > wrote:
> > >
> > > > Hi Sourav,
> > > >
> > > > 1. In the GLM-predict.dml I could see 'means' is the output
variable.
> > In
> > > my
> > > > understanding it is same as the probability matrix u have mentioned
> in
> > > your
> > > > mail (to be used to compute the prediction). Am I right ?
> > > > Yes, that's correct.
> > > >
> > > > 2. From GLM.dml I get the 'betas' as output using
> > > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to
> > > GLM-predict.dml
> > > > as B.
> > > >
> > > > Can you try this ?
> > > > // Get output from GLM
> > > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > > val betaMC = outputs.getMatrixCharacteristics("beta_out") // This
way
> > you
> > > > don't have to worry about dimensions.
> > > > // -----------------------------------------
> > > > val Xin = DataFrame/RDD of values (or even text/csv file) you want
to
> > > > predict
> > > > // -----------------------------------------
> > > > // Execute GLM-predict
> > > > ml.reset()
> > > > // Please read
> > > >
> > >
> >
>
https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml

> > > > // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
> > > > // family of distribution?
> > > > val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...")
> > > > ml.registerInput("X", Xin)
> > > > ml.registerInput("B_full", beta, betaMC)
> > > > ml.registerOutput("means")
> > > > val outputsPredict =
> > > > ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > > > cmdLineParamsPredict)
> > > > val prob = outputsPredict.getBinaryBlockedRDD("means")
> > > > val probMC = outputsPredict.getMatrixCharacteristics("means")
> > > > // -----------------------------------------
> > > > // Get predicted label
> > > > ml.reset()
> > > > ml.registerInput("Prob", prob, probMC)
> > > > ml.registerOutput("Prediction")
> > > > val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
> > > > + "Prediction = rowIndexMax(Prob); "
> > > > + "write(Prediction, \"tempOut\", \"csv\")")
> > > > val pred = outputsLabels.getDF(sqlContext,
> > > > "Prediction").withColumnRenamed("C1", "prediction")
> > > > // -----------------------------------------
> > > >
> > > >
> > > > 3. Say I get back prediction matrix as an output (from predictions =
> > > > rowIndexMax(means);). Now can I add that as a column to my original
> > > > data frame (the one from which I created the feature vector for the
> > > > original model) ? My concern is whether adding it back will ensure the
> > > > right order so that the key for the feature vector and the predicted
> > > > value remain the same ? If not, how to achieve that ?
> > > > In the above example 'pred' is a DataFrame with column 'ID' which provides
> > > > the row ID.
> > > >
> > > > Thanks,
> > > >
> > > > Niketan Pansare
> > > > IBM Almaden Research Center
> > > > E-mail: npansar At us.ibm.com
> > > >
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > >
> > > > From: Sourav Mazumder <so...@gmail.com>
> > > > To: dev@systemml.incubator.apache.org, Niketan
> > Pansare/Almaden/IBM@IBMUS
> > > > Date: 12/08/2015 10:53 PM
> > > > Subject: Re: Using GLM-predict
> > > > ------------------------------
> > > >
> > > >
> > > >
> > > > Hi Niketan,
> > > >
> > > > Thanks again for the detailed inputs.
> > > >
> > > > Some more follow up Qs -
> > > >
> > > > 1. In the GLM-predict.dml I could see 'means' is the output
variable.
> > In
> > > my
> > > > understanding it is same as the probability matrix u have mentioned
> in
> > > your
> > > > mail (to be used to compute the prediction). Am I right ?
> > > >
> > > > 2. From GLM.dml I get the 'betas' as output using
> > > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to
> > > GLM-predict.dml
> > > > as B. For registering B following statements are used
> > > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > > ml.registerInput("B", beta, 1, 4) // I have four feature vectors so I get 4 coefficients
> > > >
> > > > However, when I execute GLM-predict.dml I get following error.
> > > >
> > > > val outputs =
> > > >
> ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > > > cmdLineParams)
> > > >
> > > > 15/12/09 05:32:47 WARN Expression: Metadata file:  .mtd not provided
> > > > 15/12/09 05:32:47 ERROR Expression: ERROR:
> > > > /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd
> > > > com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR:
> > > > /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd
> > > >
> > > > In line 117 we have following statement : X = read (fileX);
> > > >
> > > > 3. Say I get back prediction matrix as an output (from predictions =
> > > > rowIndexMax(means);). Now can I add that as a column to my original
> > > > data frame (the one from which I created the feature vector for the
> > > > original model) ? My concern is whether adding it back will ensure the
> > > > right order so that the key for the feature vector and the predicted
> > > > value remain the same ? If not, how to achieve that ?
> > > >
> > > > Regards,
> > > > Sourav
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare
<np...@us.ibm.com>
> > > > wrote:
> > > >
> > > > > Hi Sourav,
> > > > >
> > > > > For some reason, I didn't get your email of "Tue, 08 Dec 2015 12:56:38
> > > > > -0800" (
> > > > > https://www.mail-archive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208
> > > > > ), which I noticed in the archive.
> > > > >
> > > > > >> Not sure how exactly I can modify the GLM-predict.dml to get
> some
> > > > > prediction to start with.
> > > > > There are two options here:
> > > > > 1. Modify GLM-predict.dml as suggested by Shirish (better approach with
> > > > > respect to the SystemML optimizer) or
> > > > >
> > > > > 2. Run a new script on the output of GLM-predict. Please see:
> > > > > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
> > > > > If you choose to go with option 2, you might also want to read the
> > > > > documentation of the following two built-in functions:
> > > > > a. rowIndexMax (See
> > > > > http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions
> > > > > )
> > > > > b. ppred
> > > > >
> > > > > >> Can you give me some idea how from here I can calculate the
> > > predicted
> > > > > value of the label using some value of probability threshold ?
> > > > > Very simple way to predict the label given probability matrix:
> > > > > Prediction = rowIndexMax(Prob) # predicts the label with highest
> > > > > probability. This assumes one-based labels.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Niketan Pansare
> > > > > IBM Almaden Research Center
> > > > > E-mail: npansar At us.ibm.com
> > > > >
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > > >
> > > > > From: Shirish Tatikonda <sh...@gmail.com>
> > > > > To: dev@systemml.incubator.apache.org
> > > > > Date: 12/08/2015 12:49 PM
> > > > > Subject: Re: Using GLM-predict
> > > > > ------------------------------
> > > > >
> > > > >
> > > > >
> > > > > Hi Sourav,
> > > > >
> > > > > Yes, GLM-predict.dml gives out only the probabilities. You can
put
> a
> > > > > threshold on the resulting probabilities to get the actual class
> > labels
> > > > --
> > > > > for example, prob > 0.5 is positive and <=0.5 as negative.
> > > > >
> > > > > The exact value of threshold typically depends on the data and
the
> > > > > application. Different thresholds yield different classifiers
with
> > > > > different performance (precision, recall, etc.). You can find the
> > best
> > > > > threshold for the given data set by finding a value that gives
the
> > > > desired
> > > > > classifier performance (for example, a threshold that gives
roughly
> > > equal
> > > > > precision and recall). Such an optimization is obviously done
> during
> > > the
> > > > > training phase using a held out test set.
> > > > >
> > > > > If you wish, you can also modify the DML script to perform this
> > entire
> > > > > process.
> > > > >
> > > > > Shirish
> > > > >
> > > > >
> > > > > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> > > > > sourav.mazumder00@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have used GLM.dml to create a model using some sample data.
It
> > > > returns
> > > > > to
> > > > > > me the matrix of Beta, B.
> > > > > >
> > > > > > Now I want to use this matrix of Beta on a new set of data
points
> > and
> > > > > > generate predicted value of the dependent variable/observation.
> > > > > >
> > > > > > When I checked GLM-predict, I could see that one can pass
feature
> > > > vector
> > > > > > for the new data set and also the matrix of beta.
> > > > > >
> > > > > > But I could not see any way to get the predicted value of the
> > > dependent
> > > > > > variable/observation. The output parameter only supports matrix
> of
> > > > > > predicted means/probabilities.
> > > > > >
> > > > > > Is there a way one can get the predicted value of the dependent
> > > > > > variable/observation from GLM-predict ?
> > > > > >
> > > > > > Regards,
> > > > > > Sourav
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> >
>
>
>
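
The threshold selection described in the quoted reply above (choose the cut-off at which precision and recall are roughly equal on a held-out set) can be sketched in plain Scala. This is only an illustration under stated assumptions: the names ThresholdSketch, classify, precisionRecall and balancedThreshold are hypothetical, the candidate grid is supplied by the caller, and nothing here is SystemML or GLM-predict.dml API.

```scala
object ThresholdSketch {
  // Label 1 (positive) when the probability exceeds the threshold, else 0.
  def classify(probs: Seq[Double], threshold: Double): Seq[Int] =
    probs.map(p => if (p > threshold) 1 else 0)

  // Precision and recall of 0/1 predictions against 0/1 ground truth.
  // Both are reported as 0.0 when their denominator is zero.
  def precisionRecall(pred: Seq[Int], truth: Seq[Int]): (Double, Double) = {
    val pairs = pred.zip(truth)
    val tp = pairs.count { case (p, t) => p == 1 && t == 1 }
    val fp = pairs.count { case (p, t) => p == 1 && t == 0 }
    val fn = pairs.count { case (p, t) => p == 0 && t == 1 }
    val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    val recall = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    (precision, recall)
  }

  // Pick the candidate threshold whose precision and recall are closest.
  // Thresholds that yield no true positives get the worst possible score,
  // so the degenerate "predict nothing" case cannot win.
  // Ties go to the first candidate; candidates must be non-empty.
  def balancedThreshold(probs: Seq[Double], truth: Seq[Int],
                        candidates: Seq[Double]): Double = {
    val scored = candidates.map { t =>
      val (p, r) = precisionRecall(classify(probs, t), truth)
      val gap = if (p == 0.0 && r == 0.0) Double.MaxValue else math.abs(p - r)
      (t, gap)
    }
    scored.reduceLeft((a, b) => if (b._2 < a._2) b else a)._1
  }
}
```

In practice probs would be the "means" column returned by GLM-predict for a held-out set and truth the known labels for the same rows; the chosen threshold is then applied to new data with classify.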

Re: Using GLM-predict

Posted by Sourav Mazumder <so...@gmail.com>.

Hi Niketan,

Thanks for the explanation.

While trying out the new build from GitHub I'm facing an issue.

I downloaded the zip from GitHub and rebuilt the package using 'mvn clean
package'.

The first thing I noticed is that the target folder contains no .tar file
for the distribution (like system-ml-0.9.0-SNAPSHOT-distrib.tar.gz). This
was created previously when I downloaded the previous version from
GitHub. So I tried system-ml-0.9.0-SNAPSHOT.jar instead, but with that I
started getting problems with the package name. I could finally run things
after changing the package structure to org.apache.sysml. Please update the
documentation accordingly.

However, when I tried running GLM-predict after adding a new column as ID,
GLM-predict started failing.

Here is the code I'm executing -

val beta = outputs.getBinaryBlockedRDD("beta_out")
val betaMC = outputs.getMatrixCharacteristics("beta_out")

val Xin = sqlContext.sql("select Res_Area, Bldg_Area, Lot_Area, Bldg_Age from modeldf")

val predDfIn = RDDConverterUtils.addIDToDataFrame(Xin, sqlContext, "ID")

val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ")
ml.registerInput("X", predDfIn)
ml.registerInput("B_full", beta, betaMC)
ml.registerOutput("means")

val outputsPredict = ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml", cmdLineParamsPredict)

The error is -

org.apache.sysml.runtime.DMLRuntimeException: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 122 and 123 -- Error evaluating instruction: CP°rangeReIndex°B_full·MATRIX·DOUBLE°1·SCALAR·INT·true°5·SCALAR·INT·true°1·SCALAR·INT·true°1·SCALAR·INT·true°_mVar10563·MATRIX·DOUBLE
    at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:153)
    at org.apache.sysml.api.MLContext.executeUsingSimplifiedCompilationChain(MLContext.java:1337)
    at org.apache.sysml.api.MLContext.compileAndExecuteScript(MLContext.java:1203)
    at org.apache.sysml.api.MLContext.compileAndExecuteScript(MLContext.java:1149)
    at org.apache.sysml.api.MLContext.execute(MLContext.java:631)
    at org.apache.sysml.api.MLContext.execute(MLContext.java:666)
    at org.apache.sysml.api.MLContext.execute(MLContext.java:679)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:50)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:52)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:54)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:56)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:58)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:60)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:62)
    at $iwC$$iwC$$iwC.<init>(<console>:64)
    at $iwC$$iwC.<init>(<console>:66)
    at $iwC.<init>(<console>:68)
    at <init>(<console>:70)
    at .<init>(<console>:74)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)

Regards,
Sourav

On Wed, Dec 9, 2015 at 9:56 PM, Niketan Pansare <np...@us.ibm.com> wrote:

> Hi Sourav,
>
> There are two possible options here:
> 1. If "unique_id" is a one-based integer column: in this case, please
> rename the "unique_id" column to ID and use the
> registerInput("X", DF1, true) method.
>
> 2. If "unique_id" is anything else (for example: String), then there is
> no trivial way for SystemML to correlate a "string-based unique id" to a
> row index (which is required to interpret a DataFrame as a matrix). This
> means you have to explicitly add the column ID to DF1:
> val dataset = RDDConverterUtilsExt.addIDToDataFrame(DF1, sqlContext, "ID")
>
> When you get DF5 from GLM-predict.dml, you can use the following two lines
> of code, which guarantee the correct mapping:
> // Note: DF5 already contains a column ID which specifies the row index.
> val DF5 = outNew.getDF(sqlContext, "outPred").withColumnRenamed("C1", "prediction")
> val output = dataset.join(DF5, dataset.col("ID").equalTo(DF5.col("ID")))
>
> Note: once DataFrame is passed to SystemML via registerInput, SystemML
> first converts the DataFrame into binary block (i.e.
> JavaPairRDD<MatrixIndexes, MatrixBlock>) and executes GLM-predict.dml using
> the binary block. After execution, the output is present in MLOutput (
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/MLOutput.java#L89)
> in binary block format. If the user chooses to, he/she may call getDF(...),
> which does the binary block to DataFrame conversion.
>
> For DataFrame to binary block conversion, see
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L277
> ... ordering specified by zipWithIndex (which is also used by
> RDDConverterUtilsExt.addIDToDataFrame)
> For binary block to DataFrame conversion, see
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L364
> ... ordering specified by internal binary block format and hence we append
> an extra column ID to specify this ordering.
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
>
> From: Sourav Mazumder <so...@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/09/2015 06:20 PM
> Subject: Re: Using GLM-predict
> ------------------------------
>
>
>
> Hi Niketan,
>
> Thanks again for such a detailed explanation. I see your last point and am
> in agreement with it. Also I got your point on the use of "means" for
> Gaussian vs other distributions.
>
> However, I'm still not convinced about the approach you mentioned for
> correlating the unique id. I've already tried code similar to what you
> sent, where I used the VectorAssembler utility of Spark MLlib.
>
> Let me try to explain the problem with more details -
>
> 1. Say my original data frame DF1 is distributed in 3 slave nodes in a
> Spark cluster. Each has say 20 rows. Total 60 rows. The DF1 also has a
> unique identifier column say unique_id.
> 2. Now I used your code to create the feature vector from DF1 and pass it
> to GLM-predict. And GLM-predict in turn returns me another data frame (say
> DF5) of "means" (in this case say prediction). However, the rows of DF5 may
> be distributed in 4 slave nodes each having say 15 rows. Total 60 rows.
> 3. Now if I just add this new data frame (DF5) as additional two columns to
> DF1 where is the guarantee that for a specific unique_id of DF1 I'm getting
> right mean/predicted value corresponding to unique_id ?
>
> Regards,
> Sourav
>
>
>
> On Wed, Dec 9, 2015 at 4:14 PM, Niketan Pansare <np...@us.ibm.com>
> wrote:
>
> > Hi Sourav,
> >
> > Please see below comments:
> >
> > >> I was basically hoping for some sort of API where one can pass the
> > original data frame and from that dataframe can specify the columns to be
> > used as feature and the column to be used for label. This model can work
> > well for both creating the model and getting the prediction.
> > Please use the most recent jar from git. To extract X and Y from your
> > dataframe without IDs, use following code:
> > import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> > val features = Array("lat", "height", "precipitation", "pressure")
> > // SystemML will set them for you if the dimensions are unknown
> > val Xmc = new MatrixCharacteristics()
> > val Ymc = new MatrixCharacteristics()
> > val X = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Xmc, features)
> > val Y = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Ymc, Array("temperature"))
> >
> > If you want to add specific ordering to your DataFrame rows (let's say for
> > prediction ... in most cases it is not required), use following method:
> > import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> > df = RDDConverterUtilsExt.addIDToDataFrame(df, sqlContext, "ID")
> >
> > >> 1. Yes dependent variables are nothing but labels
> > 2. The values of the dependent variable are not 1 to totalNumOfClasses.
> > The values can be any double number. For example say in a weather data set
> > you have fields like lat, long, height (from sea level), precipitation,
> > pressure, temperature. Now one way you can create a model where
> > Temperature is the dependent variable and other are features (the
> > hypothesis is Temperature is some function of pressure, precipitation,
> > height, latitude and longitude).
> > Sorry, in this case, please ignore my earlier suggestion of "Prediction =
> > rowIndexMax(Prob)" as it applies only to classification.
> > In your case, the returned values are "means" of the distribution family
> > which was used (See
> > http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
> > ).
> > If Gaussian distribution was used (dfam=1, vpow=0.0), and if the problem
> > was linear and if you expected pointy-hat distribution (i.e. positive
> > kurtosis), then you can simply return the mean as predicted label. This
> > is because in case of Gaussian distribution, mean is also the mode. In
> > other case, it might not necessarily be true.
> >
> > You may ask why are we making it so complicated and why not just return
> > the predicted labels instead of probability ?
> > Well, the problem of labelling is not as simple as it appears and it
> > highly depends on the problem setting. Let's consider the problem of
> > multi-class classification and my earlier suggestion "Prediction =
> > rowIndexMax(Prob)". Also, let the labels be as follows = {cancer, sore
> > throat, birth defect, fever, normal}. If for a given test example, let's
> > say GLM-predict.dml outputs following probability = {cancer: 0.2, sore
> > throat: 0.15, birth defect: 0.15, fever: 0.2, normal:0.3}. Then according
> > to "Prediction = rowIndexMax(Prob)", we should output the label "normal"
> > and send the patient home ... right ? No. In this case, 20% probability
> > of cancer is just way too high for a doctor to send the patient home. In
> > this setting, the doctor might then say to the data scientist: based on
> > the prevalence of cancer in the general public and on my domain
> > knowledge, I suggest that a cancer probability over "threshold" should
> > always be flagged as cancer. Else output the label with highest
> > probability. Using this suggestion, the data scientist modifies the DML
> > as follows:
> > zeroOneMat = ppred(prob[, cancerColID], threshold, ">")
> > prediction = zeroOneMat*cancerColID + (1-zeroOneMat)*rowIndexMax(prob)
> >
> > This also shows the usefulness of "Declarative Machine Learning" :)
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> >
> > From: Sourav Mazumder <so...@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 12/09/2015 01:15 PM
> > Subject: Re: Using GLM-predict
> > ------------------------------
> >
> >
> >
> > Hi Niketan,
> >
> > Firstly to answer your Qs -
> >
> > 1. Yes dependent variables are nothing but labels
> > 2. The values of the dependent variable are not 1 to totalNumOfClasses.
> > The values can be any double number. For example say in a weather data set
> > you have fields like lat, long, height (from sea level), precipitation,
> > pressure, temperature. Now one way you can create a model where
> > Temperature is the dependent variable and other are features (the
> > hypothesis is Temperature is some function of pressure, precipitation,
> > height, latitude and longitude).
> >
> > Not sure about the correlation between step 2 and step 3 in your mail. In
> > step 3 does one have to pass 'ID' column (created in step 2) to varName
> > while calling registerInput(String varName, DataFrame df, containsID) ?
> >
> > However the unique Id in typical case can be string. Can't that be used
> > as is instead ? This means one has to first convert the original unique
> > id to integer to create an additional unique id column and then again
> > later on that integer unique id has to be mapped back.
> >
> > I was basically hoping for some sort of API where one can pass the
> > original data frame and from that dataframe can specify the columns to be
> > used as feature and the column to be used for label. This model can work
> > well for both creating the model and getting the prediction.
> >
> > Regards,
> > Sourav
> >
> > On Wed, Dec 9, 2015 at 12:53 PM, Niketan Pansare <np...@us.ibm.com>
> > wrote:
> >
> > > Hi Sourav,
> > >
> > > A couple of questions to make sure we are on the same page: does the
> > > "dependent variable (double)" represent the class labels? Are the values
> > > of the class labels from 1 to numClasses (i.e. one-based)?
> > >
> > > Here are a few comments regarding correlating IDs:
> > >
> > > To map an unordered collection (i.e. a DataFrame) to an ordered
> > > collection ("Matrix"), we add a special column "ID" which represents the
> > > *one-based row index*. Please perform the following steps:
> > > 1. Accept recent changes from https://github.com/apache/incubator-systemml
> > > and use the generated jar.
> > >
> > > 2. Map the unique id in DF1 to int (*1 to number of rows*) and call that
> > > column 'ID'.
> > >
> > > 3. Use the variant of registerInput for both X (both for training and
> > > predicting) and Y:
> > > registerInput(String varName, DataFrame df, boolean containsID)
> > >
> > > As a side note: instead of separate double columns, you can represent
> > > them using VectorUDT and use our converter "JavaPairRDD<MatrixIndexes,
> > > MatrixBlock> vectorDataFrameToBinaryBlock(JavaSparkContext sc, DataFrame
> > > inputDF, MatrixCharacteristics mcOut, boolean containsID, String
> > > vectorColumnName)"
> > >
> > > Thanks,
> > >
> > > Niketan Pansare
> > > IBM Almaden Research Center
> > > E-mail: npansar At us.ibm.com
> > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > >
> > >
> > > From: Sourav Mazumder <so...@gmail.com>
> > > To: dev@systemml.incubator.apache.org
> > > Date: 12/09/2015 11:15 AM
> > > Subject: Re: Using GLM-predict
> > > ------------------------------
> > >
> > >
> > >
> > > Hi Niketan,
> > >
> > > The code you provided works fine. The use of getMatrixCharacteristics
> > > solves the basic execution problem.
> > >
> > > However, question #3 is probably not yet resolved. Let me explain the
> > > use case scenario I'm trying to build.
> > >
> > > 1. Say I have a data frame (DF1) with a Unique Id (string), a bunch of
> > > columns (say 4) which are to be used as features (double), and a column
> > > for the dependent variable (double).
> > > 2. When I created the model I created a data frame (DF2) from DF1 using
> > > only the feature vectors and passed that as X. The column with the
> > > dependent value is passed as Y.
> > > 3. For calling GLM-predict I'm using another data frame (DF3) of the same
> > > structure but with different Unique IDs (essentially different
> > > records/rows). From that data frame I'm first creating another data frame
> > > (DF4) containing the columns representing the features. Then I'm sending
> > > DF4, which has only feature vectors, to GLM-predict.
> > > 4. The response I get from GLM-predict is the 'means'. Then I'm using the
> > > inline predict script which returns another data frame (DF5) with ID and
> > > Predicted values.
> > >
> > > The question is how do I correlate the ID I'm getting from DF5 with the
> > > Unique ID of the data frame DF3 ?
> > >
> > > Regards,
> > > Sourav
> > >
> > >
> > >
> > >
> > > On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare <np...@us.ibm.com>
> > > wrote:
> > >
> > > > Hi Sourav,
> > > >
> > > > 1. In the GLM-predict.dml I could see 'means' is the output variable.
> > > > In my understanding it is same as the probability matrix you have
> > > > mentioned in your mail (to be used to compute the prediction). Am I
> > > > right ?
> > > > Yes, that's correct.
> > > >
> > > > 2. From GLM.dml I get the 'betas' as output using
> > > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to
> > > > GLM-predict.dml as B.
> > > >
> > > > Can you try this ?
> > > > // Get output from GLM
> > > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > > // This way you don't have to worry about dimensions.
> > > > val betaMC = outputs.getMatrixCharacteristics("beta_out")
> > > > // -----------------------------------------
> > > > // val Xin = DataFrame/RDD of values (or even text/csv file) you want to predict
> > > > // -----------------------------------------
> > > > // Execute GLM-predict
> > > > ml.reset()
> > > > // Please read
> > > > // https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> > > > // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
> > > > // "dfam": family of distribution ?
> > > > val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...")
> > > > ml.registerInput("X", Xin)
> > > > ml.registerInput("B_full", beta, betaMC)
> > > > ml.registerOutput("means")
> > > > val outputsPredict = ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml", cmdLineParamsPredict)
> > > > val prob = outputsPredict.getBinaryBlockedRDD("means")
> > > > val probMC = outputsPredict.getMatrixCharacteristics("means")
> > > > // -----------------------------------------
> > > > // Get predicted label
> > > > ml.reset()
> > > > ml.registerInput("Prob", prob, probMC)
> > > > ml.registerOutput("Prediction")
> > > > val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
> > > > + "Prediction = rowIndexMax(Prob); "
> > > > + "write(Prediction, \"tempOut\", \"csv\")")
> > > > val pred = outputsLabels.getDF(sqlContext, "Prediction").withColumnRenamed("C1", "prediction")
> > > > // -----------------------------------------
> > > >
> > > >
> > > > 3. Say I get back prediction matrix as an output (from predictions =
> > > > rowIndexMax(means);). Now can I add that as a column to my original
> > > > data frame (the one from which I created the feature vector for the
> > > > original model) ? My concern is whether adding back will ensure the
> > > > right order so that the key for the feature vector and the predicted
> > > > value remain same ? If not how to achieve the same ?
> > > > In above example 'pred' is a DataFrame with column 'ID' which provides
> > > > the row ID.
> > > >
> > > > Thanks,
> > > >
> > > > Niketan Pansare
> > > > IBM Almaden Research Center
> > > > E-mail: npansar At us.ibm.com
> > > >
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > >
> > > >
> > > > From: Sourav Mazumder <so...@gmail.com>
> > > > To: dev@systemml.incubator.apache.org, Niketan
> > Pansare/Almaden/IBM@IBMUS
> > > > Date: 12/08/2015 10:53 PM
> > > > Subject: Re: Using GLM-predict
> > > > ------------------------------
> > > >
> > > >
> > > >
> > > > Hi Niketan,
> > > >
> > > > Thanks again for the detailed inputs.
> > > >
> > > > Some more follow up Qs -
> > > >
> > > > 1. In the GLM-predict.dml I could see 'means' is the output variable.
> > > > In my understanding it is same as the probability matrix you have
> > > > mentioned in your mail (to be used to compute the prediction). Am I
> > > > right ?
> > > >
> > > > 2. From GLM.dml I get the 'betas' as output using
> > > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to
> > > > GLM-predict.dml as B. For registering B following statements are used
> > > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > > // I have four feature vectors so I get 4 coefficients
> > > > ml.registerInput("B", beta, 1, 4)
> > > >
> > > > However, when I execute GLM-predict.dml I get following error.
> > > >
> > > > val outputs = ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml", cmdLineParams)
> > > >
> > > > 15/12/09 05:32:47 WARN Expression: Metadata file:  .mtd not provided
> > > > 15/12/09 05:32:47 ERROR Expression: ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd
> > > > com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd
> > > >
> > > > In line 117 we have following statement : X = read (fileX);
> > > >
> > > > 3. Say I get back prediction matrix as an output (from predictions =
> > > > rowIndexMax(means);). Now can I add that as a column to my original
> > > > data frame (the one from which I created the feature vector for the
> > > > original model) ? My concern is whether adding back will ensure the
> > > > right order so that the key for the feature vector and the predicted
> > > > value remain same ? If not how to achieve the same ?
> > > >
> > > > Regards,
> > > > Sourav
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <np...@us.ibm.com>
> > > > wrote:
> > > >
> > > > > Hi Sourav,
> > > > >
> > > > > For some reason, I didn't get your email on "*Tue, 08 Dec 2015 12:56:38 -0800*
> > > > > <https://www.mail-archive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208>"
> > > > > (which I noticed in the archive).
> > > > >
> > > > > >> Not sure how exactly I can modify the GLM-predict.dml to get some
> > > > > prediction to start with.
> > > > > There are two options here:
> > > > > 1. Modify GLM-predict.dml as suggested by Shirish (better approach
> > > > > with respect to the SystemML optimizer) or
> > > > >
> > > > > 2. Run a new script on the output of GLM-predict. Please see:
> > > > > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
> > > > > If you choose to go with option 2, you might also want to read the
> > > > > documentation of the following two built-in functions:
> > > > > a. rowIndexMax (See
> > > > > http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions
> > > > > )
> > > > > b. ppred
> > > > >
> > > > > >> Can you give me some idea how from here I can calculate the
> > > > > predicted value of the label using some value of probability threshold ?
> > > > > Very simple way to predict the label given probability matrix:
> > > > > # Predicts the label with highest probability. This assumes one-based labels.
> > > > > Prediction = rowIndexMax(Prob)
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Niketan Pansare
> > > > > IBM Almaden Research Center
> > > > > E-mail: npansar At us.ibm.com
> > > > >
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > > >
> > > > >
> > > > > From: Shirish Tatikonda <sh...@gmail.com>
> > > > > To: dev@systemml.incubator.apache.org
> > > > > Date: 12/08/2015 12:49 PM
> > > > > Subject: Re: Using GLM-predict
> > > > > ------------------------------
> > > > >
> > > > >
> > > > >
> > > > > Hi Sourav,
> > > > >
> > > > > Yes, GLM-predict.dml gives out only the probabilities. You can put a
> > > > > threshold on the resulting probabilities to get the actual class
> > > > > labels -- for example, prob > 0.5 is positive and <=0.5 as negative.
> > > > >
> > > > > The exact value of threshold typically depends on the data and the
> > > > > application. Different thresholds yield different classifiers with
> > > > > different performance (precision, recall, etc.). You can find the
> > > > > best threshold for the given data set by finding a value that gives
> > > > > the desired classifier performance (for example, a threshold that
> > > > > gives roughly equal precision and recall). Such an optimization is
> > > > > obviously done during the training phase using a held out test set.
> > > > >
> > > > > If you wish, you can also modify the DML script to perform this
> > > > > entire process.
> > > > >
> > > > > Shirish
> > > > >
> > > > >
> > > > > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> > > > > sourav.mazumder00@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have used GLM.dml to create a model using some sample data. It
> > > > > > returns to me the matrix of Beta, B.
> > > > > >
> > > > > > Now I want to use this matrix of Beta on a new set of data points
> > > > > > and generate predicted value of the dependent variable/observation.
> > > > > >
> > > > > > When I checked GLM-predict, I could see that one can pass feature
> > > > > > vector for the new data set and also the matrix of beta.
> > > > > >
> > > > > > But I could not see any way to get the predicted value of the
> > > > > > dependent variable/observation. The output parameter only supports
> > > > > > matrix of predicted means/probabilities.
> > > > > >
> > > > > > Is there a way one can get the predicted value of the dependent
> > > > > > variable/observation from GLM-predict ?
> > > > > >
> > > > > > Regards,
> > > > > > Sourav
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> >
>
>
>
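
The labelling rule discussed in the quoted thread above (take the most probable class, except that one designated class is always flagged once its probability exceeds a threshold) can be illustrated per row in plain Scala. This is a sketch under assumptions: LabelOverrideSketch, rowIndexMax and predict are hypothetical helper names that only mimic the one-based semantics described for the DML built-in, ties are resolved towards the first maximum here, and the threshold values in the usage note are made up.

```scala
object LabelOverrideSketch {
  // One-based position of the largest entry in a row of class probabilities.
  // Ties are resolved in favour of the first maximum (an assumption of this sketch).
  def rowIndexMax(row: Seq[Double]): Int =
    row.zipWithIndex.reduceLeft((a, b) => if (b._1 > a._1) b else a)._2 + 1

  // If the probability in the one-based column flaggedCol exceeds the
  // threshold, return flaggedCol; otherwise return the most probable label.
  def predict(row: Seq[Double], flaggedCol: Int, threshold: Double): Int =
    if (row(flaggedCol - 1) > threshold) flaggedCol else rowIndexMax(row)
}
```

With the probabilities from the thread, Seq(0.2, 0.15, 0.15, 0.2, 0.3) for {cancer, sore throat, birth defect, fever, normal}, rowIndexMax alone picks label 5 ("normal"), while predict with flaggedCol = 1 and an illustrative threshold of 0.1 returns label 1 ("cancer").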

Re: Using GLM-predict

Posted by Niketan Pansare <np...@us.ibm.com>.

Hi Sourav,

There are two possible options here:
1. If "unique_id" is a one-based integer column: in this case, please rename
the "unique_id" column to ID and use the registerInput("X", DF1, true) method.

2. If "unique_id" is anything else (for example: String), then there is no
trivial way for SystemML to correlate a "string-based unique id" to a row
index (which is required to interpret a DataFrame as a matrix). This means
you have to explicitly add the column ID to DF1:
val dataset = RDDConverterUtilsExt.addIDToDataFrame(DF1, sqlContext, "ID")

When you get DF5 from GLM-predict.dml, you can use the following two lines
of code, which guarantee the correct mapping:
// Note: DF5 already contains a column ID which specifies the row index.
val DF5 = outNew.getDF(sqlContext, "outPred").withColumnRenamed("C1", "prediction")
val output = dataset.join(DF5, dataset.col("ID").equalTo(DF5.col("ID")))

Note: once DataFrame is passed to SystemML via registerInput, SystemML
first converts the DataFrame into binary block (i.e.
JavaPairRDD<MatrixIndexes, MatrixBlock>) and executes GLM-predict.dml using
the binary block. After execution, the output is present in MLOutput (
https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/MLOutput.java#L89
) in binary block format. If the user chooses to, he/she may call getDF(...),
which does the binary block to DataFrame conversion.

For DataFrame to binary block conversion, see
https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L277
 ... ordering specified by zipWithIndex (which is also used by
RDDConverterUtilsExt.addIDToDataFrame)
For binary block to DataFrame conversion, see
https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L364
 ... ordering specified by internal binary block format and hence we append
an extra column ID to specify this ordering.

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
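
The ID-based correlation described in this message can be illustrated without Spark or SystemML, using ordinary Scala collections. This is a minimal sketch under assumptions: IdJoinSketch, addId and joinById are hypothetical names, addId only imitates the "zipWithIndex plus one" numbering mentioned above, and the string keys and values in the usage note are invented.

```scala
object IdJoinSketch {
  // Attach a one-based row ID to each record in its current order,
  // analogous to numbering rows with zipWithIndex and adding 1.
  def addId[A](rows: Seq[A]): Seq[(Long, A)] =
    rows.zipWithIndex.map { case (row, i) => (i + 1L, row) }

  // Join records and predictions on the ID. The result does not depend on
  // the order (or partitioning) in which the predictions come back.
  def joinById[A, B](records: Seq[(Long, A)], preds: Seq[(Long, B)]): Seq[(A, B)] = {
    val byId = preds.toMap
    records.flatMap { case (id, rec) => byId.get(id).map(p => (rec, p)) }
  }
}
```

For example, records keyed "a-17", "b-42", "c-03" receive IDs 1, 2, 3; predictions arriving as (3, 30.0), (1, 10.0), (2, 20.0) are still matched to the right string keys, because the join uses the ID rather than the row position.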



From:	Sourav Mazumder <so...@gmail.com>
To:	dev@systemml.incubator.apache.org
Date:	12/09/2015 06:20 PM
Subject:	Re: Using GLM-predict



Hi Niketan,

Thanks again for such a detailed explanation. I see your last point and am
in agreement with it. Also I got your point on the use of "means" for
Gaussian vs other distributions.

However, I'm still not convinced about the approach you mentioned for
correlating the unique id. I've already tried code similar to what you
sent, where I used the VectorAssembler utility of Spark MLlib.

Let me try to explain the problem with more details -

1. Say my original data frame DF1 is distributed in 3 slave nodes in a
Spark cluster. Each has say 20 rows. Total 60 rows. The DF1 also has a
unique identifier column say unique_id.
2. Now I used your code to create the feature vector from DF1 and pass it
to GLM-predict. And GLM-predict in turn returns me another data frame (say
DF5) of "means" (in this case say prediction). However, the rows of DF5 may
be distributed in 4 slave nodes each having say 15 rows. Total 60 rows.
3. Now if I just add this new data frame (DF5) as additional two columns to
DF1 where is the guarantee that for a specific unique_id of DF1 I'm getting
right mean/predicted value corresponding to unique_id ?

Regards,
Sourav



On Wed, Dec 9, 2015 at 4:14 PM, Niketan Pansare <np...@us.ibm.com> wrote:

> Hi Sourav,
>
> Please see below comments:
>
> >> I was basically hoping for some sort of API where one can pass the
> original data frame and from that dataframe can specify the columns to be
> used as feature and the column to be used for label. This model can work
> well for both creating the model and getting the prediction.
> Please use the most recent jar from git. To extract X and Y from your
> dataframe without IDs, use following code:
> import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> val features = Array("lat", "height", "precipitation", "pressure")
> // SystemML will set them for you if the dimensions are unknown
> val Xmc = new MatrixCharacteristics()
> val Ymc = new MatrixCharacteristics()
> val X = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Xmc, features)
> val Y = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Ymc, Array("temperature"))
>
> If you want to add specific ordering to your DataFrame rows (let's say for
> prediction ... in most cases it is not required), use following method:
> import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> df = RDDConverterUtilsExt.addIDToDataFrame(df, sqlContext, "ID")
>
> >> 1. Yes dependent variables are nothing but labels
> 2. The values of the dependent variable are not 1 to totalNumOfClasses. The
> values can be any double number. For example say in a weather data set you
> have fields like lat, long, height (from sea level), precipitation,
> pressure, temperature. Now one way you can create a model where Temperature
> is the dependent variable and other are features (the hypothesis is
> Temperature is some function of pressure, precipitation, height, latitude
> and longitude).
> Sorry, in this case, please ignore my earlier suggestion of "Prediction =
> rowIndexMax(Prob)" as it applies only to classification.
> In your case, the returned values are "means" of the distribution family
> which was used (See
> http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models).
> If Gaussian distribution was used (dfam=1, vpow=0.0), and if the problem
> was linear and if you expected pointy-hat distribution (i.e. positive
> kurtosis), then you can simply return the mean as predicted label. This is
> because in case of Gaussian distribution, mean is also the mode. In other
> cases, it might not necessarily be true.
>
> You may ask why are we making it so complicated and why not just return
> the predicted labels instead of probability ?
> Well, the problem of labelling is not as simple as it appears and it
> highly depends on the problem setting. Let's consider the problem of
> multi-class classification and my earlier suggestion "Prediction =
> rowIndexMax(Prob)". Also, let the labels be as follows = {cancer, sore
> throat, birth defect, fever, normal}. If for a given test example, let's
> say GLM-predict.dml outputs following probability = {cancer: 0.2, sore
> throat: 0.15, birth defect: 0.15, fever: 0.2, normal:0.3}. Then according
> to "Prediction = rowIndexMax(Prob)", we should output the label "normal"
> and send the patient home ... right ? No. In this case, 20% probability of
> cancer is just way too high for a doctor to send the patient home. In this
> setting, the doctor might then say to the data scientist: I know that based
> on the prevalence of cancer in general public, and based on that domain
> knowledge, I suggest that probability over "threshold" should always be
> flagged as cancer. Else output the label with highest probability. Using
> this suggestion, the data scientist modifies the DML as follows:
> zeroOneMat = ppred(prob[, cancerColID], threshold, ">")
> prediction = zeroOneMat*cancerColID + (1-zeroOneMat)*rowIndexMax(prob)
>
> This also shows the usefulness of "Declarative Machine Learning" :)
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> From: Sourav Mazumder <so...@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/09/2015 01:15 PM
> Subject: Re: Using GLM-predict
> ------------------------------
>
>
>
> Hi Niketan,
>
> Firstly to answer your Qs -
>
> 1. Yes dependent variables are nothing but labels
> 2. The values of the dependent variable are not 1 to totalNumOfClasses. The
> values can be any double number. For example say in a weather data set you
> have fields like lat, long, height (from sea level), precipitation,
> pressure, temperature. Now one way you can create a model where Temperature
> is the dependent variable and other are features (the hypothesis is
> Temperature is some function of pressure, precipitation, height, latitude
> and longitude).
>
> Not sure about the correlation between step 2 and step 3 in your mail. In
> step 3 does one have to pass 'ID' column (created in step 2) to varName
> while calling registerInput(String varName, DataFrame df, containsID) ?
>
> However the unique Id in typical case can be string. Can't that be used as
> is instead ? This means one has to first convert the original unique id to
> integer to create an additional unique id column and then again later on
> that integer unique id has to be mapped back.
>
> I was basically hoping for some sort of API where one can pass the original
> data frame and from that dataframe can specify the columns to be used as
> feature and the column to be used for label. This model can work well for
> both creating the model and getting the prediction.
>
> Regards,
> Sourav
>
> On Wed, Dec 9, 2015 at 12:53 PM, Niketan Pansare <np...@us.ibm.com>
> wrote:
>
> > Hi Sourav,
> >
> > Couple of questions to make sure we are on same page: does the "dependent
> > variable (double)" represent the class labels ? Are the values of the
> > class labels from 1 to numClasses (i.e. one-based) ?
> >
> > Here are few comments regarding correlating IDs:
> >
> > To map an unordered collection (i.e. DataFrame) to an ordered
> > collection ("Matrix"), we add a special column "ID" which represents the
> > *one-based row index*. Please perform following steps:
> > 1. Accept recent changes from https://github.com/apache/incubator-systemml
> > and use the generated jar.
> >
> > 2. Map the unique id in DF1 to int (*1 to number of rows*) and call that
> > column 'ID'.
> >
> > 3. Use the variant of registerInput for both X (both for training and
> > predicting) and Y:
> > registerInput(String varName, DataFrame df, *boolean* containsID)
> >
> > As a side note: instead of separate double columns, you can represent them
> > using VectorUDT and use our converter "JavaPairRDD<MatrixIndexes,
> > MatrixBlock> vectorDataFrameToBinaryBlock(JavaSparkContext sc, DataFrame
> > inputDF, MatrixCharacteristics mcOut, *boolean* containsID, String
> > vectorColumnName) "
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> > From: Sourav Mazumder <so...@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 12/09/2015 11:15 AM
> > Subject: Re: Using GLM-predict
> > ------------------------------
> >
> >
> >
> > Hi Niketan,
> >
> > The code you provided works fine. The use of getMatrixCharacteristics
> > solves the basic execution problem.
> >
> > However, question #3 is probably not yet resolved. Let me explain the use
> > case scenario I'm trying to build.
> >
> > 1. Say I have a data frame (DF1) with a Unique Id (string), a bunch of
> > columns (say 4) which are to be used as features (double), and a column for
> > the dependent variable (double).
> > 2. When I created the model I created a data frame (DF2) from DF1 using
> > only the feature vectors and pass that as X. And the column with dependent
> > value is passed as Y.
> > 3. For calling the GLM-predict I'm using another data frame (DF3) of same
> > structure but with different Unique ID (essentially different
> > records/rows). From that data frame I'm first creating another data frame
> > (DF4) containing the columns representing the features. Then I'm sending
> > DF4 to GLM-predict which has only feature vectors.
> > 4. The response I get from GLM-predict is the 'means'. Then I'm using the
> > inline predict script which returns another data frame (DF5) with ID and
> > Predicted values.
> >
> > The question is how do I correlate the ID I'm getting from DF5 with the
> > Unique ID of the data frame DF3 ?
> >
> > Regards,
> > Sourav
> >
> >
> >
> >
> > On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare <np...@us.ibm.com>
> > wrote:
> >
> > > Hi Sourav,
> > >
> > > 1. In the GLM-predict.dml I could see 'means' is the output variable.
> In
> > my
> > > understanding it is same as the probability matrix u have mentioned
in
> > your
> > > mail (to be used to compute the prediction). Am I right ?
> > > Yes, that's correct.
> > >
> > > 2. From GLM.dml I get the 'betas' as output using
> > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to
> > GLM-predict.dml
> > > as B.
> > >
> > > Can you try this ?
> > > // Get output from GLM
> > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > // This way you don't have to worry about dimensions.
> > > val betaMC = outputs.getMatrixCharacteristics("beta_out")
> > > // -----------------------------------------
> > > // val Xin = DataFrame/RDD of values (or even text/csv file) you want to
> > > // predict
> > > // -----------------------------------------
> > > // Execute GLM-predict
> > > ml.reset()
> > > // Please read
> > > // https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> > > // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
> > > // family of distribution ?
> > > val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...")
> > > ml.registerInput("X", Xin)
> > > ml.registerInput("B_full", beta, betaMC)
> > > ml.registerOutput("means")
> > > val outputsPredict = ml.execute(
> > >   "/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > >   cmdLineParamsPredict)
> > > val prob = outputsPredict.getBinaryBlockedRDD("means")
> > > val probMC = outputsPredict.getMatrixCharacteristics("means")
> > > // -----------------------------------------
> > > // Get predicted label
> > > ml.reset()
> > > ml.registerInput("Prob", prob, probMC)
> > > ml.registerOutput("Prediction")
> > > val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
> > >   + "Prediction = rowIndexMax(Prob); "
> > >   + "write(Prediction, \"tempOut\", \"csv\")")
> > > val pred = outputsLabels.getDF(sqlContext, "Prediction")
> > >   .withColumnRenamed("C1", "prediction")
> > > // -----------------------------------------
> > >
> > >
> > > 3. Say I get back prediction matrix as an output (from predictions =
> > > rowIndexMax(means);). Now can I add that as a column to my original
> > > data frame (the one from which I created the feature vector for the
> > > original model) ? My concern is whether adding back will ensure the right
> > > order so that the key for the feature vector and the predicted value
> > > remain same ? If not how to achieve the same ?
> > > In above example 'pred' is a DataFrame with column 'ID' which provides
> > > the row ID.
> > >
> > > Thanks,
> > >
> > > Niketan Pansare
> > > IBM Almaden Research Center
> > > E-mail: npansar At us.ibm.com
> > >
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > >
> > > From: Sourav Mazumder <so...@gmail.com>
> > > To: dev@systemml.incubator.apache.org, Niketan
> Pansare/Almaden/IBM@IBMUS
> > > Date: 12/08/2015 10:53 PM
> > > Subject: Re: Using GLM-predict
> > > ------------------------------
> > >
> > >
> > >
> > > Hi Niketan,
> > >
> > > Thanks again for the detailed inputs.
> > >
> > > Some more follow up Qs -
> > >
> > > 1. In the GLM-predict.dml I could see 'means' is the output variable.
> In
> > my
> > > understanding it is same as the probability matrix u have mentioned
in
> > your
> > > mail (to be used to compute the prediction). Am I right ?
> > >
> > > 2. From GLM.dml I get the 'betas' as output using
> > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to
> > GLM-predict.dml
> > > as B. For registering B following statements are used
> > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > ml.registerInput("B", beta, 1, 4) // I have four feature vectors so I
> > get 4
> > > coefficients
> > >
> > > However, when I execute GLM-predict.dml I get following error.
> > >
> > > val outputs =
> > > ml.execute
("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > > cmdLineParams)
> > >
> > > 15/12/09 05:32:47 WARN Expression: Metadata file:  .mtd not provided
> > > 15/12/09 05:32:47 ERROR Expression: ERROR:
> > > /home/system-ml-0.9.0-SNAPSHOT/algori
> > > thms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete
> > > dimensio
> > > n information in read statement:  .mtd
> > > com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR:
> > > /home/syste
> > > m-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8
--
> > > Miss
> > > ing or incomplete dimension information in read statement:  .mtd
> > >
> > > In line 117 we have following statement : X = read (fileX);
> > >
> > > 3. Say I get back prediction matrix as an output (from predictions =
> > > rowIndexMax(means);). Now can I add that as a column to my original
> > > data frame (the one from which I created the feature vector for the
> > > original model) ? My concern is whether adding back will ensure the right
> > > order so that the key for the feature vector and the predicted value
> > > remain same ? If not how to achieve the same ?
> > >
> > > Regards,
> > > Sourav
> > >
> > >
> > >
> > >
> > >
> > > On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <np...@us.ibm.com>
> > > wrote:
> > >
> > > > Hi Sourav,
> > > >
> > > > For some reason, I didn't get your email on "*Tue, 08 Dec 2015
> 12:56:38
> > > > -0800*
> > > > <
> > >
> >
>
https://www.mail-archive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208

> > >
> > > "
> > > > (which I noticed in the archive).
> > > >
> > > > >> Not sure how exactly I can modify the GLM-predict.dml to get
some
> > > > prediction to start with.
> > > > There are two options here:
> > > > 1. Modify GLM-predict.dml as suggested by Shirish (better approach
> with
> > > > respect to the SystemML optimizer) or
> > > >
> > > > 2. Run a new script on the output of GLM-predict. Please see:
> > > >
> > >
> >
>
https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163

> > > > If you chose to go with option 2, you might also want to read the
> > > > documentation of following two built-in functions:
> > > > a. rowIndexMax (See
> > > >
> > >
> >
>
http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions

> > > > <
> > >
> >
>
http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions

> > > >
> > > > )
> > > > b. ppred
> > > >
> > > > >> Can you give me some idea how from here I can calculate the
> > predicted
> > > > value of the label using some value of probability threshold ?
> > > > Very simple way to predict the label given probability matrix:
> > > > Prediction = rowIndexMax(Prob) # predicts the label with highest
> > > > probability. This assumes one-based labels.
> > > >
> > > > Thanks,
> > > >
> > > > Niketan Pansare
> > > > IBM Almaden Research Center
> > > > E-mail: npansar At us.ibm.com
> > > >
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > >
> > > > From: Shirish Tatikonda <sh...@gmail.com>
> > > > To: dev@systemml.incubator.apache.org
> > > > Date: 12/08/2015 12:49 PM
> > > > Subject: Re: Using GLM-predict
> > > > ------------------------------
> > > >
> > > >
> > > >
> > > > Hi Sourav,
> > > >
> > > > Yes, GLM-predict.dml gives out only the probabilities. You can put
a
> > > > threshold on the resulting probabilities to get the actual class
> labels
> > > --
> > > > for example, prob > 0.5 is positive and <=0.5 as negative.
> > > >
> > > > The exact value of threshold typically depends on the data and the
> > > > application. Different thresholds yield different classifiers with
> > > > different performance (precision, recall, etc.). You can find the
> best
> > > > threshold for the given data set by finding a value that gives the
> > > desired
> > > > classifier performance (for example, a threshold that gives roughly
> > equal
> > > > precision and recall). Such an optimization is obviously done
during
> > the
> > > > training phase using a held out test set.
> > > >
> > > > If you wish, you can also modify the DML script to perform this
> entire
> > > > process.
> > > >
> > > > Shirish
> > > >
> > > >
> > > > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> > > > sourav.mazumder00@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I have used GLM.dml to create a model using some sample data. It
> > > returns
> > > > to
> > > > > me the matrix of Beta, B.
> > > > >
> > > > > Now I want to use this matrix of Beta on a new set of data points
> and
> > > > > generate predicted value of the dependent variable/observation.
> > > > >
> > > > > When I checked GLM-predict, I could see that one can pass feature
> > > vector
> > > > > for the new data set and also the matrix of beta.
> > > > >
> > > > > But I could not see any way to get the predicted value of the
> > dependent
> > > > > variable/observation. The output parameter only supports matrix
of
> > > > > predicted means/probabilities.
> > > > >
> > > > > Is there a way one can get the predicted value of the dependent
> > > > > variable/observation from GLM-predict ?
> > > > >
> > > > > Regards,
> > > > > Sourav
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> >
>
>
>

Re: Using GLM-predict

Posted by Sourav Mazumder <so...@gmail.com>.

Hi Niketan,

Thanks again for such a detailed explanation. I see your last point and
agree with it. I also got your point on the use of "means" for Gaussian
vs. other distributions.
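
For reference, the last point in the quoted reply (flag one class whenever
its probability exceeds a threshold, otherwise take the most probable label)
can be sketched for a single row in plain Scala. This is only an
illustration of the rule, not SystemML code: the object ThresholdLabel and
its methods are hypothetical names, labels are assumed to be one-based
column indices, and ties for the maximum resolve to the first column here,
which is an assumption of this sketch.

```scala
object ThresholdLabel {
  // One-based index of the largest probability in a row
  // (the single-row analogue of "rowIndexMax" in the quoted DML).
  def rowIndexMax(probs: Vector[Double]): Int =
    probs.indexOf(probs.max) + 1

  // If the flagged class (one-based column `flaggedCol`) has probability
  // above `threshold`, predict it; otherwise predict the most probable class.
  def predict(probs: Vector[Double], flaggedCol: Int, threshold: Double): Int =
    if (probs(flaggedCol - 1) > threshold) flaggedCol else rowIndexMax(probs)
}
```

Using the probabilities from the quoted example, with "cancer" as column 1
and an illustrative threshold of 0.1 (a made-up value), the row
(0.2, 0.15, 0.15, 0.2, 0.3) is labelled 1 rather than 5 ("normal").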

However, I'm still not convinced about the approach you mentioned for
correlating the unique id. I've already tried code similar to what you
sent, where I used the VectorAssembler utility of Spark MLlib.

Let me try to explain the problem with more details -

1. Say my original data frame DF1 is distributed across 3 worker nodes in a
Spark cluster, 20 rows on each, 60 rows in total. DF1 also has a unique
identifier column, say unique_id.
2. I use your code to create the feature vector from DF1 and pass it to
GLM-predict, which returns another data frame (say DF5) of "means" (in this
case, the predictions). However, the rows of DF5 may be distributed across 4
worker nodes, 15 rows on each, still 60 rows in total.
3. If I now simply append the two columns of DF5 to DF1, what guarantees
that each unique_id in DF1 gets the mean/predicted value that actually
corresponds to it?

Regards,
Sourav



On Wed, Dec 9, 2015 at 4:14 PM, Niketan Pansare <np...@us.ibm.com> wrote:

> Hi Sourav,
>
> Please see below comments:
>
> >> I was basically hoping for some sort of API where one can pass the
> original
> data frame and from that dataframe can specify the columns to be used as
> feature and the column to be used for label. This model can work well for
> both creating the model and getting the prediction.
> Please use the most recent jar from git. To extract X and Y from your
> dataframe without IDs, use the following code:
> import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> import org.apache.sysml.runtime.matrix.MatrixCharacteristics
> val features = Array("lat", "height", "precipitation", "pressure")
> // SystemML will set them for you if the dimensions are unknown
> val Xmc = new MatrixCharacteristics()
> val Ymc = new MatrixCharacteristics()
> val X = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Xmc, features)
> val Y = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Ymc,
>   Array("temperature"))
>
> If you want to add a specific ordering to your DataFrame rows (let's say for
> prediction ... in most cases it is not required), use the following method:
> import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> df = RDDConverterUtilsExt.addIDToDataFrame(df, sqlContext, "ID")
>
> >> 1. Yes dependent variables are nothing but labels
> 2. The values of the dependent variable are not 1 to totalNumOfClasses. The
> values can be any double number. For example say in a weather data set you
> have fields like lat, long, height (from sea level), precipitation,
> pressure, temperature. Now one way you can create a model where Temperature
> is the dependent variable and other are features (the hypothesis is
> Temperature is some function of pressure, precipitation, height, latitude
> and longitude).
> Sorry, in this case, please ignore my earlier suggestion of "Prediction =
> rowIndexMax(Prob)" as it applies only to classification.
> In your case, the returned values are "means" of the distribution family
> which was used (See
> http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models).
> If Gaussian distribution was used (dfam=1, vpow=0.0), and if the problem
> was linear and if you expected pointy-hat distribution (i.e. positive
> kurtosis), then you can simply return the mean as predicted label. This is
> because in case of Gaussian distribution, mean is also the mode. In other
> case, it might not necessarily be true.
>
> You may ask why are we making it so complicated and why not just return
> the predicted labels instead of probability ?
> Well, the problem of labelling is not as simple as it appears and it
> highly depends on the problem setting. Let's consider the problem of
> multi-class classification and my earlier suggestion "Prediction =
> rowIndexMax(Prob)". Also, let the labels be as follows = {cancer, sore
> throat, birth defect, fever, normal}. If for a given test example, let's
> say GLM-predict.dml outputs following probability = {cancer: 0.2, sore
> throat: 0.15, birth defect: 0.15, fever: 0.2, normal:0.3}. Then according
> to "Prediction = rowIndexMax(Prob)", we should output the label "normal"
> and send the patient home ... right ? No. In this case, 20% probability of
> cancer is just way too high for a doctor to send the patient home. In this
> setting, the doctor might then say to the data scientist: I know that based
> on the prevalence of cancer in general public, and based on that domain
> knowledge, I suggest that probability over "threshold" should always be
> flagged as cancer. Else output the label with highest probability. Using
> this suggestion, the data scientist modifies the DML as follows:
> zeroOneMat = ppred(prob[, cancerColID], threshold, ">")
> prediction = zeroOneMat*cancerColID + (1-zeroOneMat)*rowIndexMax(prob)
>
> This also shows the usefulness of "Declarative Machine Learning" :)
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> From: Sourav Mazumder <so...@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/09/2015 01:15 PM
> Subject: Re: Using GLM-predict
> ------------------------------
>
>
>
> Hi Niketan,
>
> Firstly to answer your Qs -
>
> 1. Yes dependent variables are nothing but labels
> 2. The values of the dependent variable are not 1 to totalNumOfClasses. The
> values can be any double number. For example say in a weather data set you
> have fields like lat, long, height (from sea level), precipitation,
> pressure, temperature. Now one way you can create a model where Temperature
> is the dependent variable and other are features (the hypothesis is
> Temperature is some function of pressure, precipitation, height, latitude
> and longitude).
>
> Not sure about the correlation between step 2 and step 3 in your mail. In
> step 3 does one have to pass 'ID' column (created in step 2) to varName
> while calling registerInput(String varName, DataFrame df, containsID) ?
>
> However the unique Id in typical case can be string. Can't that be used as
> is instead ? This means one has to first convert the original unique id to
> integer to create an additional unique id column and then again later on
> that integer unique id has to be mapped back.
>
> I was basically hoping for some sort of API where one can pass the original
> data frame and from that dataframe can specify the columns to be used as
> feature and the column to be used for label. This model can work well for
> both creating the model and getting the prediction.
>
> Regards,
> Sourav
>
> On Wed, Dec 9, 2015 at 12:53 PM, Niketan Pansare <np...@us.ibm.com>
> wrote:
>
> > Hi Sourav,
> >
> > Couple of questions to make sure we are on same page: does the "dependent
> > variable (double)" represent the class labels ? Are the values of the
> > class labels from 1 to numClasses (i.e. one-based) ?
> >
> > Here are few comments regarding correlating IDs:
> >
> > To map an unordered collection (i.e. DataFrame) to an ordered
> > collection ("Matrix"), we add a special column "ID" which represents the
> > *one-based row index*. Please perform following steps:
> > 1. Accept recent changes from https://github.com/apache/incubator-systemml
> > and use the generated jar.
> >
> > 2. Map the unique id in DF1 to int (*1 to number of rows*) and call that
> > column 'ID'.
> >
> > 3. Use the variant of registerInput for both X (both for training and
> > predicting) and Y:
> > registerInput(String varName, DataFrame df, *boolean* containsID)
> >
> > As a side note: instead of separate double columns, you can represent them
> > using VectorUDT and use our converter "JavaPairRDD<MatrixIndexes,
> > MatrixBlock> vectorDataFrameToBinaryBlock(JavaSparkContext sc, DataFrame
> > inputDF, MatrixCharacteristics mcOut, *boolean* containsID, String
> > vectorColumnName) "
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> > From: Sourav Mazumder <so...@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 12/09/2015 11:15 AM
> > Subject: Re: Using GLM-predict
> > ------------------------------
> >
> >
> >
> > Hi Niketan,
> >
> > The code you provided works fine. The use of getMatrixCharacteristics
> > solves the basic execution problem.
> >
> > However, question #3 is probably not yet resolved. Let me explain the use
> > case scenario I'm trying to build.
> >
> > 1. Say I have a data frame (DF1) with a Unique Id (string), a bunch of
> > columns (say 4) which are to be used as features (double), and a column for
> > the dependent variable (double).
> > 2. When I created the model I created a data frame (DF2) from DF1 using
> > only the feature vectors and pass that as X. And the column with dependent
> > value is passed as Y.
> > 3. For calling the GLM-predict I'm using another data frame (DF3) of same
> > structure but with different Unique ID (essentially different
> > records/rows). From that data frame I'm first creating another data frame
> > (DF4) containing the columns representing the features. Then I'm sending
> > DF4 to GLM-predict which has only feature vectors.
> > 4. The response I get from GLM-predict is the 'means'. Then I'm using the
> > inline predict script which returns another data frame (DF5) with ID and
> > Predicted values.
> >
> > The question is how do I correlate the ID I'm getting from DF5 with the
> > Unique ID of the data frame DF3 ?
> >
> > Regards,
> > Sourav
> >
> >
> >
> >
> > On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare <np...@us.ibm.com>
> > wrote:
> >
> > > Hi Sourav,
> > >
> > > 1. In the GLM-predict.dml I could see 'means' is the output variable.
> In
> > my
> > > understanding it is same as the probability matrix u have mentioned in
> > your
> > > mail (to be used to compute the prediction). Am I right ?
> > > Yes, that's correct.
> > >
> > > 2. From GLM.dml I get the 'betas' as output using
> > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to
> > GLM-predict.dml
> > > as B.
> > >
> > > Can you try this ?
> > > // Get output from GLM
> > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > // This way you don't have to worry about dimensions.
> > > val betaMC = outputs.getMatrixCharacteristics("beta_out")
> > > // -----------------------------------------
> > > // val Xin = DataFrame/RDD of values (or even text/csv file) you want to
> > > // predict
> > > // -----------------------------------------
> > > // Execute GLM-predict
> > > ml.reset()
> > > // Please read
> > > // https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> > > // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
> > > // family of distribution ?
> > > val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...")
> > > ml.registerInput("X", Xin)
> > > ml.registerInput("B_full", beta, betaMC)
> > > ml.registerOutput("means")
> > > val outputsPredict = ml.execute(
> > >   "/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > >   cmdLineParamsPredict)
> > > val prob = outputsPredict.getBinaryBlockedRDD("means")
> > > val probMC = outputsPredict.getMatrixCharacteristics("means")
> > > // -----------------------------------------
> > > // Get predicted label
> > > ml.reset()
> > > ml.registerInput("Prob", prob, probMC)
> > > ml.registerOutput("Prediction")
> > > val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
> > >   + "Prediction = rowIndexMax(Prob); "
> > >   + "write(Prediction, \"tempOut\", \"csv\")")
> > > val pred = outputsLabels.getDF(sqlContext, "Prediction")
> > >   .withColumnRenamed("C1", "prediction")
> > > // -----------------------------------------
> > >
> > >
> > > 3. Say I get back prediction matrix as an output (from predictions =
> > > rowIndexMax(means);). Now can I add that as a column to my original
> > > data frame (the one from which I created the feature vector for the
> > > original model) ? My concern is whether adding back will ensure the right
> > > order so that the key for the feature vector and the predicted value
> > > remain same ? If not how to achieve the same ?
> > > In above example 'pred' is a DataFrame with column 'ID' which provides
> > > the row ID.
> > >
> > > Thanks,
> > >
> > > Niketan Pansare
> > > IBM Almaden Research Center
> > > E-mail: npansar At us.ibm.com
> > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > >
> > > From: Sourav Mazumder <so...@gmail.com>
> > > To: dev@systemml.incubator.apache.org, Niketan
> Pansare/Almaden/IBM@IBMUS
> > > Date: 12/08/2015 10:53 PM
> > > Subject: Re: Using GLM-predict
> > > ------------------------------
> > >
> > >
> > >
> > > Hi Niketan,
> > >
> > > Thanks again for the detailed inputs.
> > >
> > > Some more follow up Qs -
> > >
> > > 1. In the GLM-predict.dml I could see 'means' is the output variable. In
> > > my understanding it is the same as the probability matrix you mentioned
> > > in your mail (to be used to compute the prediction). Am I right?
> > >
> > > 2. From GLM.dml I get the 'betas' as output using
> > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to
> > > GLM-predict.dml as B. For registering B the following statements are used:
> > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > ml.registerInput("B", beta, 1, 4) // I have four feature vectors so I get 4 coefficients
> > >
> > > However, when I execute GLM-predict.dml I get following error.
> > >
> > > val outputs =
> > > ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > > cmdLineParams)
> > >
> > > 15/12/09 05:32:47 WARN Expression: Metadata file:  .mtd not provided
> > > 15/12/09 05:32:47 ERROR Expression: ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd
> > > com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd
> > >
> > > In line 117 we have following statement : X = read (fileX);
> > >
> > > 3. Say I get back prediction matrix as an output (from predictions =
> > > rowIndexMax(means);). Now can I add that as a column to my original
> > > data frame (the one from which I created the feature vector for the
> > > original model)? My concern is whether adding it back will ensure the
> > > right order, so that the key for the feature vector and the predicted
> > > value remain the same. If not, how can I achieve that?
> > >
> > > Regards,
> > > Sourav
> > >
> > >
> > >
> > >
> > >
> > > On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <np...@us.ibm.com>
> > > wrote:
> > >
> > > > Hi Sourav,
> > > >
> > > > For some reason, I didn't get your email of "Tue, 08 Dec 2015 12:56:38 -0800"
> > > > <https://www.mail-archive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208>
> > > > (which I noticed in the archive).
> > > >
> > > > >> Not sure how exactly I can modify the GLM-predict.dml to get some
> > > > >> prediction to start with.
> > > > There are two options here:
> > > > 1. Modify GLM-predict.dml as suggested by Shirish (the better approach
> > > > with respect to the SystemML optimizer), or
> > > >
> > > > 2. Run a new script on the output of GLM-predict. Please see:
> > > > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
> > > > If you choose to go with option 2, you might also want to read the
> > > > documentation of the following two built-in functions:
> > > > a. rowIndexMax (see
> > > > http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions)
> > > > b. ppred
> > > >
> > > > >> Can you give me some idea how from here I can calculate the predicted
> > > > >> value of the label using some value of probability threshold ?
> > > > A very simple way to predict the label given the probability matrix:
> > > > Prediction = rowIndexMax(Prob) # predicts the label with highest
> > > > probability. This assumes one-based labels.
> > > >
> > > > Thanks,
> > > >
> > > > Niketan Pansare
> > > > IBM Almaden Research Center
> > > > E-mail: npansar At us.ibm.com
> > > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > >
> > > >
> > > > From: Shirish Tatikonda <sh...@gmail.com>
> > > > To: dev@systemml.incubator.apache.org
> > > > Date: 12/08/2015 12:49 PM
> > > > Subject: Re: Using GLM-predict
> > > > ------------------------------
> > > >
> > > >
> > > >
> > > > Hi Sourav,
> > > >
> > > > Yes, GLM-predict.dml gives out only the probabilities. You can put a
> > > > threshold on the resulting probabilities to get the actual class
> > > > labels -- for example, prob > 0.5 is positive and prob <= 0.5 is
> > > > negative.
> > > >
> > > > The exact value of the threshold typically depends on the data and the
> > > > application. Different thresholds yield different classifiers with
> > > > different performance (precision, recall, etc.). You can find the best
> > > > threshold for the given data set by finding a value that gives the
> > > > desired classifier performance (for example, a threshold that gives
> > > > roughly equal precision and recall). Such an optimization is obviously
> > > > done during the training phase using a held-out test set.
> > > >
> > > > If you wish, you can also modify the DML script to perform this entire
> > > > process.
> > > >
> > > > Shirish
> > > >
> > > >
> > > > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> > > > sourav.mazumder00@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I have used GLM.dml to create a model using some sample data. It
> > > > > returns to me the matrix of Beta, B.
> > > > >
> > > > > Now I want to use this matrix of Beta on a new set of data points and
> > > > > generate predicted value of the dependent variable/observation.
> > > > >
> > > > > When I checked GLM-predict, I could see that one can pass feature
> > > > > vector for the new data set and also the matrix of beta.
> > > > >
> > > > > But I could not see any way to get the predicted value of the
> > > > > dependent variable/observation. The output parameter only supports
> > > > > matrix of predicted means/probabilities.
> > > > >
> > > > > Is there a way one can get the predicted value of the dependent
> > > > > variable/observation from GLM-predict ?
> > > > >
> > > > > Regards,
> > > > > Sourav
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> >
>
>
>

Re: Using GLM-predict

Posted by Niketan Pansare <np...@us.ibm.com>.

Hi Sourav,

Please see below comments:

>> I was basically hoping for some sort of API where one can pass the
>> original data frame and from that dataframe can specify the columns to be
>> used as feature and the column to be used for label. This model can work
>> well for both creating the model and getting the prediction.
Please use the most recent jar from git. To extract X and Y from your
dataframe without IDs, use the following code:
import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
import org.apache.sysml.runtime.matrix.MatrixCharacteristics
val features = Array("lat", "height", "precipitation", "pressure")
// SystemML will set the dimensions for you if they are unknown
val Xmc = new MatrixCharacteristics()
val Ymc = new MatrixCharacteristics()
val X = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Xmc, features)
val Y = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Ymc, Array("temperature"))

If you want to add a specific ordering to your DataFrame rows (let's say for
prediction; in most cases it is not required), use the following method:
import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
val dfWithID = RDDConverterUtilsExt.addIDToDataFrame(df, sqlContext, "ID")
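
To illustrate what the "ID" column is for, here is a minimal, self-contained
Scala sketch that uses plain collections rather than SystemML. The names
RowIdSketch, assignIds and joinBack are hypothetical and not part of any
SystemML API, and the station keys and predicted values are made up. It
assigns one-based row indexes to string keys and then uses those indexes to
pair predictions back with the original keys, whatever order the predictions
arrive in.

```scala
object RowIdSketch {
  // One-based row index for each string key, in the order the rows are given.
  def assignIds(keys: Seq[String]): Seq[(String, Long)] =
    keys.zipWithIndex.map { case (k, i) => (k, i + 1L) }

  // Pair predictions (keyed by one-based row index) with the original keys.
  // Keys without a prediction are dropped.
  def joinBack(ids: Seq[(String, Long)], preds: Map[Long, Double]): Seq[(String, Double)] =
    ids.flatMap { case (k, id) => preds.get(id).map(p => (k, p)) }

  def main(args: Array[String]): Unit = {
    val ids = assignIds(Seq("station-A", "station-B", "station-C"))
    // Hypothetical predictions, deliberately listed out of order.
    val preds = Map(3L -> 11.5, 1L -> 20.0, 2L -> 17.25)
    joinBack(ids, preds).foreach(println)
  }
}
```

With Spark DataFrames the same idea would presumably be a join on the "ID"
column, but that part is left out here to keep the sketch free of
dependencies.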

>> 1. Yes dependent variables are nothing but labels
>> 2. The values of the dependent variable are not 1 to totalNumOfClasses. The
>> values can be any double number. For example say in a weather data set you
>> have fields like lat, long, height (from sea level), precipitation,
>> pressure, temperature. Now one way you can create a model where Temperature
>> is the dependent variable and other are features (the hypothesis is
>> Temperature is some function of pressure, precipitation, height, latitude
>> and longitude).
Sorry, in this case, please ignore my earlier suggestion of "Prediction =
rowIndexMax(Prob)", as it applies only to classification.
In your case, the returned values are the "means" of the distribution family
that was used (see
http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
). If a Gaussian distribution was used (dfam=1, vpow=0.0), the problem is
linear, and you expect a pointy-hat distribution (i.e. positive kurtosis),
then you can simply return the mean as the predicted label. This is because
for a Gaussian distribution the mean is also the mode. In other cases, that
is not necessarily true.
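
As a rough illustration of the Gaussian case, and assuming the model was fit
with an identity link, the predicted mean is simply the intercept plus the
dot product of the features and the coefficients. The sketch below is plain
Scala, not SystemML; GaussianMeanSketch and predictMean are hypothetical
names, and the coefficient and feature values are invented for illustration
rather than taken from a fitted model.

```scala
object GaussianMeanSketch {
  // Mean under an identity link: intercept + dot(features, coefficients).
  def predictMean(x: Array[Double], coef: Array[Double], intercept: Double): Double = {
    require(x.length == coef.length, "one coefficient per feature")
    intercept + x.zip(coef).map { case (xi, bi) => xi * bi }.sum
  }

  def main(args: Array[String]): Unit = {
    // Invented coefficients for lat, height, precipitation, pressure.
    val coef = Array(-0.5, -0.006, 0.01, 0.02)
    val x = Array(37.0, 100.0, 2.0, 1013.0)
    // For a Gaussian, this mean is also the mode, so it can serve as the prediction.
    println(predictMean(x, coef, 10.0))
  }
}
```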

You may ask why we are making it so complicated, and why we do not just
return the predicted labels instead of probabilities.
Well, the problem of labelling is not as simple as it appears, and it depends
heavily on the problem setting. Let's consider the problem of multi-class
classification and my earlier suggestion "Prediction = rowIndexMax(Prob)".
Let the labels be {cancer, sore throat, birth defect, fever, normal}, and
let's say that for a given test example GLM-predict.dml outputs the following
probabilities: {cancer: 0.2, sore throat: 0.15, birth defect: 0.15, fever:
0.2, normal: 0.3}. Then according to "Prediction = rowIndexMax(Prob)", we
should output the label "normal" and send the patient home ... right? No. In
this case, a 20% probability of cancer is just way too high for a doctor to
send the patient home. In this setting, the doctor might say to the data
scientist: based on the prevalence of cancer in the general public and on my
domain knowledge, I suggest that a cancer probability over "threshold"
should always be flagged as cancer; otherwise, output the label with the
highest probability. Using this suggestion, the data scientist modifies the
DML as follows:
zeroOneMat = ppred(prob[, cancerColID], threshold, ">")
prediction = zeroOneMat*cancerColID + (1-zeroOneMat)*rowIndexMax(prob)
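
The same rule can be sketched for a single row of probabilities in plain
Scala, which may make the arithmetic easier to follow. This is only an
illustration under stated assumptions: ThresholdSketch, rowIndexMax and
predict are hypothetical helpers written for this example (the rowIndexMax
here merely mimics the DML built-in for one row and picks the first maximum
on ties, which may differ from DML), and the threshold of 0.1 is an
arbitrary value.

```scala
object ThresholdSketch {
  // One-based index of the largest entry; the first maximum wins on ties.
  def rowIndexMax(row: Array[Double]): Int = row.indexOf(row.max) + 1

  // Flag the given one-based class when its probability exceeds the
  // threshold; otherwise fall back to the most probable class.
  def predict(row: Array[Double], flaggedCol: Int, threshold: Double): Int =
    if (row(flaggedCol - 1) > threshold) flaggedCol else rowIndexMax(row)

  def main(args: Array[String]): Unit = {
    // Columns: cancer, sore throat, birth defect, fever, normal
    val prob = Array(0.2, 0.15, 0.15, 0.2, 0.3)
    println(rowIndexMax(prob))     // plain argmax picks column 5 ("normal")
    println(predict(prob, 1, 0.1)) // the override picks column 1 ("cancer")
  }
}
```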

This also shows the usefulness of "Declarative Machine Learning" :)

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar




Re: Using GLM-predict

Posted by Sourav Mazumder <so...@gmail.com>.

Hi Niketan,

First, to answer your questions:

1. Yes, the dependent variables are nothing but labels.
2. The values of the dependent variable are not 1 to totalNumOfClasses. The
values can be any double. For example, say a weather data set has fields
like lat, long, height (from sea level), precipitation, pressure, and
temperature. One can then create a model where temperature is the dependent
variable and the others are features (the hypothesis being that temperature
is some function of pressure, precipitation, height, latitude and longitude).

I am not sure about the relation between step 2 and step 3 in your mail. In
step 3, does one have to pass the 'ID' column (created in step 2) to varName
while calling registerInput(String varName, DataFrame df, containsID)?

However, the unique ID is typically a string. Can't that be used as is
instead? Otherwise one has to first convert the original unique ID to an
integer to create an additional unique ID column, and then later map that
integer ID back.

I was basically hoping for some sort of API where one can pass the original
data frame and specify which of its columns are to be used as features and
which column as the label. This approach would work well for both creating
the model and getting the prediction.

Regards,
Sourav

On Wed, Dec 9, 2015 at 12:53 PM, Niketan Pansare <np...@us.ibm.com> wrote:

> Hi Sourav,
>
> A couple of questions to make sure we are on the same page: does the
> "dependent variable (double)" represent the class labels? Are the values of
> the class labels from 1 to numClasses (i.e. one-based)?
>
> Here are a few comments regarding correlating IDs:
>
> To map an unordered collection (i.e. a DataFrame) to an ordered
> collection (a "Matrix"), we add a special column "ID" which represents the
> one-based row index. Please perform the following steps:
> 1. Accept recent changes from https://github.com/apache/incubator-systemml
> and use the generated jar.
>
> 2. Map the unique id in DF1 to int (1 to number of rows) and call that
> column 'ID'.
>
> 3. Use the variant of registerInput for both X (both for training and
> predicting) and Y:
> registerInput(String varName, DataFrame df, boolean containsID)
>
> As a side note: instead of separate double columns, you can represent them
> using VectorUDT and use our converter "JavaPairRDD<MatrixIndexes,
> MatrixBlock> vectorDataFrameToBinaryBlock(JavaSparkContext sc, DataFrame
> inputDF, MatrixCharacteristics mcOut, boolean containsID, String
> vectorColumnName)"
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
>
> From: Sourav Mazumder <so...@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/09/2015 11:15 AM
> Subject: Re: Using GLM-predict
> ------------------------------
>
>
>
> Hi Niketan,
>
> The code you provided works fine. The use of getMatrixCharacteristics
> solves the basic execution problem.
>
> However, question #3 is probably not yet resolved. Let me explain the use
> case scenario I'm trying to build.
>
> 1. Say I have a data frame (DF1) with a Unique Id (string), a bunch of
> columns (say 4) which are to be used as features (double), and a column for
> the dependent variable (double).
> 2. When I created the model I created a data frame (DF2) from DF1 using
> only the feature vectors and pass that as X. And the column with dependent
> value is passed as Y.
> 3. For calling the GLM-predict I'm using another data frame (DF3) of same
> structure but with different Unique ID (essentially different
> records/rows). From that data frame I'm first creating another data frame
> (DF4) containing the columns representing the features. Then I'm sending
> DF4 to GLM-predict which has only feature vectors.
> 4. The response I get from GLM-predict is the 'means'. Then I'm using the
> inline predict script which returns another data frame (DF5) with ID and
> Predicted values.
>
> The question is how do I correlate the ID I'm getting from DF5 with the
> Unique ID of the data frame DF3 ?
>
> Regards,
> Sourav
>
>
>
>
> On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare <np...@us.ibm.com>
> wrote:
>
> > Hi Sourav,
> >
> > 1. In the GLM-predict.dml I could see 'means' is the output variable. In
> > my understanding it is the same as the probability matrix you mentioned
> > in your mail (to be used to compute the prediction). Am I right?
> > Yes, that's correct.
> >
> > 2. From GLM.dml I get the 'betas' as output using
> > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to
> > GLM-predict.dml as B.
> >
> > Can you try this ?
> > // Get output from GLM
> > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > // This way you don't have to worry about dimensions.
> > val betaMC = outputs.getMatrixCharacteristics("beta_out")
> > // -----------------------------------------
> > // val Xin = <DataFrame/RDD of values (or even text/csv file) you want to predict>
> > // -----------------------------------------
> > // Execute GLM-predict
> > ml.reset()
> > // Please read
> > // https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> > // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
> > // Which family of distribution?
> > val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...")
> > ml.registerInput("X", Xin)
> > ml.registerInput("B_full", beta, betaMC)
> > ml.registerOutput("means")
> > val outputsPredict =
> > ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > cmdLineParamsPredict)
> > val prob = outputsPredict.getBinaryBlockedRDD("means")
> > val probMC = outputsPredict.getMatrixCharacteristics("means")
> > // -----------------------------------------
> > // Get predicted label
> > ml.reset()
> > ml.registerInput("Prob", prob, probMC)
> > ml.registerOutput("Prediction")
> > val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
> > + "Prediction = rowIndexMax(Prob); "
> > + "write(Prediction, \"tempOut\", \"csv\")")
> > val pred = outputsLabels.getDF(sqlContext,
> > "Prediction").withColumnRenamed("C1", "prediction")
> > // -----------------------------------------
> >
> >
> > 3. Say I get back prediction matrix as an output (from predictions =
> > rowIndexMax(means);). Now can I add that as a column to my original
> > data frame (the one from which I created the feature vector for the
> > original model)? My concern is whether adding it back will ensure the
> > right order, so that the key for the feature vector and the predicted
> > value remain the same. If not, how can I achieve that?
> > In the above example, 'pred' is a DataFrame with a column 'ID' which
> > provides the row ID.
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> >
> > From: Sourav Mazumder <so...@gmail.com>
> > To: dev@systemml.incubator.apache.org, Niketan Pansare/Almaden/IBM@IBMUS
> > Date: 12/08/2015 10:53 PM
> > Subject: Re: Using GLM-predict
> > ------------------------------
> >
> >
> >
> > Hi Niketan,
> >
> > Thanks again for the detailed inputs.
> >
> > Some more follow up Qs -
> >
> > 1. In the GLM-predict.dml I could see 'means' is the output variable. In
> my
> > understanding it is same as the probability matrix u have mentioned in
> your
> > mail (to be used to compute the prediction). Am I right ?
> >
> > 2. From GLM.dml I get the 'betas' as output using
> > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to
> GLM-predict.dml
> > as B. For registering B following statements are used
> > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > ml.registerInput("B", beta, 1, 4) // I have four feature vectors so I
> get 4
> > coefficients
> >
> > However, when I execute GLM-predict.dml I get following error.
> >
> > val outputs =
> > ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > cmdLineParams)
> >
> > 15/12/09 05:32:47 WARN Expression: Metadata file:  .mtd not provided
> > 15/12/09 05:32:47 ERROR Expression: ERROR:
> > /home/system-ml-0.9.0-SNAPSHOT/algori
> > thms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete
> > dimensio
> > n information in read statement:  .mtd
> > com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR:
> > /home/syste
> > m-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 --
> > Miss
> > ing or incomplete dimension information in read statement:  .mtd
> >
> > In line 117 we have following statement : X = read (fileX);
> >
> > 3. Say I get back prediction matrix as an output (from predictions =
> > rowIndexMax(means);). Now can I read add that as a column to my original
> > data frame (the one from which I created the feature vector for the
> > original model) ? My concern is whether adding back will ensure the right
> > order so that teh key for the feature vector and the predicted value
> remain
> > same ? If not how to achieve the same ?
> >
> > Regards,
> > Sourav
> >
> >
> >
> >
> >
> > On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <np...@us.ibm.com>
> > wrote:
> >
> > > Hi Sourav,
> > >
> > > For some reason, I didn't get your email on "*Tue, 08 Dec 2015 12:56:38
> > > -0800*
> > > <
> >
> https://www.mail-archive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208
> >
> > "
> > > (which I noticed in the archive).
> > >
> > > >> Not sure how exactly I can modify the GLM-predict.dml to get some
> > > prediction to start with.
> > > There are two options here:
> > > 1. Modify GLM-predict.dml as suggested by Shirish (better approach with
> > > respect to the SystemML optimizer) or
> > >
> > > 2. Run a new script on the output of GLM-predict. Please see:
> > >
> >
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
> > > If you chose to go with option 2, you might also want to read the
> > > documentation of following two built-in functions:
> > > a. rowIndexMax (See
> > >
> >
> http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions
> > > <
> >
> http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions
> > >
> > > )
> > > b. ppred
> > >
> > > >> Can you give me some idea how from here I can calculate the
> predicted
> > > value of the label using some value of probability threshold ?
> > > Very simple way to predict the label given probability matrix:
> > > Prediction = rowIndexMax(Prob) # predicts the label with highest
> > > probability. This assumes one-based labels.
> > >
> > > Thanks,
> > >
> > > Niketan Pansare
> > > IBM Almaden Research Center
> > > E-mail: npansar At us.ibm.com
> > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > >
> > > [image: Inactive hide details for Shirish Tatikonda ---12/08/2015
> > 12:49:47
> > > PM---Hi Sourav, Yes, GLM-predict.dml gives out only the prob]Shirish
> > > Tatikonda ---12/08/2015 12:49:47 PM---Hi Sourav, Yes, GLM-predict.dml
> > gives
> > > out only the probabilities. You can put a
> > >
> > > From: Shirish Tatikonda <sh...@gmail.com>
> > > To: dev@systemml.incubator.apache.org
> > > Date: 12/08/2015 12:49 PM
> > > Subject: Re: Using GLM-predict
> > > ------------------------------
> > >
> > >
> > >
> > > Hi Sourav,
> > >
> > > Yes, GLM-predict.dml gives out only the probabilities. You can put a
> > > threshold on the resulting probabilities to get the actual class labels
> > --
> > > for example, prob > 0.5 is positive and <=0.5 as negative.
> > >
> > > The exact value of threshold typically depends on the data and the
> > > application. Different thresholds yield different classifiers with
> > > different performance (precision, recall, etc.). You can find the best
> > > threshold for the given data set by finding a value that gives the
> > desired
> > > classifier performance (for example, a threshold that gives roughly
> equal
> > > precision and recall). Such an optimization is obviously done during
> the
> > > training phase using a held out test set.
> > >
> > > If you wish, you can also modify the DML script to perform this entire
> > > process.
> > >
> > > Shirish
> > >
> > >
> > > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> > > sourav.mazumder00@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I have used GLM.dml to create a model using some sample data. It
> > returns
> > > to
> > > > me the matrix of Beta, B.
> > > >
> > > > Now I want to use this matrix of Beta on a new set of data points and
> > > > generate predicted value of the dependent variable/observation.
> > > >
> > > > When I checked GLM-predict, I could see that one can pass feature
> > vector
> > > > for the new data set and also the matrix of beta.
> > > >
> > > > But I could not see any way to get the predicted value of the
> dependent
> > > > variable/observation. The output parameter only supports matrix of
> > > > predicted means/probabilities.
> > > >
> > > > Is there a way one can get the predicted value of the dependent
> > > > variable/observation from GLM-predict ?
> > > >
> > > > Regards,
> > > > Sourav
> > > >
> > >
> > >
> > >
> >
> >
> >
>
>
>

Re: Using GLM-predict

Posted by Niketan Pansare <np...@us.ibm.com>.

Hi Sourav,

A couple of questions to make sure we are on the same page: does the
"dependent variable (double)" represent the class labels? Are the values of
the class labels from 1 to numClasses (i.e. one-based)?

Here are a few comments regarding correlating IDs:

To convert an unordered collection (i.e. a DataFrame) to an ordered
collection (a "Matrix"), we add a special column "ID" which represents the
one-based row index. Please perform the following steps:
1. Get the recent changes from https://github.com/apache/incubator-systemml
and use the generated jar.

2. Map the unique ID in DF1 to an int (1 to the number of rows) and call
that column 'ID'.

3. Use the following variant of registerInput for X (for both training and
prediction) and for Y:
registerInput(String varName, DataFrame df, boolean containsID)

As a side note: instead of separate double columns, you can represent them
using VectorUDT and use our converter "JavaPairRDD<MatrixIndexes,
MatrixBlock> vectorDataFrameToBinaryBlock(JavaSparkContext sc, DataFrame
inputDF, MatrixCharacteristics mcOut, boolean containsID, String
vectorColumnName) "
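
A minimal sketch of the idea behind steps 2 and 3, assuming plain Scala
collections as stand-ins for the DataFrames. The names Record, withRowIds and
attachPredictions are hypothetical and not part of SystemML; in Spark,
something like RDD.zipWithIndex could play the role of the index assignment.

```scala
// Hypothetical stand-ins: plain Scala collections instead of Spark DataFrames.
object IdCorrelation {
  // One row of the data to score: a string unique ID plus its feature values.
  final case class Record(uniqueId: String, features: Vector[Double])

  // Assign a one-based integer "ID" to each record in a fixed order,
  // mirroring the one-based row index described above.
  def withRowIds(records: Seq[Record]): Seq[(Int, Record)] =
    records.zipWithIndex.map { case (r, i) => (i + 1, r) }

  // Join predictions keyed by the integer ID back to the string unique ID.
  // Integer IDs that have no prediction are dropped.
  def attachPredictions(
      indexed: Seq[(Int, Record)],
      predictions: Map[Int, Double]): Seq[(String, Double)] =
    indexed.collect {
      case (id, r) if predictions.contains(id) => (r.uniqueId, predictions(id))
    }
}
```

Keeping the (ID, unique ID) pairs around is what makes it possible to look
up the original key once the predictions come back keyed only by row index.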

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar




Re: Using GLM-predict

Posted by Sourav Mazumder <so...@gmail.com>.

Hi Niketan,

The code you provided works fine. Using getMatrixCharacteristics solves the
basic execution problem.

However, question #3 is probably not yet resolved. Let me explain the use
case I'm trying to build.

1. Say I have a data frame (DF1) with a unique ID (string), a set of
columns (say 4) to be used as features (double), and a column for the
dependent variable (double).
2. When I created the model, I created a data frame (DF2) from DF1 using
only the feature columns and passed that as X. The column with the
dependent variable was passed as Y.
3. For calling GLM-predict I'm using another data frame (DF3) with the same
structure but different unique IDs (essentially different records/rows).
From that data frame I first create another data frame (DF4) containing
only the feature columns, and then send DF4 to GLM-predict.
4. The response I get from GLM-predict is 'means'. I then use the inline
predict script, which returns another data frame (DF5) with ID and
predicted values.

The question is: how do I correlate the ID I get from DF5 with the unique
ID of data frame DF3?

Regards,
Sourav





Re: Using GLM-predict

Posted by Niketan Pansare <np...@us.ibm.com>.

Hi Sourav,

> 1. In GLM-predict.dml I could see that 'means' is the output variable. In
> my understanding it is the same as the probability matrix you mentioned
> in your mail (to be used to compute the prediction). Am I right?

Yes, that's correct.

> 2. From GLM.dml I get the 'betas' as output using
> outputs.getBinaryBlockedRDD("beta_out"). I pass the same to
> GLM-predict.dml as B.

Can you try this?
// Get output from GLM
val beta = outputs.getBinaryBlockedRDD("beta_out")
// Reading the matrix characteristics this way means you don't have to
// worry about dimensions.
val betaMC = outputs.getMatrixCharacteristics("beta_out")
// -----------------------------------------
// Xin: DataFrame/RDD of values (or even a text/csv file) you want to
// predict on
// -----------------------------------------
// Execute GLM-predict
ml.reset()
// Please read
// https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
// dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
// "dfam": which family of distribution?
val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...")
ml.registerInput("X", Xin)
ml.registerInput("B_full", beta, betaMC)
ml.registerOutput("means")
val outputsPredict = ml.execute(
  "/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
  cmdLineParamsPredict)
val prob = outputsPredict.getBinaryBlockedRDD("means")
val probMC = outputsPredict.getMatrixCharacteristics("means")
// -----------------------------------------
// Get predicted label
ml.reset()
ml.registerInput("Prob", prob, probMC)
ml.registerOutput("Prediction")
val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); " +
  "Prediction = rowIndexMax(Prob); " +
  "write(Prediction, \"tempOut\", \"csv\")")
val pred = outputsLabels.getDF(sqlContext, "Prediction").
  withColumnRenamed("C1", "prediction")
// -----------------------------------------
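
To make the label step concrete, here is a minimal plain-Scala sketch of what
a row-wise, one-based argmax over a probability matrix computes, plus the
parameterized-threshold variant discussed earlier in the thread. It is an
illustration only, not SystemML code: the object name, the function names,
and the 1/2 label coding for the binary case are assumptions, and rows are
assumed to be non-empty.

```scala
// Plain-Scala illustration only; this is not SystemML code.
object LabelFromProbabilities {
  // One-based index of the largest value in each row. Ties resolve to the
  // first maximum in this sketch. Rows must be non-empty.
  def rowIndexMax(prob: Seq[Seq[Double]]): Seq[Int] =
    prob.map(row => row.indexOf(row.max) + 1)

  // Binary case with a configurable threshold: label 1 (positive) when the
  // probability is strictly greater than the threshold, otherwise label 2.
  def thresholdLabels(positiveProb: Seq[Double], threshold: Double): Seq[Int] =
    positiveProb.map(p => if (p > threshold) 1 else 2)
}
```

With a threshold of 0.5 and two classes, both functions agree except when a
probability is exactly 0.5; moving the threshold away from 0.5 is what lets
you trade precision against recall.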


> 3. Say I get back the prediction matrix as an output (from predictions =
> rowIndexMax(means);). Can I now add that as a column to my original
> data frame (the one from which I created the feature vector for the
> original model)? My concern is whether adding it back will preserve the
> row order, so that the key for the feature vector and the predicted value
> stay matched. If not, how can I achieve that?

In the above example, 'pred' is a DataFrame with a column 'ID' which
provides the row ID.

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar



From:	Sourav Mazumder <so...@gmail.com>
To:	dev@systemml.incubator.apache.org, Niketan Pansare/Almaden/IBM@IBMUS
Date:	12/08/2015 10:53 PM
Subject:	Re: Using GLM-predict



Hi Niketan,

Thanks again for the detailed inputs.

Some more follow-up questions:

1. In GLM-predict.dml I could see that 'means' is the output variable. In my
understanding it is the same as the probability matrix you mentioned in
your mail (to be used to compute the prediction). Am I right?

2. From GLM.dml I get the 'betas' as output using
outputs.getBinaryBlockedRDD("beta_out"). I pass the same to GLM-predict.dml
as B. The following statements are used to register B:
val beta = outputs.getBinaryBlockedRDD("beta_out")
// I have four features, so I get 4 coefficients
ml.registerInput("B", beta, 1, 4)

However, when I execute GLM-predict.dml I get the following error.

val outputs =
ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
cmdLineParams)

15/12/09 05:32:47 WARN Expression: Metadata file:  .mtd not provided
15/12/09 05:32:47 ERROR Expression: ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd
com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd

Line 117 contains the following statement: X = read (fileX);

3. Say I get back the prediction matrix as an output (from predictions =
rowIndexMax(means);). Can I now add that as a column to my original
data frame (the one from which I created the feature vector for the
original model)? My concern is whether adding it back will preserve the
row order, so that the key for the feature vector and the predicted value
stay matched. If not, how can I achieve that?

Regards,
Sourav





On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <np...@us.ibm.com> wrote:

> Hi Sourav,
>
> For some reason, I didn't get your email on "*Tue, 08 Dec 2015 12:56:38
> -0800*
> <
https://www.mail-archive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208
> "
> (which I noticed in the archive).
>
> >> Not sure how exactly I can modify the GLM-predict.dml to get some
> prediction to start with.
> There are two options here:
> 1. Modify GLM-predict.dml as suggested by Shirish (better approach with
> respect to the SystemML optimizer) or
>
> 2. Run a new script on the output of GLM-predict. Please see:
>
https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163

> If you chose to go with option 2, you might also want to read the
> documentation of following two built-in functions:
> a. rowIndexMax (See
>
http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions

> <
http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions
>
> )
> b. ppred
>
> >> Can you give me some idea how from here I can calculate the predicted
> value of the label using some value of probability threshold ?
> Very simple way to predict the label given probability matrix:
> Prediction = rowIndexMax(Prob) # predicts the label with highest
> probability. This assumes one-based labels.
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> [image: Inactive hide details for Shirish Tatikonda ---12/08/2015
12:49:47
> PM---Hi Sourav, Yes, GLM-predict.dml gives out only the prob]Shirish
> Tatikonda ---12/08/2015 12:49:47 PM---Hi Sourav, Yes, GLM-predict.dml
gives
> out only the probabilities. You can put a
>
> From: Shirish Tatikonda <sh...@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/08/2015 12:49 PM
> Subject: Re: Using GLM-predict
> ------------------------------
>
>
>
> Hi Sourav,
>
> Yes, GLM-predict.dml gives out only the probabilities. You can put a
> threshold on the resulting probabilities to get the actual class labels
--
> for example, prob > 0.5 is positive and <=0.5 as negative.
>
> The exact value of threshold typically depends on the data and the
> application. Different thresholds yield different classifiers with
> different performance (precision, recall, etc.). You can find the best
> threshold for the given data set by finding a value that gives the
desired
> classifier performance (for example, a threshold that gives roughly equal
> precision and recall). Such an optimization is obviously done during the
> training phase using a held out test set.
>
> If you wish, you can also modify the DML script to perform this entire
> process.
>
> Shirish
>
>
> On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> sourav.mazumder00@gmail.com> wrote:
>
> > Hi,
> >
> > I have used GLM.dml to create a model using some sample data. It
returns
> to
> > me the matrix of Beta, B.
> >
> > Now I want to use this matrix of Beta on a new set of data points and
> > generate predicted value of the dependent variable/observation.
> >
> > When I checked GLM-predict, I could see that one can pass feature
vector
> > for the new data set and also the matrix of beta.
> >
> > But I could not see any way to get the predicted value of the dependent
> > variable/observation. The output parameter only supports matrix of
> > predicted means/probabilities.
> >
> > Is there a way one can get the predicted value of the dependent
> > variable/observation from GLM-predict ?
> >
> > Regards,
> > Sourav
> >
>
>
>

Re: Using GLM-predict

Posted by Sourav Mazumder <so...@gmail.com>.

Hi Niketan,

Thanks again for the detailed inputs.

Some more follow-up questions:

1. In GLM-predict.dml I can see that 'means' is the output variable. As I
understand it, this is the same as the probability matrix you mentioned in
your mail (to be used to compute the prediction). Am I right?

2. From GLM.dml I get the betas as output using
outputs.getBinaryBlockedRDD("beta_out"). I pass the same to GLM-predict.dml
as B. To register B, I use the following statements:
val beta = outputs.getBinaryBlockedRDD("beta_out")
// I have four feature vectors so I get 4 coefficients
ml.registerInput("B", beta, 1, 4)

However, when I execute GLM-predict.dml I get the following error.

val outputs =
ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
cmdLineParams)

15/12/09 05:32:47 WARN Expression: Metadata file:  .mtd not provided
15/12/09 05:32:47 ERROR Expression: ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd
com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd

Line 117 contains the following statement: X = read (fileX);

3. Say I get the prediction matrix back as an output (from predictions =
rowIndexMax(means);). Can I then add it as a column to my original data
frame (the one from which I created the feature vector for the original
model)? My concern is whether adding it back preserves the row order, so
that the key for the feature vector and the predicted value remain the
same. If not, how can I achieve that?

Regards,
Sourav





On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <np...@us.ibm.com> wrote:

> Hi Sourav,
>
> For some reason, I didn't get your email on "Tue, 08 Dec 2015 12:56:38
> -0800" (which I noticed in the archive).
>
> >> Not sure how exactly I can modify the GLM-predict.dml to get some
> prediction to start with.
> There are two options here:
> 1. Modify GLM-predict.dml as suggested by Shirish (better approach with
> respect to the SystemML optimizer) or
>
> 2. Run a new script on the output of GLM-predict. Please see:
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
> If you chose to go with option 2, you might also want to read the
> documentation of following two built-in functions:
> a. rowIndexMax (See
> http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions
> )
> b. ppred
>
> >> Can you give me some idea how from here I can calculate the predicted
> value of the label using some value of probability threshold ?
> Very simple way to predict the label given probability matrix:
> Prediction = rowIndexMax(Prob) # predicts the label with highest
> probability. This assumes one-based labels.
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> From: Shirish Tatikonda <sh...@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/08/2015 12:49 PM
> Subject: Re: Using GLM-predict
> ------------------------------
>
>
>
> Hi Sourav,
>
> Yes, GLM-predict.dml gives out only the probabilities. You can put a
> threshold on the resulting probabilities to get the actual class labels --
> for example, prob > 0.5 is positive and <=0.5 as negative.
>
> The exact value of threshold typically depends on the data and the
> application. Different thresholds yield different classifiers with
> different performance (precision, recall, etc.). You can find the best
> threshold for the given data set by finding a value that gives the desired
> classifier performance (for example, a threshold that gives roughly equal
> precision and recall). Such an optimization is obviously done during the
> training phase using a held out test set.
>
> If you wish, you can also modify the DML script to perform this entire
> process.
>
> Shirish
>
>
> On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> sourav.mazumder00@gmail.com> wrote:
>
> > Hi,
> >
> > I have used GLM.dml to create a model using some sample data. It returns
> > to me the matrix of Beta, B.
> >
> > Now I want to use this matrix of Beta on a new set of data points and
> > generate predicted value of the dependent variable/observation.
> >
> > When I checked GLM-predict, I could see that one can pass feature vector
> > for the new data set and also the matrix of beta.
> >
> > But I could not see any way to get the predicted value of the dependent
> > variable/observation. The output parameter only supports matrix of
> > predicted means/probabilities.
> >
> > Is there a way one can get the predicted value of the dependent
> > variable/observation from GLM-predict ?
> >
> > Regards,
> > Sourav
> >
>
>
>

Re: Using GLM-predict

Posted by Niketan Pansare <np...@us.ibm.com>.

Hi Sourav,

For some reason, I didn't get your email on "Tue, 08 Dec 2015 12:56:38
-0800" (which I noticed in the archive).

>> Not sure how exactly I can modify the GLM-predict.dml to get some
prediction to start with.
There are two options here:
1. Modify GLM-predict.dml as suggested by Shirish (better approach with
respect to the SystemML optimizer) or

2. Run a new script on the output of GLM-predict. Please see:
https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
If you choose to go with option 2, you might also want to read the
documentation of the following two built-in functions:
a. rowIndexMax (See
http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions
)
b. ppred

>> Can you give me some idea how from here I can calculate the predicted
value of the label using some value of probability threshold ?
A very simple way to predict the label given the probability matrix:
Prediction = rowIndexMax(Prob)  # predicts the label with the highest probability
This assumes one-based labels.
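
For readers unfamiliar with the built-in, the following is a minimal sketch in plain Scala of what the rowIndexMax step computes; it is not SystemML code. The object and method names are hypothetical, and the sketch assumes one row per observation and one column per class. On ties it picks the first maximum, which is a choice made here for illustration and not a statement about how the DML built-in breaks ties.

```scala
object RowIndexMaxSketch {
  // For each row of class probabilities, return the one-based column index
  // of the largest entry. Ties resolve to the first maximum (an assumption
  // of this sketch only).
  def rowIndexMax(prob: Seq[Seq[Double]]): Seq[Int] =
    prob.map(row => row.indexOf(row.max) + 1)

  def main(args: Array[String]): Unit = {
    val prob = Seq(
      Seq(0.7, 0.2, 0.1), // class 1 is most likely
      Seq(0.1, 0.3, 0.6)  // class 3 is most likely
    )
    println(rowIndexMax(prob))
  }
}
```

The "+ 1" is what makes the result line up with one-based labels.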

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar

From:	Shirish Tatikonda <sh...@gmail.com>
To:	dev@systemml.incubator.apache.org
Date:	12/08/2015 12:49 PM
Subject:	Re: Using GLM-predict

Hi Sourav,

Yes, GLM-predict.dml gives out only the probabilities. You can put a
threshold on the resulting probabilities to get the actual class labels --
for example, prob > 0.5 is positive and <=0.5 as negative.

The exact value of threshold typically depends on the data and the
application. Different thresholds yield different classifiers with
different performance (precision, recall, etc.). You can find the best
threshold for the given data set by finding a value that gives the desired
classifier performance (for example, a threshold that gives roughly equal
precision and recall). Such an optimization is obviously done during the
training phase using a held out test set.

If you wish, you can also modify the DML script to perform this entire
process.

Shirish

On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
sourav.mazumder00@gmail.com> wrote:

> Hi,
>
> I have used GLM.dml to create a model using some sample data. It returns
> to me the matrix of Beta, B.
>
> Now I want to use this matrix of Beta on a new set of data points and
> generate predicted value of the dependent variable/observation.
>
> When I checked GLM-predict, I could see that one can pass feature vector
> for the new data set and also the matrix of beta.
>
> But I could not see any way to get the predicted value of the dependent
> variable/observation. The output parameter only supports matrix of
> predicted means/probabilities.
>
> Is there a way one can get the predicted value of the dependent
> variable/observation from GLM-predict ?
>
> Regards,
> Sourav
>

Re: Using GLM-predict

Posted by Shirish Tatikonda <sh...@gmail.com>.

Hi Sourav,

Yes, GLM-predict.dml gives out only the probabilities. You can put a
threshold on the resulting probabilities to get the actual class labels --
for example, prob > 0.5 is positive and prob <= 0.5 is negative.

The exact value of threshold typically depends on the data and the
application. Different thresholds yield different classifiers with
different performance (precision, recall, etc.). You can find the best
threshold for the given data set by finding a value that gives the desired
classifier performance (for example, a threshold that gives roughly equal
precision and recall). Such an optimization is done during the training
phase using a held-out data set.
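
The thresholding and threshold selection described above can be sketched in plain Scala as follows. This is a minimal illustration under stated assumptions, not part of GLM-predict.dml: all names are hypothetical, labels are encoded as 0/1, the input is the positive-class probability per row, and the candidate thresholds are supplied by the caller. The selection rule used here (smallest gap between precision and recall) is only the "roughly equal precision and recall" example from the text.

```scala
object ThresholdSketch {
  // prob > threshold is positive (1); otherwise negative (0).
  def classify(probs: Seq[Double], threshold: Double): Seq[Int] =
    probs.map(p => if (p > threshold) 1 else 0)

  // Precision and recall of predicted 0/1 labels against true 0/1 labels.
  // Both are reported as 0.0 when their denominator is zero.
  def precisionRecall(pred: Seq[Int], truth: Seq[Int]): (Double, Double) = {
    val pairs = pred.zip(truth)
    val tp = pairs.count { case (p, t) => p == 1 && t == 1 }
    val fp = pairs.count { case (p, t) => p == 1 && t == 0 }
    val fn = pairs.count { case (p, t) => p == 0 && t == 1 }
    val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    val recall = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    (precision, recall)
  }

  // Among the candidates, pick the threshold whose precision and recall
  // are closest to each other on the held-out data.
  def balancedThreshold(probs: Seq[Double], truth: Seq[Int],
                        candidates: Seq[Double]): Double =
    candidates.minBy { t =>
      val (p, r) = precisionRecall(classify(probs, t), truth)
      math.abs(p - r)
    }

  def main(args: Array[String]): Unit = {
    val probs = Seq(0.9, 0.8, 0.6, 0.4, 0.3, 0.1) // held-out probabilities
    val truth = Seq(1, 1, 0, 1, 0, 0)             // held-out true labels
    println(classify(probs, 0.5))
    println(balancedThreshold(probs, truth, Seq(0.2, 0.35, 0.5, 0.7)))
  }
}
```

Note that a threshold so high that nothing is predicted positive also gives equal (zero) precision and recall in this sketch, so in practice the candidate list or the selection rule would need to exclude that degenerate case.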

If you wish, you can also modify the DML script to perform this entire
process.

Shirish

On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
sourav.mazumder00@gmail.com> wrote:

> Hi,
>
> I have used GLM.dml to create a model using some sample data. It returns to
> me the matrix of Beta, B.
>
> Now I want to use this matrix of Beta on a new set of data points and
> generate predicted value of the dependent variable/observation.
>
> When I checked GLM-predict, I could see that one can pass feature vector
> for the new data set and also the matrix of beta.
>
> But I could not see any way to get the predicted value of the dependent
> variable/observation. The output parameter only supports matrix of
> predicted means/probabilities.
>
> Is there a way one can get the predicted value of the dependent
> variable/observation from GLM-predict ?
>
> Regards,
> Sourav
>