Posted to user@spark.apache.org by Stuti Awasthi <st...@hcl.com> on 2016/02/12 08:03:16 UTC

mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

Hi All,
I wanted to try survival analysis on Spark 1.6. I was able to run the provided AFT example successfully. I then tried to train the model on the Ovarian data, a standard dataset that ships with the survival library in R.
Default column names: Futime,fustat,age,resid_ds,rx,ecog_ps

Here are the steps I followed:

* Loaded the data from CSV into a DataFrame:
    val ovarian_data = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // Use first line of all files as header
      .option("inferSchema", "true") // Automatically infer data types
      .load("Ovarian.csv").toDF("label", "censor", "age", "resid_ds", "rx", "ecog_ps")

* Used VectorAssembler to create features from "age", "resid_ds", "rx", "ecog_ps":
    val assembler = new VectorAssembler()
      .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
      .setOutputCol("features")

* Created a new DataFrame with only 3 columns:
    val training = finalDf.select("label", "censor", "features")

* Finally, passed it to AFT:
    val model = aft.fit(training)

I get the following error:
java.lang.AssertionError: assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.
       at scala.Predef$.assert(Predef.scala:179)
       at org.apache.spark.ml.regression.AFTAggregator.add(AFTSurvivalRegression.scala:480)
       at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:522)
       at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:521)
       at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
       at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
       at scala.collection.Iterator$class.foreach(Iterator.scala:727)

I printed the schema:
root
|-- label: double (nullable = true)
|-- censor: double (nullable = true)
|-- features: vector (nullable = true)

Sample rows from the training data look like this:
[59.0,1.0,[72.3315,2.0,1.0,1.0]]
[115.0,1.0,[74.4932,2.0,1.0,1.0]]
[156.0,1.0,[66.4658,2.0,1.0,2.0]]
[421.0,0.0,[53.3644,2.0,2.0,1.0]]
[431.0,1.0,[50.3397,2.0,1.0,1.0]]

I am unable to understand the error: if I use the same data and create the dense vectors by hand, as in the AFT sample example, the code works fine. However, I would like to read the data from a CSV file and then proceed.

Please suggest.

Thanks & Regards
Stuti Awasthi



::DISCLAIMER::
----------------------------------------------------------------------------------------------------------------------------------------------------

The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only.
E-mail transmission is not guaranteed to be secure or error-free as information could be intercepted, corrupted,
lost, destroyed, arrive late or incomplete, or may contain viruses in transmission. The e mail and its contents
(with or without referred errors) shall therefore not attach any liability on the originator or HCL or its affiliates.
Views or opinions, if any, presented in this email are solely those of the author and may not necessarily reflect the
views or opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying, disclosure, modification,
distribution and / or publication of this message without the prior written consent of authorized representative of
HCL is strictly prohibited. If you have received this email in error please delete it and notify the sender immediately.
Before opening any email and/or attachments, please check them for viruses and other defects.

----------------------------------------------------------------------------------------------------------------------------------------------------

RE: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

Posted by Stuti Awasthi <st...@hcl.com>.
Thanks a lot, Yanbo, this will really help. Since I was unaware of this, I suspected that my vectors were not being generated correctly. Thanks!

Thanks & Regards
Stuti Awasthi


Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

Posted by Yanbo Liang <yb...@gmail.com>.
Hi Stuti,

The features should be standardized before training the model. Currently
AFTSurvivalRegression does not support standardization. Here is a
workaround, and I will send a PR to fix this soon.

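    import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
    import org.apache.spark.ml.regression.AFTSurvivalRegression
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.DoubleType

    // Load the CSV and rename the columns.
    val ovarian = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // Use first line of all files as header
      .option("inferSchema", "true") // Automatically infer data types
      .load("......")
      .toDF("label", "censor", "age", "resid_ds", "rx", "ecog_ps")

    // Assemble the covariates into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
      .setOutputCol("features")

    // Cast censor and label to double and keep only the columns needed for training.
    val ovarian2 = assembler.transform(ovarian)
      .select(col("censor").cast(DoubleType), col("label").cast(DoubleType), col("features"))

    // Standardize the features before fitting.
    val standardScaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("standardized_features")
    val ssModel = standardScaler.fit(ovarian2)
    val ovarian3 = ssModel.transform(ovarian2)

    // Train on the standardized features.
    val aft = new AFTSurvivalRegression().setFeaturesCol("standardized_features")

    val model = aft.fit(ovarian3)

    // Divide by the per-feature standard deviation to map the coefficients
    // back to the original feature scale.
    val newCoefficients = model.coefficients.toArray.zip(ssModel.std.toArray).map { x =>
      x._1 / x._2
    }
    println(newCoefficients.toSeq.mkString(","))
    println(model.intercept)
    println(model.scale)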

Yanbo
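
A note on the division by ssModel.std in the workaround above: if the model is fit on x_j / s_j, a coefficient b_j on the scaled feature corresponds to b_j / s_j on the raw feature, because b_j * (x_j / s_j) = (b_j / s_j) * x_j. The intercept needs no adjustment as long as the scaler only divides by the standard deviation and does not subtract the mean, which is StandardScaler's default (withMean = false). The sketch below checks this arithmetic in plain Scala without Spark; the helper names, standard deviations, and coefficient values are hypothetical and chosen only for illustration.

```scala
object RescaleCoefficientsSketch {
  // Dot product of two equally sized arrays.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (ai, bi) => ai * bi }.sum

  // Map coefficients learned on std-scaled features back to the raw feature scale.
  def toRawScale(scaledCoefficients: Array[Double], std: Array[Double]): Array[Double] =
    scaledCoefficients.zip(std).map { case (c, s) => c / s }

  def main(args: Array[String]): Unit = {
    // One raw feature row taken from the sample data in the original post.
    val raw = Array(72.3315, 2.0, 1.0, 1.0)
    // Hypothetical per-feature standard deviations (not computed from the real data).
    val std = Array(10.0, 0.5, 0.5, 0.5)
    // Hypothetical coefficients fitted on the scaled features.
    val scaledCoefficients = Array(-0.8, 0.3, 0.2, 0.1)

    val scaledRow = raw.zip(std).map { case (x, s) => x / s }
    val marginOnScaled = dot(scaledCoefficients, scaledRow)
    val marginOnRaw = dot(toRawScale(scaledCoefficients, std), raw)

    // Both margins agree up to floating-point rounding.
    println(f"scaled: $marginOnScaled%.6f raw: $marginOnRaw%.6f")
  }
}
```

Because the linear predictor is identical on both scales, the rescaled coefficients can be used directly with unscaled feature vectors.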


Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

Posted by Yanbo Liang <yb...@gmail.com>.
Hi Stuti,

This is a bug in AFTSurvivalRegression: we did not handle "lossSum ==
infinity" properly.
I have opened https://issues.apache.org/jira/browse/SPARK-13322 to track this
issue and will send a PR.
Thanks for reporting this issue.

Yanbo
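
For readers wondering how the loss sum can become infinite at all: the negative log-likelihood of a Weibull AFT model contains a term exp(epsilon), with epsilon = (log(t) - margin) / sigma, and a Double overflows to Infinity once the exponent exceeds roughly 709.78. With unscaled features such as age (around 50 to 75 in the sample rows), a large trial coefficient can push epsilon past that threshold. The sketch below is plain Scala written from the textbook formula, not Spark's actual code; the function name aftLoss and the trial coefficient values are hypothetical.

```scala
object AftOverflowSketch {
  // Per-row negative log-likelihood of a Weibull AFT model (hypothetical helper):
  //   margin  = x . beta + intercept
  //   epsilon = (log(t) - margin) / sigma
  //   loss    = delta * log(sigma) + exp(epsilon) - delta * epsilon
  def aftLoss(t: Double, delta: Double, x: Array[Double],
              beta: Array[Double], intercept: Double, sigma: Double): Double = {
    val margin = x.zip(beta).map { case (xi, bi) => xi * bi }.sum + intercept
    val epsilon = (math.log(t) - margin) / sigma
    delta * math.log(sigma) + math.exp(epsilon) - delta * epsilon
  }

  def main(args: Array[String]): Unit = {
    // First sample row from the original post: label 59.0, censor 1.0, raw features.
    val x = Array(72.3315, 2.0, 1.0, 1.0)

    // Two made-up trial values for the age coefficient; all other parameters are fixed.
    val moderate = aftLoss(59.0, 1.0, x, Array(-1.0, 0.0, 0.0, 0.0), 0.0, 1.0)
    val extreme = aftLoss(59.0, 1.0, x, Array(-10.0, 0.0, 0.0, 0.0), 0.0, 1.0)

    println(s"loss with age coefficient -1:  $moderate")
    println(s"loss with age coefficient -10: $extreme (infinite: ${extreme.isInfinity})")
  }
}
```

With the age coefficient at -10 the exponent is about 727, so exp overflows and the row's loss is infinite. Standardizing the features keeps the margin in a much smaller range, which is consistent with the workaround posted later in this thread.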
