Posted to issues@spark.apache.org by "Yanbo Liang (JIRA)" <ji...@apache.org> on 2016/01/27 12:10:39 UTC

[jira] [Comment Edited] (SPARK-13010) Survival analysis in SparkR

    [ https://issues.apache.org/jira/browse/SPARK-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119054#comment-15119054 ] 

Yanbo Liang edited comment on SPARK-13010 at 1/27/16 11:09 AM:
---------------------------------------------------------------

There are two issues that we should discuss:
1. Should we support AFTSurvivalRegression under the SparkR::glm interface?
I vote no. We can add a new function named "survreg" (R has a function of the same name). Like SparkR::glm, "survreg" would return a PipelineModel that can be used for prediction via SparkR::predict.
We should first reorganize SparkRWrappers so that it supports more models, although this is simple.
2. The response variable of the R formula should be a pair for survival analysis.
Take these R survival analysis calls as examples:
{code}
survreg(Surv(futime, fustat) ~ ecog.ps + rx, ovarian, dist="exponential")
survfit(coxph(Surv(time,censor)~1), type="aalen")
{code}
Here Surv wraps the pair corresponding to "labelCol" and "censorCol" into the response variable of the R formula.
So the first step is to make RFormula support a pair as the label.
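To make the "pair as response" point concrete, here is a small sketch in plain R (not SparkR), assuming the survival package and its bundled ovarian data set are available. It only illustrates that Surv stores time and status together as a two-column object; it says nothing about how RFormula would represent the pair.

```r
# Plain R illustration, not SparkR: Surv() bundles the time and status columns.
library(survival)

# Build the response on its own to inspect it.
y <- Surv(ovarian$futime, ovarian$fustat)
dims <- dim(y)          # one row per observation, two columns
cols <- colnames(y)     # column names of the underlying matrix

# The same pair used directly as the left-hand side of the formula.
fit <- survreg(Surv(futime, fustat) ~ ecog.ps + rx, data = ovarian,
               dist = "exponential")
print(dims)
print(cols)
print(coef(fit))
```

In the Spark ML estimator these two pieces are given separately as the label and censor columns, which is why the formula label would need to carry both.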
One possible way is to support "cbind" in SparkR: it would return a Scala Tuple2/Vector column, and the label of RFormula would then need to support the Tuple2/Vector type.
GLM with the binomial family could also benefit from this feature. But we should also consider whether "cbind" conflicts with other SparkR functions, and we need to keep its semantics consistent.
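For reference, this is how a "cbind" response looks for a binomial GLM in base R. It is a sketch with made-up data (the data frame and its column names are hypothetical) and it runs in plain R only; "cbind" support in SparkR is the proposal here, not something this example assumes exists.

```r
# Plain R illustration with hypothetical data: a two-column response
# (successes, failures) passed to a binomial GLM through cbind().
df <- data.frame(
  dose      = c(1, 2, 3, 4, 5),
  successes = c(1, 3, 5, 8, 9),
  failures  = c(9, 7, 5, 2, 1)
)

# cbind() yields a two-column matrix that glm() accepts as the response.
resp <- cbind(df$successes, df$failures)

fit_binom <- glm(cbind(successes, failures) ~ dose,
                 family = binomial, data = df)
print(dim(resp))
print(coef(fit_binom))
```

A SparkR "cbind" with the same meaning would let both this case and the Surv-style pair share one label representation.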

Looking forward to hearing your thoughts. [~mengxr]


> Survival analysis in SparkR
> ---------------------------
>
>                 Key: SPARK-13010
>                 URL: https://issues.apache.org/jira/browse/SPARK-13010
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, SparkR
>            Reporter: Xiangrui Meng
>            Assignee: Yanbo Liang
>
> Implement a simple wrapper of AFTSurvivalRegression in SparkR to support survival analysis.


