Posted to user@spark.apache.org by Alexis Peña <al...@exalitica.com> on 2017/10/24 02:56:37 UTC
Zero Coefficient in logistic regression
Hi Guys,
We are fitting a Logistic model using the following code.
val Chisqselector = new ChiSqSelector().setNumTopFeatures(10).setFeaturesCol("VECTOR_1").setLabelCol("TARGET").setOutputCol("selectedFeatures")
val assembler = new VectorAssembler().setInputCols(Array("FEATURES", "selectedFeatures", "PROM_MESES_DIST", "RECENCIA", "TEMP_MIN", "TEMP_MAX", "PRECIPITACIONES")).setOutputCol("Union")
val lr = new LogisticRegression().setLabelCol("TARGET").setFeaturesCol("Union")
val pipeline = new Pipeline().setStages(Array(Chisqselector, assembler, lr))
Do you know why the coefficient estimates for the following features are zero? Is this produced by the ChiSqSelector or by the logistic model?
Thanks in advance!!
CODIGO      PARAMETRO        COEFICIENTES_MUESTREO_BALANCEADO
PROPIAS     CV_UM            0,276866756
PROPIAS     CV_U3M           -0,241851427
PROPIAS     CV_U6M           -0,568312819
PROPIAS     CV_U12M          0,134706601
PROPIAS     M_UM             5,47E-06
PROPIAS     M_U3M            -7,10E-06
PROPIAS     M_U6M            1,73E-05
PROPIAS     M_U12M           -5,41E-06
PROPIAS     CP_UM            -0,050750105
PROPIAS     CP_U3M           0,125483162
PROPIAS     CP_U6M           -0,353906788
PROPIAS     CP_U12M          0,159538155
PROPIAS     TUM              -0,020217902
PROPIAS     TU3M             0,002101906
PROPIAS     TU6M             -0,005481915
PROPIAS     TU12M            0,003443081
CRUZADAS    2303             0
CRUZADAS    3901             0
CRUZADAS    3905             0
CRUZADAS    3907             0
CRUZADAS    3909             0
CRUZADAS    4102             0
CRUZADAS    4307             0
CRUZADAS    4501             0
CRUZADAS    4907             0,247624087
CRUZADAS    5304             -0,161424508
LP          PROM_MESES_DIST  -0,680356554
PROPIAS     RECENCIA         -0,00289069
EXTERNAS    TEMP_MIN         0,006488683
EXTERNAS    TEMP_MAX         -0,013497441
EXTERNAS    PRECIPITACIONES  -0,007607086
INTERCEPTO                   2,401593191
Re: Zero Coefficient in logistic regression
Posted by Alexis Peña <al...@exalitica.com>.
Thanks. 8 of the 10 CRUZADAS coefficients are estimated as zero. The alpha and lambda parameters are left at their defaults (I think zero). The models in R and SAS were fitted with glm as binary logistic regressions.
Cheers
Re: Zero Coefficient in logistic regression
Posted by Simon Dirmeier <si...@web.de>.
So, all the coefficients are the same but for CRUZADAS? How are you
fitting the model in R (glm)? Can you try setting zero penalty for
alpha and lambda:
.setRegParam(0)
.setElasticNetParam(0)
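For context, a minimal sketch of where these two setters would go, assuming the `lr` estimator and the column names (`TARGET`, `Union`) from the original post. In Spark ML, `regParam` corresponds to lambda and `elasticNetParam` to alpha. Both default to 0.0, so setting them explicitly mainly rules out a penalty that was configured by accident.

```scala
import org.apache.spark.ml.classification.LogisticRegression

// Same estimator as in the original post, with the penalty switched off explicitly.
val lr = new LogisticRegression()
  .setLabelCol("TARGET")
  .setFeaturesCol("Union")
  .setRegParam(0.0)        // lambda: overall regularization strength
  .setElasticNetParam(0.0) // alpha: 0.0 = pure L2, 1.0 = pure L1

// Print the effective values to confirm what the fit will use.
println(s"regParam = ${lr.getRegParam}, elasticNetParam = ${lr.getElasticNetParam}")
```

If the zeros in CRUZADAS persist with both values at 0.0, regularization is unlikely to be what produces them.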
Cheers,
S
Re: Zero Coefficient in logistic regression
Posted by Alexis Peña <al...@exalitica.com>.
Thanks for your answer. The “Cruzadas” features are binary (0/1), so the chi-squared statistic should work with 2x2 tables.
I fitted the model in SAS and R, and in both the coefficients have estimates (not significant). Two features of this kind do have estimates:
CRUZADAS    4907    0,247624087
CRUZADAS    5304    -0,161424508
Thanks
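To make the 2x2 argument concrete, here is a small sketch in plain Scala (no Spark) of the Pearson chi-squared statistic for one binary feature against a binary target. The function name and the counts are hypothetical and only illustrate the formula; they are not taken from the data in this thread.

```scala
// Pearson chi-squared statistic for a 2x2 contingency table:
//               target = 0   target = 1
// feature = 0       a            b
// feature = 1       c            d
def chiSquared2x2(a: Long, b: Long, c: Long, d: Long): Double = {
  val n = (a + b + c + d).toDouble
  val margins = (a + b).toDouble * (c + d) * (a + c) * (b + d)
  // If a row or column total is zero (for example, a constant feature),
  // the statistic is undefined; returning 0.0 is a convention of this sketch.
  if (margins == 0.0) 0.0
  else {
    val diff = (a * d - b * c).toDouble
    n * diff * diff / margins
  }
}

// Hypothetical counts, for illustration only.
val stat = chiSquared2x2(10, 20, 30, 40)
println(s"chi-squared = $stat")
```

With these counts the statistic is about 0.794 on one degree of freedom, which would be weak evidence of dependence between the feature and the target.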
Re: Zero Coefficient in logistic regression
Posted by Weichen Xu <we...@databricks.com>.
Yes, the chi-squared statistic is only used for categorical features. It does not
look appropriate here.
Thanks!
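One way to see how Spark scores each feature is to run `ChiSquareTest` (available in `org.apache.spark.ml.stat` since Spark 2.2) directly on the vector column. The sketch below uses invented toy data with two binary features; the column names `VECTOR_1` and `TARGET` follow the original post, everything else is an assumption for illustration.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("chisq-check").getOrCreate()
import spark.implicits._

// Toy data, invented for illustration: a binary label and two binary (0/1) features.
val df = Seq(
  (0.0, Vectors.dense(0.0, 1.0)),
  (0.0, Vectors.dense(0.0, 0.0)),
  (1.0, Vectors.dense(1.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0)),
  (0.0, Vectors.dense(1.0, 1.0)),
  (1.0, Vectors.dense(0.0, 0.0))
).toDF("TARGET", "VECTOR_1")

// The result is a single row holding one p-value, one degrees-of-freedom
// entry and one statistic per feature of the vector column.
val result = ChiSquareTest.test(df, "VECTOR_1", "TARGET").head
val pValues = result.getAs[Vector]("pValues")
val degreesOfFreedom = result.getAs[Seq[Int]]("degreesOfFreedom")
val statistics = result.getAs[Vector]("statistics")

println(s"pValues = $pValues")
println(s"degreesOfFreedom = ${degreesOfFreedom.mkString("[", ", ", "]")}")
println(s"statistics = $statistics")
```

The test treats every distinct value of a feature as its own category, so a continuous column would produce a very large contingency table, which is why the statistic is normally reserved for categorical inputs.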
Re: Zero Coefficient in logistic regression
Posted by Simon Dirmeier <si...@web.de>.
Hey,
as far as I know, feature selection using a chi-squared statistic can
only be done on categorical features, not on possibly continuous
ones.
Furthermore, since your logistic model doesn't use any regularization,
you should be fine there. So I'd check the ChiSqSelector and possibly
replace it with another feature selection method.
There is however always the chance that your response does not depend on
your covariables, so you'd estimate a zero coefficient.
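Checking the selector could look roughly like the sketch below: fit the pipeline, then read the selected indices from the `ChiSqSelectorModel` stage and the coefficients from the `LogisticRegressionModel` stage. The toy data and the reduced set of columns are invented for illustration; only the column names and the stage order follow the original post.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{ChiSqSelector, ChiSqSelectorModel, VectorAssembler}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("inspect-pipeline").getOrCreate()
import spark.implicits._

// Toy data, invented for illustration; column names follow the original post.
val train = Seq(
  (Vectors.dense(0.0, 1.0, 0.0), 3.0, 0.0),
  (Vectors.dense(1.0, 0.0, 0.0), 1.0, 1.0),
  (Vectors.dense(0.0, 1.0, 1.0), 4.0, 0.0),
  (Vectors.dense(1.0, 1.0, 1.0), 2.0, 1.0),
  (Vectors.dense(1.0, 0.0, 1.0), 5.0, 0.0),
  (Vectors.dense(0.0, 0.0, 0.0), 2.0, 1.0)
).toDF("VECTOR_1", "RECENCIA", "TARGET")

val selector = new ChiSqSelector()
  .setNumTopFeatures(2)
  .setFeaturesCol("VECTOR_1").setLabelCol("TARGET").setOutputCol("selectedFeatures")
val assembler = new VectorAssembler()
  .setInputCols(Array("selectedFeatures", "RECENCIA")).setOutputCol("Union")
val lr = new LogisticRegression().setLabelCol("TARGET").setFeaturesCol("Union")

val model = new Pipeline().setStages(Array(selector, assembler, lr)).fit(train)

// Stage 0: which indices of VECTOR_1 survived the selection.
val selected = model.stages(0).asInstanceOf[ChiSqSelectorModel].selectedFeatures
// Stage 2: coefficients in the order of the assembled "Union" vector.
val lrModel = model.stages(2).asInstanceOf[LogisticRegressionModel]

println(s"selected indices of VECTOR_1: ${selected.mkString(", ")}")
println(s"coefficients: ${lrModel.coefficients}")
println(s"intercept: ${lrModel.intercept}")
```

A feature that the selector drops never reaches the assembled vector, so it has no coefficient at all. A coefficient that is reported as exactly zero therefore belongs to a feature that passed selection, which points at the logistic regression stage or at the data in that column rather than at the selector.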
Cheers,
Simon