Posted to user@spark.apache.org by Andrew Redd <an...@gmail.com> on 2019/10/28 20:00:51 UTC

Fwd: Recover RFormula Column Names

Hi All!

I'm performing an econometric analysis over several billion rows of data
and would like to use the PySpark (Spark ML) implementation of linear
regression. In the example below I'm trying to interact hour-of-day and
month-of-year indicators. The StringIndexer documentation tells you what
it's doing when it's one-hot encoding string/factor columns (i.e. taking
out the most/least common value, or the first/last when sorted
alphabetically), but it doesn't let you recover your coefficient names.
This feels like such a general case that I must be missing something. How
can I get my column names back after the regression so I can map them to
coefficient values? Do I need to basically rebuild the RFormula logic
myself if this isn't already implemented? I would be happy to use a
different Spark language (Scala, Java, etc.) if it is implemented there.

Thanks in advance

Andrew

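import pandas as pd
from pyspark.ml.feature import RFormula
from pyspark.ml.regression import LinearRegression


# Hypothetical wrapper name; the original post showed only the function body.
def fit_regression(regression_input):
    # Encode the formula terms (including the interaction) into a feature vector.
    rform = RFormula(
        formula="log_outcome ~ log_treatment + hour_of_day + month_of_year"
                " + hour_of_day:month_of_year + additional_column",
        featuresCol="features",
        labelCol="label")
    rform_regression_input = rform.fit(regression_input).transform(regression_input)

    lr = LinearRegression(featuresCol="features",
                          labelCol="label",
                          solver="normal")
    lr_model = lr.fit(rform_regression_input)

    # The summary statistics list the intercept last, so append it to match.
    coefs = [*lr_model.coefficients, lr_model.intercept]

    return pd.DataFrame(
        {"pvalues": lr_model.summary.pValues,
         "tvalues": lr_model.summary.tValues,
         "std_errs": lr_model.summary.coefficientStandardErrors,
         "coefs": coefs}
    )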

Re: Recover RFormula Column Names

Posted by Alessandro Solimando <al...@gmail.com>.

Glad to hear that, Andrew.

While looking for the aforementioned SO answer I stumbled upon a similar
one <https://stackoverflow.com/a/48624023/898154> for PySpark. It works and,
being in Python, it also spares you the "reflection" part.
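
A minimal sketch of what the PySpark route can look like, assuming the
features column was built by RFormula and that its metadata has the usual
"ml_attr" / "attrs" layout. The helper name feature_names_from_metadata and
the example metadata are made up for illustration; they are not part of
Spark.

```python
from itertools import chain


def feature_names_from_metadata(field_metadata):
    """Return feature names ordered by their index in the feature vector.

    `field_metadata` is assumed to be the plain dict that PySpark exposes
    as df.schema["features"].metadata for an RFormula output column.
    """
    # "attrs" groups the attributes by type, e.g. {"numeric": [...], "binary": [...]}
    attrs = field_metadata["ml_attr"]["attrs"]
    pairs = sorted((a["idx"], a["name"]) for a in chain.from_iterable(attrs.values()))
    return [name for _, name in pairs]


# Hand-written stand-in for real metadata; the names are illustrative only.
example_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "log_treatment"}],
            "binary": [
                {"idx": 1, "name": "hour_of_day_0"},
                {"idx": 2, "name": "hour_of_day_0:month_of_year_1"},
            ],
        },
        "num_attrs": 3,
    }
}
print(feature_names_from_metadata(example_metadata))
```

With a real DataFrame the argument would be something like
rform_regression_input.schema["features"].metadata, and the resulting list
lines up position by position with lr_model.coefficients.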

If you happen to try RWrapperUtils, it would be great to hear your feedback!

Best regards,
Alessandro

On Tue, 29 Oct 2019 at 13:49, Andrew Redd <an...@gmail.com> wrote:

> Thanks Alessandro!
>
> That did the trick. All of the indices and interactions are in the
> metadata. I also wanted to confirm that this solution works in PySpark, as
> the metadata is carried over.
>
> Andrew

Re: Recover RFormula Column Names

Posted by Andrew Redd <an...@gmail.com>.

Thanks Alessandro!

That did the trick. All of the indices and interactions are in the
metadata. I also wanted to confirm that this solution works in PySpark, as
the metadata is carried over.
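
For anyone finding this later, a rough sketch of how the recovered names can
be lined up with the regression output. It assumes the names are already
ordered by vector index and that the intercept comes last, as in the coefs
list from my original code. The helper name label_coefficients, the
"(Intercept)" label and the numbers are made up for illustration.

```python
def label_coefficients(feature_names, coefficients, intercept):
    """Pair each coefficient with its feature name; the intercept goes last."""
    if len(feature_names) != len(coefficients):
        raise ValueError("expected one feature name per coefficient")
    names = [*feature_names, "(Intercept)"]
    values = [*coefficients, intercept]
    return list(zip(names, values))


# Illustrative values only, not results from a real model.
rows = label_coefficients(
    ["log_treatment", "hour_of_day_0"],
    [0.5, -0.25],
    1.0,
)
for name, value in rows:
    print(f"{name}: {value}")
```

The length check is there because a mismatch between names and coefficients
would otherwise be silently truncated by zip.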

Andrew

On Tue, Oct 29, 2019 at 5:26 AM Alessandro Solimando <
alessandro.solimando@gmail.com> wrote:

> Hello Andrew,
> a few years ago I had the same need, and I found this SO answer
> <https://stackoverflow.com/a/36306784/898154> to be the way to go.

Re: Recover RFormula Column Names

Posted by Alessandro Solimando <al...@gmail.com>.

Hello Andrew,
a few years ago I had the same need, and I found this SO answer
<https://stackoverflow.com/a/36306784/898154> to be the way to go.

Here is an extract of my (Scala) code, which was doing other things on top.
I have removed the irrelevant parts but have not tested the result, so it
might not work out of the box; nonetheless, it should help you get started:

    import org.apache.spark.sql.DataFrame

    private def getEncodedVectorLookupTable(df: DataFrame,
                                            featuresColName: String): Map[Long, String] = {
      // ML attribute metadata attached to the features column
      val meta = df.select(featuresColName)
        .schema.fields.head.metadata
        .getMetadata("ml_attr")
        .getMetadata("attrs")

      /* REFLECTION START */
      // The keys live in a non-public "map" field of Metadata, hence the reflection
      val field = meta.getClass.getDeclaredField("map")
      field.setAccessible(true)
      val keys = field.get(meta).asInstanceOf[Map[String, Any]].keySet
      field.setAccessible(false)
      /* REFLECTION END */

      // One entry per encoded feature: vector index -> feature name
      keys.flatMap(
        meta.getMetadataArray(_)
          .map(m => m.getLong("idx") -> m.getString("name"))
      ).toMap
    }


It looks like there is some support now for achieving this, but I have
never tried it:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/r/RWrapperUtils.html

Best regards,
Alessandro
