Posted to dev@spark.apache.org by Georg Heiler <ge...@gmail.com> on 2016/11/16 14:29:02 UTC

Develop custom Estimator / Transformer for pipeline

Hi,

I want to develop a library with custom Estimators / Transformers for Spark.
So far I have not found much documentation, but
http://stackoverflow.com/questions/37270446/how-to-roll-a-custom-estimator-in-pyspark-mllib
suggests that:

Generally speaking, there is no documentation because as of Spark 1.6 /
2.0 most of the related API is not intended to be public. This should change
in Spark 2.1.0 (see SPARK-7146
<https://issues.apache.org/jira/browse/SPARK-7146>).

Where can I find documentation today?
Is it true that my library would need to reside in Spark's namespace,
similar to https://github.com/collectivemedia/spark-ext, to utilize all the
handy functionality?

Kind Regards,
Georg

Re: Develop custom Estimator / Transformer for pipeline

Posted by Georg Heiler <ge...@gmail.com>.
The Estimator should perform data cleaning tasks. This means some rows will
be dropped, some columns dropped, some columns added, and some values replaced
in existing columns. It should also store the mean or min of some numeric
columns as a NaN replacement.
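One way this kind of Estimator could be sketched (class and column handling here are illustrative, assuming the Spark 2.x ml API; a proper version would expose the column list as a Param): fit computes the per-column means on the training data, and the returned Model replays them with na.fill.

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{avg, col}
import org.apache.spark.sql.types.StructType

// Model produced by fit: replaces NaN/null in the given columns
// with the means computed on the training data.
class MeanFillModel(override val uid: String, val means: Map[String, Double])
    extends Model[MeanFillModel] {

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.toDF().na.fill(means)

  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): MeanFillModel =
    new MeanFillModel(uid, means)
}

// Estimator: computes the mean of each listed numeric column at fit time.
class MeanFiller(override val uid: String, cols: Seq[String])
    extends Estimator[MeanFillModel] {

  def this(cols: Seq[String]) = this(Identifiable.randomUID("meanFiller"), cols)

  override def fit(dataset: Dataset[_]): MeanFillModel = {
    val row = dataset.toDF().select(cols.map(c => avg(col(c))): _*).head()
    val means = cols.zipWithIndex.map { case (c, i) => c -> row.getDouble(i) }.toMap
    new MeanFillModel(uid, means)
  }

  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): MeanFiller = new MeanFiller(uid, cols)
}
```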

However,

override def transformSchema(schema: StructType): StructType = {
  schema.add(StructField("foo", IntegerType))
}

only supports adding fields. I am curious how I am supposed to handle this.
Should I create a new column for each affected column, drop the old one, and
rename afterward?
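For what it's worth, transformSchema just returns a StructType, so dropped or renamed columns can be expressed by filtering and rebuilding the field list rather than only calling add. A sketch (the column names "raw" and "foo" are illustrative; assumes Spark 2.x):

```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Rebuild the output schema instead of only appending to the input:
// drop the column being replaced, then add its successor.
def transformSchema(schema: StructType): StructType = {
  val kept = schema.fields.filterNot(_.name == "raw") // drop the old column
  StructType(kept :+ StructField("foo", IntegerType, nullable = false))
}
```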


Regards,

Georg

On Fri, Nov 18, 2016 at 7:39 AM Georg Heiler <ge...@gmail.com>
> wrote:
>
> Yes, that would be really great. Thanks a lot.
> Holden Karau <ho...@pigscanfly.ca> wrote on Fri, 18 Nov 2016 at 07:38:
>
> Hi Georg,
>
> So while the post isn't 100% finished if you would want to review a draft
> copy I can share a google doc with you. Would that be useful?
>
> Cheers,
>
> Holden :)

Re: Develop custom Estimator / Transformer for pipeline

Posted by Georg Heiler <ge...@gmail.com>.
Looking forward to the blog post.
Thanks for pointing me to some of the simpler classes.
Nick Pentreath <ni...@gmail.com> wrote on Fri, 18 Nov 2016 at 02:53:


Re: Develop custom Estimator / Transformer for pipeline

Posted by Nick Pentreath <ni...@gmail.com>.
@Holden look forward to the blog post - I think a user guide PR based on it
would also be super useful :)


Re: Develop custom Estimator / Transformer for pipeline

Posted by Holden Karau <ho...@gmail.com>.
I've been working on a blog post around this and hope to have it published
early next month 😀


Re: Develop custom Estimator / Transformer for pipeline

Posted by Joseph Bradley <jo...@databricks.com>.
Hi Georg,

It's true we need better documentation for this.  I'd recommend checking
out simple algorithms within Spark for examples:
ml.feature.Tokenizer
ml.regression.IsotonicRegression

You should not need to put your library in Spark's namespace.  The shared
Params in SPARK-7146 are not necessary to create a custom algorithm; they
are just niceties.

Though there aren't great docs yet, you should be able to follow existing
examples.  And I'd like to add more docs in the future!

Good luck,
Joseph
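
A minimal custom Transformer in the spirit of ml.feature.Tokenizer might look like this (the class name and the upper-casing logic are purely illustrative; assumes the Spark 2.x ml API, where UnaryTransformer supplies inputCol/outputCol handling and copy):

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// Illustrative transformer that upper-cases a string column,
// modeled on ml.feature.Tokenizer's use of UnaryTransformer.
class UpperCaser(override val uid: String)
    extends UnaryTransformer[String, String, UpperCaser] {

  def this() = this(Identifiable.randomUID("upperCaser"))

  // The element-wise function applied to the input column.
  override protected def createTransformFunc: String => String =
    _.toUpperCase

  // Type of the generated output column.
  override protected def outputDataType: DataType = StringType

  // Fail fast if the input column has the wrong type.
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType,
      s"Input type must be StringType but got $inputType")
}
```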
