Posted to dev@spark.apache.org by Holden Karau <ho...@pigscanfly.ca> on 2017/11/18 22:55:16 UTC

Thoughts on extending ML exporting in Spark?

Hi folks,

I've been giving a bit of thought to trying to improve ML exporting in
Spark to support a wider variety of formats. If you implement pipeline
stages, or you've added your own export logic, I'd especially love your
input.

A quick little draft of what I've been thinking about (after jumping back
into my ancient PR #9207) is as follows:

# Background

The current Spark ML writer only supports a Spark "internal" format. This
is less than ideal, since the older Spark MLlib API already supports PMML,
and more formats exist beyond that. The goal of this design document is to
allow more general support for saving Spark ML pipeline stages and models.

Additionally, Spark ML has a growing ecosystem of pipeline stages outside
of core Spark, so any design should be usable by 3rd-party pipeline stages.

# Design sketch

Spark's DataFrameWriter interface provides a starting point for this
design. When writing, the user will be able to specify a path, general
options passed to the format, and, importantly, the format itself.
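
To make this concrete, the user-facing call might look something like the
following (purely a sketch: none of these method names are final, and the
"pmml" format and compression option are just illustrative):

    // Hypothetical usage, by analogy with DataFrameWriter; not a real API yet.
    val model: PipelineModel = pipeline.fit(training)
    model.write
      .format("pmml")                   // which export format to use
      .option("compression", "gzip")    // free-form options handed to the format
      .save("/tmp/my-pipeline-model")   // where to write it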

Format discovery will be accomplished in a similar manner to Spark
Datasources (via Java's ServiceLoader); however, since individual model
providers may wish to implement their own version of a Spark-supported
format, the writer will be looked up by "formatname+pipelinestageclassname".
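
As a strawman (every name here is a placeholder, not settled API), the
lookup could work roughly like this:

    import java.util.ServiceLoader
    import scala.collection.JavaConverters._

    import org.apache.spark.ml.PipelineStage

    // Hypothetical SPI: implementations are listed under META-INF/services
    // in their jar and discovered with java.util.ServiceLoader, the same
    // way Datasource implementations are found.
    trait MLFormatRegister {
      def format(): String       // e.g. "pmml"
      def stageName(): String    // fully qualified stage class it can write
      def write(path: String, stage: PipelineStage,
                options: Map[String, String]): Unit
    }

    object MLWriterRegistry {
      // Key on "formatname+pipelinestageclassname" so an individual model
      // provider can ship its own take on a Spark-supported format.
      def lookup(format: String, stageClass: String): MLFormatRegister =
        ServiceLoader.load(classOf[MLFormatRegister]).asScala
          .find(w => w.format() == format && w.stageName() == stageClass)
          .getOrElse(throw new IllegalArgumentException(
            s"No writer registered for $format+$stageClass"))
    }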

This has the downside of making the code not as easy to trace through as
the current structure, but it opens up the possibility of letting folks
provide model export in additional formats that aren't supported inside
the model itself.

# Migration path

External pipeline stages may already implement the current MLWriter. To
allow these to continue to work, a GeneralMLWriter will be created as a
parent class of the current MLWriter, and it will handle delegation to
other formats as described above.
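
In rough shape (again just a sketch, with hypothetical names):

    // Hypothetical: GeneralMLWriter becomes the parent of the existing
    // MLWriter, so current subclasses keep compiling unchanged.
    abstract class GeneralMLWriter {
      protected var source: String = "internal"    // Spark's current format
      protected var options = Map.empty[String, String]

      def format(fmt: String): this.type = { source = fmt; this }
      def option(key: String, value: String): this.type = {
        options += (key -> value); this
      }
      def save(path: String): Unit
    }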

For existing stages, the MLWriter's save function will be changed to check
whether its input format is the default and, if so, delegate to the current
saveImpl.
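
Something along these lines, where saveImpl is the stage's existing write
logic, stage stands for the pipeline stage being written, and
MLWriterRegistry is the hypothetical lookup sketched above:

    def save(path: String): Unit = {
      if (source == "internal") {
        saveImpl(path)    // default format: current behavior, untouched
      } else {
        // any other format is dispatched to a discovered writer
        MLWriterRegistry.lookup(source, stage.getClass.getName)
          .write(path, stage, options)
      }
    }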

We would then deprecate MLWriter in the next version and remove it in Spark 3.

Does this sound reasonable to folks? It would allow us to add PMML support
in Spark ML pipelines and open it up for other folks to fill in the gaps or
add other custom formats.

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau

Re: Thoughts on extending ML exporting in Spark?

Posted by Holden Karau <ho...@pigscanfly.ca>.
Right, so I'm mostly suggesting a new API so that people can add things
like PFA. Initially in Spark we'd have built-in support for the current
Spark custom format, plus PMML support for the models that already have it
in MLlib. But ideally this API would allow folks looking at other formats
to implement them themselves (as with DataFrame sources).
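
For example, someone could plug in a PFA exporter for a single model type
along these lines (reusing the strawman MLFormatRegister trait from my
first mail; everything here is hypothetical):

    // Hypothetical third-party PFA writer for one stage type, registered
    // under META-INF/services in its jar so ServiceLoader can find it.
    class PFALogisticRegressionWriter extends MLFormatRegister {
      def format(): String = "pfa"
      def stageName(): String =
        "org.apache.spark.ml.classification.LogisticRegressionModel"
      def write(path: String, stage: PipelineStage,
                options: Map[String, String]): Unit = {
        // translate the model's coefficients into a PFA document and
        // write it out under `path`
      }
    }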

For what it’s worth the current PMML format isn’t really supported in the
ML pipelines, so this could be viewed as feature parity work we do before
we get rid of Spark MLlib.

--
Twitter: https://twitter.com/holdenkarau

Re: Thoughts on extending ML exporting in Spark?

Posted by Timur Shenkao <ts...@timshenkao.su>.
Hello guys,

Have you considered PFA? http://dmg.org/pfa/docs/document_structure/

As Sean noted, "there are already 1.5 supported formats", and PMML is quite
rigid.

There are, at least, 2 implementations of PFA:
*Scala* Hadrian: https://github.com/opendatagroup/hadrian
*Python* Titus: https://github.com/opendatagroup/hadrian
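
To give a flavour of the format: a PFA document is plain JSON that declares
its input and output types plus an action. A minimal example (this one just
adds 1 to its input, following the document-structure page above):

    {"input": "double",
     "output": "double",
     "action": [
       {"+": ["input", 1]}
     ]}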

Tim


Re: Thoughts on extending ML exporting in Spark?

Posted by Sean Owen <so...@cloudera.com>.
To paraphrase: you are mostly suggesting a new API for reading/writing
models, not a new serialization format? And the API should be more like the
other DataFrame writer APIs, and more extensible?

That's better than introducing any new format for sure, as there are
already 1.5 supported formats -- the native one and partial PMML support.
It would also be great to somehow unify those.

The only concern, I guess, is that it introduces a third API on top of the
existing 2, and so the others need to go away in due course for this to
make sense -- but yeah, that makes sense in Spark 3.
